|
GPU-Accelerated Cuckoo Filter
|
A high-performance, lock-free CUDA implementation of the Cuckoo Filter. This library is the companion code for the paper **"Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs"**.
This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.

Benchmarks at 95% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) with 16-bit fingerprints and equivalent space allocation for the Blocked Bloom Filter. The PCF runs on an Intel Xeon W9-3595X CPU (120 threads).
Cuckoo-GPU is compared against:
| Comparison | Insert | Query | Delete | FPR |
|---|---|---|---|---|
| Cuckoo-GPU vs PCF | 175× faster | 351× faster | N/A | 0.046% vs 0.011% |
| Cuckoo-GPU vs TCF | 4× faster | 35× faster | 108× faster | 0.046% vs 0.409% |
| Cuckoo-GPU vs GQF | 378× faster | 6× faster | 258× faster | 0.046% vs 0.001% |
| Cuckoo-GPU vs GBBF | 0.35× (slower) | 1.2× faster | N/A | 0.046% vs 2.503% |
| Cuckoo-GPU vs BCHT | 11.3× faster | 40.9× faster | N/A | 0.046% vs 0% |
| Comparison | Insert | Query | Delete | FPR |
|---|---|---|---|---|
| Cuckoo-GPU vs PCF | 69× faster | 143× faster | N/A | 0.044% vs 0.010% |
| Cuckoo-GPU vs TCF | 2.1× faster | 9.9× faster | 44.9× faster | 0.044% vs 0.467% |
| Cuckoo-GPU vs GQF | 10.1× faster | 2.6× faster | 3.6× faster | 0.044% vs 0.001% |
| Cuckoo-GPU vs GBBF | 0.7× (slower) | 0.9× (slower) | N/A | 0.044% vs 6.092% |
| Cuckoo-GPU vs BCHT | 8.5× faster | 15.9× faster | N/A | 0.044% vs 0% |
[!NOTE] A much more comprehensive evaluation, including additional systems and analyses, is presented in the accompanying thesis.
Benchmarks and tests are built automatically when this is a standalone project, but skipped when it is used as a subproject. You can override this with Meson feature options:
| Option | Behavior |
|---|---|
-Dbenchmarks=auto (default) | Build benchmarks when standalone, skip when subproject |
-Dbenchmarks=enabled | Always build benchmarks |
-Dbenchmarks=disabled | Never build benchmarks |
-Dtests=auto (default) | Build tests when standalone, skip when subproject |
-Dtests=enabled | Always build tests |
-Dtests=disabled | Never build tests |
Examples:
The Config template accepts the following parameters:
| Parameter | Description | Default |
|---|---|---|
T | Key type | - |
bitsPerTag | Fingerprint size in bits (8, 16, 32) | - |
maxEvictions | Maximum eviction attempts before failure | 500 |
blockSize | CUDA block size | 256 |
bucketSize | Slots per bucket (must be power of 2) | 16 |
AltBucketPolicy | Alternate bucket calculation policy | XorAltBucketPolicy |
evictionPolicy | Eviction strategy (DFS or BFS) | BFS |
WordType | Atomic type (uint32_t or uint64_t) | uint64_t |
For workloads that exceed single GPU capacity: