A high-performance, lock-free CUDA implementation of the Cuckoo Filter. This library is the companion code for the paper **"Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs"**.

Overview

This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.

Features

CUDA-accelerated batch insert, lookup, and delete operations
Configurable fingerprint size and bucket size
Multiple eviction policies (DFS, BFS)
Sorted insertion mode for improved memory coalescing
Multi-GPU support via gossip
Experimental IPC support for cross-process filter sharing
Header-only library design

Performance

Benchmarks at 95% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) with 16-bit fingerprints and equivalent space allocation for the Blocked Bloom Filter. The PCF runs on an Intel Xeon W9-3595X CPU (120 threads).

Cuckoo-GPU is compared against:

L2-Resident (4M items, ~8 MiB)

Comparison	Insert	Query	Delete	FPR
Cuckoo-GPU vs PCF	175× faster	351× faster	N/A	0.046% vs 0.011%
Cuckoo-GPU vs TCF	4× faster	35× faster	108× faster	0.046% vs 0.409%
Cuckoo-GPU vs GQF	378× faster	6× faster	258× faster	0.046% vs 0.001%
Cuckoo-GPU vs GBBF	0.35× (slower)	1.2× faster	N/A	0.046% vs 2.503%
Cuckoo-GPU vs BCHT	11.3× faster	40.9× faster	N/A	0.046% vs 0%

DRAM-Resident (268M items, ~512 MiB)

Comparison	Insert	Query	Delete	FPR
Cuckoo-GPU vs PCF	69× faster	143× faster	N/A	0.044% vs 0.010%
Cuckoo-GPU vs TCF	2.1× faster	9.9× faster	44.9× faster	0.044% vs 0.467%
Cuckoo-GPU vs GQF	10.1× faster	2.6× faster	3.6× faster	0.044% vs 0.001%
Cuckoo-GPU vs GBBF	0.7× (slower)	0.9× (slower)	N/A	0.044% vs 6.092%
Cuckoo-GPU vs BCHT	8.5× faster	15.9× faster	N/A	0.044% vs 0%

‍[!NOTE] A much more comprehensive evaluation, including additional systems and analyses, is presented in the accompanying thesis.

Requirements

CUDA Toolkit (>= 12.9)
C++20 compatible compiler
Meson build system (>= 1.3.0)

Building

meson setup build

meson compile -C build

Benchmarks and tests are built automatically when this is a standalone project, but skipped when it is used as a subproject. You can override this with Meson feature options:

Option	Behavior
`-Dbenchmarks=auto` (default)	Build benchmarks when standalone, skip when subproject
`-Dbenchmarks=enabled`	Always build benchmarks
`-Dbenchmarks=disabled`	Never build benchmarks
`-Dtests=auto` (default)	Build tests when standalone, skip when subproject
`-Dtests=enabled`	Always build tests
`-Dtests=disabled`	Never build tests

Examples:

# Standalone: build everything (default)
meson setup build
 
# Standalone: skip benchmarks and tests
meson setup build -Dbenchmarks=disabled -Dtests=disabled
 
# Subproject: force benchmarks and tests on
meson setup build -Dbenchmarks=enabled -Dtests=enabled

Usage

#include <cuckoogpu/CuckooFilter.cuh>
 
// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
using Config = cuckoogpu::Config<uint64_t, 16, 500, 256, 16>;
 
// Create a filter with the desired capacity
cuckoogpu::Filter<Config> filter(1 << 20);  // capacity for ~1M items
 
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
 
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
 
// Check membership
filter.containsMany(d_keys, numKeys, d_results);
 
// Delete keys
filter.deleteMany(d_keys, numKeys, d_results);

Configuration Options

The Config template accepts the following parameters:

Parameter	Description	Default
`T`	Key type	-
`bitsPerTag`	Fingerprint size in bits (8, 16, 32)	-
`maxEvictions`	Maximum eviction attempts before failure	500
`blockSize`	CUDA block size	256
`bucketSize`	Slots per bucket (must be power of 2)	16
`AltBucketPolicy`	Alternate bucket calculation policy	`XorAltBucketPolicy`
`evictionPolicy`	Eviction strategy (DFS or BFS)	`BFS`
`WordType`	Atomic type (uint32_t or uint64_t)	`uint64_t`

Multi-GPU Support

For workloads that exceed single GPU capacity:

#include <cuckoogpu/CuckooFilterMultiGPU.cuh>
 
cuckoogpu::FilterMultiGPU<Config> filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, numKeys, h_results);

Project Structure

include/cuckoogpu/   - Header files
  CuckooFilter.cuh           - Main filter implementation
  CuckooFilterMultiGPU.cuh   - Multi-GPU implementation
  CuckooFilterIPC.cuh        - IPC support
  bucket_policies.cuh        - Alternative bucket policies
  helpers.cuh                - Helper functions
src/                 - Example applications
benchmarks/          - benchmarks
tests/               - Unit tests
scripts/             - Scripts for running/plotting benchmarks