A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter".

Overview

This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.

Features

CUDA-accelerated batch insert, lookup, and delete operations
Configurable fingerprint size and bucket size
Multiple eviction policies (DFS, BFS)
Sorted insertion mode for improved memory coalescing
Multi-GPU support via gossip
IPC support for cross-process filter sharing
Header-only library design

Performance

Benchmarks at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) with 16-bit fingerprints and equivalent space allocation for the Blocked Bloom Filter.

The GPU Cuckoo Filter is compared against:

L2-Resident (4M items, ~8 MiB)

Comparison	Insert	Query	Delete	FPR
GPU vs CPU Cuckoo	360× faster	973× faster	N/A	0.041% vs 0.005%
Cuckoo vs TCF	6× faster	42× faster	100× faster	0.041% vs 0.305%
Cuckoo vs GQF	585× faster	6× faster	273× faster	0.041% vs 0.001%
Cuckoo vs Bloom	0.6× (slower)	1.4× faster	N/A	0.041% vs 1.336%

DRAM-Resident (268M items, ~512 MiB)

Comparison	Insert	Query	Delete	FPR
GPU vs CPU Cuckoo	583× faster	1504× faster	N/A	0.039% vs 0.004%
Cuckoo vs TCF	1.9× faster	11.3× faster	35.3× faster	0.039% vs 0.394%
Cuckoo vs GQF	9.6× faster	2.6× faster	3.8× faster	0.039% vs 0.001%
Cuckoo vs Bloom	0.7× (slower)	1.0× (equal)	N/A	0.039% vs 4.083%

‍[!NOTE] A much more comprehensive evaluation, including additional systems and analyses, is presented in the accompanying thesis.

Requirements

CUDA Toolkit (>= 12.9)
C++20 compatible compiler
Meson build system (>= 1.3.0)

Building

meson setup build

meson compile -C build

Benchmarks and tests are built by default. To disable them:

meson setup build -DBUILD_BENCHMARKS=false -DBUILD_TESTS=false

Usage

#include <cuckoogpu/CuckooFilter.cuh>
 
// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
using Config = cuckoogpu::Config<uint64_t, 16, 500, 256, 16>;
 
// Create a filter with the desired capacity
cuckoogpu::Filter<Config> filter(1 << 20);  // capacity for ~1M items
 
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
 
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
 
// Check membership
filter.containsMany(d_keys, numKeys, d_results);
 
// Delete keys
filter.deleteMany(d_keys, numKeys, d_results);

Configuration Options

The Config template accepts the following parameters:

Parameter	Description	Default
`T`	Key type	-
`bitsPerTag`	Fingerprint size in bits (8, 16, 32)	-
`maxEvictions`	Maximum eviction attempts before failure	500
`blockSize`	CUDA block size	256
`bucketSize`	Slots per bucket (must be power of 2)	16
`AltBucketPolicy`	Alternate bucket calculation policy	`XorAltBucketPolicy`
`evictionPolicy`	Eviction strategy (DFS or BFS)	`BFS`
`WordType`	Atomic type (uint32_t or uint64_t)	`uint64_t`

Multi-GPU Support

For workloads that exceed single GPU capacity:

#include <cuckoogpu/CuckooFilterMultiGPU.cuh>
 
cuckoogpu::FilterMultiGPU<Config> filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, numKeys, h_results);

Project Structure

include/cuckoogpu/   - Header files
  CuckooFilter.cuh           - Main filter implementation
  CuckooFilterMultiGPU.cuh   - Multi-GPU implementation
  CuckooFilterIPC.cuh        - IPC support
  bucket_policies.cuh        - Alternative bucket policies
  helpers.cuh                - Helper functions
src/                 - Example applications
benchmark/           - benchmarks
tests/               - Unit tests
scripts/             - Scripts for running/plotting benchmarks