GPU-Accelerated Cuckoo Filter
Loading...
Searching...
No Matches
GPU-Accelerated Cuckoo Filter

Documentation

A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter".

Overview

This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.

Features

  • CUDA-accelerated batch insert, lookup, and delete operations
  • Configurable fingerprint size and bucket size
  • Multiple eviction policies (DFS, BFS)
  • Sorted insertion mode for improved memory coalescing
  • Multi-GPU support via gossip
  • IPC support for cross-process filter sharing
  • Header-only library design

Performance

Benchmarks at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) with 16-bit fingerprints and equivalent space allocation for the Blocked Bloom Filter.

The GPU Cuckoo Filter is compared against:

L2-Resident (4M items, ~8 MiB)

Comparison Insert Query Delete FPR
GPU vs CPU Cuckoo 360× faster 973× faster N/A 0.041% vs 0.005%
Cuckoo vs TCF 6× faster 42× faster 100× faster 0.041% vs 0.305%
Cuckoo vs GQF 585× faster 6× faster 273× faster 0.041% vs 0.001%
Cuckoo vs Bloom 0.6× (slower) 1.4× faster N/A 0.041% vs 1.336%

DRAM-Resident (268M items, ~512 MiB)

Comparison Insert Query Delete FPR
GPU vs CPU Cuckoo 583× faster 1504× faster N/A 0.039% vs 0.004%
Cuckoo vs TCF 1.9× faster 11.3× faster 35.3× faster 0.039% vs 0.394%
Cuckoo vs GQF 9.6× faster 2.6× faster 3.8× faster 0.039% vs 0.001%
Cuckoo vs Bloom 0.7× (slower) 1.0× (equal) N/A 0.039% vs 4.083%

‍[!NOTE] A much more comprehensive evaluation, including additional systems and analyses, is presented in the accompanying thesis.

Requirements

  • CUDA Toolkit (>= 12.9)
  • C++20 compatible compiler
  • Meson build system (>= 1.3.0)

Building

meson setup build
meson compile -C build

Benchmarks and tests are built by default. To disable them:

meson setup build -DBUILD_BENCHMARKS=false -DBUILD_TESTS=false

Usage

// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
// Create a filter with the desired capacity
cuckoogpu::Filter<Config> filter(1 << 20); // capacity for ~1M items
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
// Check membership
filter.containsMany(d_keys, numKeys, d_results);
// Delete keys
filter.deleteMany(d_keys, numKeys, d_results);
A CUDA-accelerated Cuckoo Filter implementation.
Configuration structure for the Cuckoo Filter.

Configuration Options

The Config template accepts the following parameters:

Parameter Description Default
T Key type -
bitsPerTag Fingerprint size in bits (8, 16, 32) -
maxEvictions Maximum eviction attempts before failure 500
blockSize CUDA block size 256
bucketSize Slots per bucket (must be power of 2) 16
AltBucketPolicy Alternate bucket calculation policy XorAltBucketPolicy
evictionPolicy Eviction strategy (DFS or BFS) BFS
WordType Atomic type (uint32_t or uint64_t) uint64_t

Multi-GPU Support

For workloads that exceed single GPU capacity:

cuckoogpu::FilterMultiGPU<Config> filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, numKeys, h_results);
A multi-GPU implementation of the Cuckoo Filter.

Project Structure

include/cuckoogpu/ - Header files
CuckooFilter.cuh - Main filter implementation
CuckooFilterMultiGPU.cuh - Multi-GPU implementation
CuckooFilterIPC.cuh - IPC support
bucket_policies.cuh - Alternative bucket policies
helpers.cuh - Helper functions
src/ - Example applications
benchmark/ - benchmarks
tests/ - Unit tests
scripts/ - Scripts for running/plotting benchmarks