GPU-Accelerated Cuckoo Filter
Loading...
Searching...
No Matches
Cuckoo-GPU

Documentation arXiv

A high-performance, lock-free CUDA implementation of the Cuckoo Filter. This library is the companion code for the paper **"Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs"**.

Overview

This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.

Features

  • CUDA-accelerated batch insert, lookup, and delete operations
  • Configurable fingerprint size and bucket size
  • Multiple eviction policies (DFS, BFS)
  • Sorted insertion mode for improved memory coalescing
  • Multi-GPU support via gossip
  • Experimental IPC support for cross-process filter sharing
  • Header-only library design

Performance

image

Benchmarks at 95% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) with 16-bit fingerprints and equivalent space allocation for the Blocked Bloom Filter. The PCF runs on an Intel Xeon W9-3595X CPU (120 threads).

Cuckoo-GPU is compared against:

L2-Resident (4M items, ~8 MiB)

Comparison Insert Query Delete FPR
Cuckoo-GPU vs PCF 175× faster 351× faster N/A 0.046% vs 0.011%
Cuckoo-GPU vs TCF 4× faster 35× faster 108× faster 0.046% vs 0.409%
Cuckoo-GPU vs GQF 378× faster 6× faster 258× faster 0.046% vs 0.001%
Cuckoo-GPU vs GBBF 0.35× (slower) 1.2× faster N/A 0.046% vs 2.503%
Cuckoo-GPU vs BCHT 11.3× faster 40.9× faster N/A 0.046% vs 0%

DRAM-Resident (268M items, ~512 MiB)

Comparison Insert Query Delete FPR
Cuckoo-GPU vs PCF 69× faster 143× faster N/A 0.044% vs 0.010%
Cuckoo-GPU vs TCF 2.1× faster 9.9× faster 44.9× faster 0.044% vs 0.467%
Cuckoo-GPU vs GQF 10.1× faster 2.6× faster 3.6× faster 0.044% vs 0.001%
Cuckoo-GPU vs GBBF 0.7× (slower) 0.9× (slower) N/A 0.044% vs 6.092%
Cuckoo-GPU vs BCHT 8.5× faster 15.9× faster N/A 0.044% vs 0%

‍[!NOTE] A much more comprehensive evaluation, including additional systems and analyses, is presented in the accompanying thesis.

Requirements

  • CUDA Toolkit (>= 12.9)
  • C++20 compatible compiler
  • Meson build system (>= 1.3.0)

Building

meson setup build
meson compile -C build

Benchmarks and tests are built automatically when this is a standalone project, but skipped when it is used as a subproject. You can override this with Meson feature options:

Option Behavior
-Dbenchmarks=auto (default) Build benchmarks when standalone, skip when subproject
-Dbenchmarks=enabled Always build benchmarks
-Dbenchmarks=disabled Never build benchmarks
-Dtests=auto (default) Build tests when standalone, skip when subproject
-Dtests=enabled Always build tests
-Dtests=disabled Never build tests

Examples:

# Standalone: build everything (default)
meson setup build
# Standalone: skip benchmarks and tests
meson setup build -Dbenchmarks=disabled -Dtests=disabled
# Subproject: force benchmarks and tests on
meson setup build -Dbenchmarks=enabled -Dtests=enabled

Usage

// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
// Create a filter with the desired capacity
cuckoogpu::Filter<Config> filter(1 << 20); // capacity for ~1M items
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
// Check membership
filter.containsMany(d_keys, numKeys, d_results);
// Delete keys
filter.deleteMany(d_keys, numKeys, d_results);
A CUDA-accelerated Cuckoo Filter implementation.
Configuration structure for the Cuckoo Filter.

Configuration Options

The Config template accepts the following parameters:

Parameter Description Default
T Key type -
bitsPerTag Fingerprint size in bits (8, 16, 32) -
maxEvictions Maximum eviction attempts before failure 500
blockSize CUDA block size 256
bucketSize Slots per bucket (must be power of 2) 16
AltBucketPolicy Alternate bucket calculation policy XorAltBucketPolicy
evictionPolicy Eviction strategy (DFS or BFS) BFS
WordType Atomic type (uint32_t or uint64_t) uint64_t

Multi-GPU Support

For workloads that exceed single GPU capacity:

cuckoogpu::FilterMultiGPU<Config> filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, numKeys, h_results);
A multi-GPU implementation of the Cuckoo Filter.

Project Structure

include/cuckoogpu/ - Header files
CuckooFilter.cuh - Main filter implementation
CuckooFilterMultiGPU.cuh - Multi-GPU implementation
CuckooFilterIPC.cuh - IPC support
bucket_policies.cuh - Alternative bucket policies
helpers.cuh - Helper functions
src/ - Example applications
benchmarks/ - benchmarks
tests/ - Unit tests
scripts/ - Scripts for running/plotting benchmarks