|
cuSBF
|
cuSBF is a high-performance GPU implementation of the Super Bloom filter, optimized for high-throughput batch k-mer insertion and query on nucleotide (DNA) and protein sequences (or any other sequence type as long as a valid alphabet is provided).
It exploits the streaming nature of sequence-derived k-mers by using minimizers to group consecutive k-mers sharing the same minimiser into super-k-mers, assigning all k-mers of a super-k-mer to the same 256-bit memory shard. This amortizes random memory accesses across consecutive k-mer queries, reducing memory-bandwidth pressure. The findere scheme further reduces false positives dramatically by inserting overlapping s-mers and requiring a full run of consecutive s-mer matches.
Benchmarks (Config<31, 28, 16, 4>) comparing cuSBF against NVIDIA's cuco bloom_filter on an NVIDIA RTX PRO 6000 Blackwell GPU with random DNA sequence inputs. Throughput is reported in billions of k-mers per second (Gk-mers/s). Timings include GPU-side sequence encoding for cuco, so both methods consume the same input sequence.
| Dataset Size | Operation | cuSBF | cuco Bloom | Speed-up |
|---|---|---|---|---|
| ~4M k-mers | Insert | 62.9 Gk-mers/s | 31.5 Gk-mers/s | 2.0× |
| ~4M k-mers | Query | 109 Gk-mers/s | 44.2 Gk-mers/s | 2.5× |
| ~268M k-mers | Insert | 46.6 Gk-mers/s | 5.9 Gk-mers/s | 7.9× |
| ~268M k-mers | Query | 101 Gk-mers/s | 13.2 Gk-mers/s | 7.7× |
The findere scheme (s-mer width) provides strong false-positive reduction at equivalent memory. For Config<31, 28, 16, 4> on ~4.6M inserted k-mers queried against 10⁹ random k-mers:
| Bits/k-mer | cuSBF FPR | cuco Bloom FPR |
|---|---|---|
| 58 | 0.0035% | 0.064% |
| 116 | 0.00036% | 0.017% |
| 231 | 0.000092% | 0.0054% |
Benchmarks can be reproduced with:
Benchmarks and tests are built automatically when this is a standalone project, but skipped when used as a subproject. Control with Meson feature options:
| Option | Behaviour |
|---|---|
-Dbenchmarks=auto (default) | Build benchmarks when standalone, skip when subproject |
-Dbenchmarks=enabled | Always build benchmarks |
-Dbenchmarks=disabled | Never build benchmarks |
-Dtests=auto (default) | Build tests when standalone, skip when subproject |
-Dtests=enabled | Always build tests |
-Dtests=disabled | Never build tests |
-Dexamples=auto (default) | Build examples when standalone, skip when subproject |
-Dexamples=enabled | Always build examples |
-Dexamples=disabled | Never build examples |
-Dparam_sweep=disabled (default) | Never build parameter-sweep benchmark (s, m) for DNA or protein |
-Dparam_sweep=enabled | Build parameter-sweep benchmark (s, m) for DNA or protein |
-Dparam_sweep_alphabet=dna | Alphabet for param_sweep: dna or protein |
[!IMPORTANT] The parameter sweep is disabled by default for a reason, there are 208 binaries for the entire sweep when using the DNA alphabet.
Examples:
The Config template accepts the following parameters:
| Parameter | Description | Default |
|---|---|---|
K | k-mer length (max depends on alphabet) | - |
S | s-mer width for findere Bloom hash seed (1-K) | - |
M | Minimiser width for shard selection (1-K) | - |
HashCount | Number of independent Bloom hash functions (1-16) | 4 |
CudaBlockSize | CUDA threads per block | 256 |
Alphabet | Symbol encoding (DNA or protein) | DnaAlphabet |