Core Algorithms

Bloom Filters: The Probabilistic Data Structure You Can't Ignore

Understand Bloom filters - the probabilistic data structure behind Bigtable, Cassandra SSTable lookups, Chrome malicious URL checks, and CDN cache routing. Learn how to tune false positives.

bloom-filterprobabilisticfalse-positiveshash-functionsCassandraBigtable

Bloom Filters: The Probabilistic Data Structure You Can't Ignore

I still remember the first time I encountered the "million missing element problem." We were building a web crawler at a startup, and we needed to track every URL we had ever seen. A set of 500 million URLs, each averaging 100 bytes, would require 50 GB of RAM just for the keys. We didn't have that kind of memory. A Bloom filter cut the memory requirement to 600 MB with an acceptable false positive rate of 1%. That was the moment I fell in love with probabilistic data structures.

Bloom filters answer one question with extreme space efficiency: "Have I seen this element before?" They give a definitive "no" or a probabilistic "yes." False positives are possible; false negatives are not. This tradeoff—trading perfect accuracy for massive memory savings—makes Bloom filters indispensable in systems serving billions of requests.

Where you already use Bloom filters: Every time you check if a URL is malicious in Chrome, every time Cassandra avoids a disk read for a key that doesn't exist, every time an HBase region server checks if a row is in an SSTable—you're using Bloom filters. They run silently in the infrastructure powering your daily internet experience.


How Bloom Filters Work

A Bloom filter is an m-bit array and k independent hash functions. Initially, all bits are set to 0.

The Data Structure

Insertion Flow

When inserting an element, you compute k hash functions, each producing a position between 0 and m-1, and set those bits to 1.

Membership Check Flow

To check if an element is in the set, compute the same k hash functions. If any of the bits at those positions is 0, the element is definitely not in the set. If all bits are 1, the element might be in the set (with a probability of false positive).

OperationStepsTime Complexity
InsertCompute k hashes, set k bitsO(k)
QueryCompute k hashes, check k bitsO(k)
False negativeImpossible0%
False positivePossible~(1 - e^(-kn/m))^k
DeletionNot supported (standard)N/A

Tuning the False Positive Rate

The false positive rate is a function of three parameters:

  • m: size of the bit array (bits)
  • n: number of elements inserted
  • k: number of hash functions

The false positive probability formula:

p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

Finding the Optimal k

The optimal number of hash functions for given m and n is:

k_opt = (m / n) * ln(2)

At this optimum, the false positive rate is:

p_opt = (1/2)^k_opt = (0.6185)^(m/n)

Practical Parameter Table

m/n (bits per element)k (optimal hash count)False Positive RateExample for 10M elements
4315.2%5 MB, 153k false positives
645.6%7.5 MB, 56k false positives
862.1%10 MB, 21k false positives
1070.8%12.5 MB, 8k false positives
1280.3%15 MB, 3k false positives
14100.1%17.5 MB, 1k false positives
16110.04%20 MB, 400 false positives
20140.006%25 MB, 60 false positives
⚠️

The 80/20 rule of Bloom filters: Most systems can survive a 1% false positive rate. Going from 1% to 0.1% doubles your memory. Going from 0.1% to 0.01% doubles it again. The last few nines are exponentially expensive. Always measure your application's tolerance for false positives before tuning.

Choosing Hash Functions

You need k independent hash functions. Don't implement k different hash algorithms—use a technique called double hashing:

h_i(x) = h1(x) + i * h2(x) + i^2 mod m

Where h1 and h2 are two independent hash functions (e.g., MurmurHash3 and xxHash). This gives you k effectively independent hash positions with just two actual hash computations.

Hash FunctionSpeedDistributionUse Case
MurmurHash3FastExcellentGeneral purpose
xxHashVery fastExcellentHigh-throughput systems
SHA-256SlowPerfectSecurity-sensitive filters
FNV-1aFastGoodSmall filters, embedded systems

Real-World Applications

Chrome Malicious URL Checking

When you visit a URL in Chrome, the browser checks it against Google's database of malicious URLs. This database contains billions of entries. Chrome downloads a Bloom filter of the entire database (approximately 2-4 MB). Every URL you visit is checked locally against this filter.

The Bloom filter means the common case—safe URLs—requires zero network requests. Only when the filter says "maybe malicious" does Chrome need to phone home. This single use case reduces latency for billions of URL checks daily.

Cassandra SSTable Lookups

Cassandra uses Bloom filters as the first line of defense against unnecessary disk I/O. Each SSTable (Sorted String Table) on disk has a Bloom filter in memory. Before reading from an SSTable, Cassandra checks the Bloom filter.

StepActionIf FalseIf True
1Check partition key against SSTable Bloom filterSkip this SSTableCheck further
2Check partition key cacheN/ADirect read
3Check compression offset metadataN/APartial read
4Read the data from diskN/AReturn result

Without Bloom filters, every read would check every SSTable, causing O(N) disk reads per query where N is the number of SSTables. With Bloom filters, Cassandra typically checks 1-2 SSTables per query regardless of how many SSTables exist.

Cassandra Bloom filter sizing: Cassandra maintains a Bloom filter of about 10-12 bits per element. With default settings, Bloom filters use about 1-2 GB of memory per TB of data on disk. You can tune bloom_filter_fp_chance per table. Setting it to 0.01 gives 1% false positive rate; 0.1 gives 10% but uses less memory.

Bitcoin Simplified Payment Verification (SPV)

Bitcoin SPV clients don't download the full blockchain. Instead, they request transactions from peers using Bloom filters. The client sends a Bloom filter to a Bitcoin node, which returns matching transactions. The filter is tuned so that it returns some false positives, preventing the node from knowing exactly which addresses the client cares about.

This is a privacy-preserving use case: the false positives create plausible deniability about which transactions the client is interested in.


Counting Bloom Filters: Supporting Deletions

Standard Bloom filters don't support deletions. Once you set a bit to 1, you can't unset it because multiple elements might share that bit. The Counting Bloom Filter solves this by replacing each bit with a small counter (typically 2-4 bits).

PropertyStandard Bloom FilterCounting Bloom Filter
Bit per position1 bit2-4 bits
Memory overheadm bits2m to 4m bits
Deletion supportNoYes
Overflow riskN/ACounter wraps (rare, mitigated with 4-bit)
Use caseStatic setsDynamic sets with deletions

The tradeoff is memory: a Counting Bloom Filter uses 2-4x more memory. The counter size must be large enough to avoid overflow. With 4-bit counters (max value 15), overflow is statistically unlikely if k* n / m is reasonable.


Blocked Bloom Filters: Optimizing for CPU Cache

Standard Bloom filters require k random memory accesses per query. Each access is to a potentially different cache line. When m is large, this means k cache misses per query—expensive.

Blocked Bloom Filters divide the bit array into blocks that fit in a single CPU cache line (typically 512 bits or 64 bytes). Each element is first hashed to a block, then k hash functions are computed within that block.

Filter TypeMemory AccessesCPU Cache FriendlyFalse Positive Rate
Standardk randomPoorBaseline
Blocked1 block + k in-blockExcellentSlightly higher
Cuckoo2-3 randomGoodLower

The downside is a slightly higher false positive rate because elements competing for the same block have fewer distinct bit positions. In practice, the CPU cache efficiency gain more than compensates for the extra false positives.


Scalable Bloom Filters

Standard Bloom filters require you to know n (the number of elements) in advance. If you underestimate n, the false positive rate skyrockets. If you overestimate, you waste memory.

Scalable Bloom Filters solve this by creating a series of Bloom filters with geometrically growing capacity. When the current filter reaches its capacity, a new filter (with double the capacity and tighter false positive bounds) is appended.

The total false positive rate is bounded by the sum of individual filter rates. Using a geometric series where each filter has half the false positive rate of the previous one keeps the total bounded at approximately 2x the first filter's rate.

PropertyStandard Bloom FilterScalable Bloom Filter
n known in advanceRequiredNot required
Memory allocationFixedGrows incrementally
False positive rateFixed (predictable)Increases slightly with each new filter
Deletion supportNoNo
Query timeO(k)O(k * num_filters)
Use caseFixed-size membershipCaches, streaming data

Bigtable / HBase Bloom Filter Usage

Google's Bigtable (and its open-source counterpart HBase) uses Bloom filters at the block level within SSTables. Each SSTable is divided into 64 KB blocks. Each block has its own Bloom filter.

This multi-level filtering means Bigtable might have 10,000 SSTables but only needs to read 1-2 blocks (64 KB each) per query. Without Bloom filters, it would need to read the block index and potentially decompress multiple blocks.

Row-Level vs Row+Column-Level Filters

HBase supports two modes:

Filter ModeWhat It FiltersMemory per SSTableFalse Positive Impact
ROWEntire row present~10 bits per rowMay decompress block unnecessarily
ROWCOLSpecific column in row~10 bits per (row, family, qualifier)Much more precise

ROWCOL filters are more precise but use significantly more memory. For tables with many columns per row, ROW is usually sufficient.


CDN Cache Routing with Bloom Filters

CDNs use Bloom filters at edge nodes to determine which origin server might have a cached copy of a resource. This is the "cache of caches" pattern.

When a user requests a resource, the edge node:

  1. Checks its local cache (fast, local)
  2. If missed, checks a Bloom filter to see if any peer edge node has it
  3. If the filter says "maybe," sends a conditional request to the peer
  4. If the filter says "no," goes directly to the origin
ScenarioActionLatency Impact
Local cache hitServe immediatelyNone
Peer cache hit (filter true positive)Conditional request to peer+5-10 ms
Peer cache miss (filter false positive)Conditional request, then origin+10-20 ms (wasted check)
Filter missDirect to originOptimal path

The false positive cost is a wasted peer check—typically 5-10 ms. This is acceptable when it saves a full origin round trip (50-200 ms) for cache hits at peer nodes.


Bloom Filter Variants in Depth

Partitioned Bloom Filters

The m-bit array is divided into k equal-sized regions. Each hash function maps to exactly one region. This improves CPU cache behavior because each hash lookup is confined to a smaller memory region.

VariantMemoryCache BehaviorFP Rate
Standardm bitsPoor (random access)Baseline
Partitionedm bitsBetter (region-local)Slightly higher
Blockedm bitsBest (cache-line-local)Slightly higher

Stable Bloom Filters

For streaming applications where the set changes over time and old elements should be "forgotten," Stable Bloom Filters introduce a cell eviction mechanism. Each cell has a probability of being reset to 0 on each insert. This creates a sliding window of recent elements.

Shifting Bloom Filters

A recent research variant that reduces false positive rates by encoding additional metadata in the bit positions. Used in network monitoring to track flow-level statistics with minimal memory.


When NOT to Use Bloom Filters

Bloom filters are not the right tool for every job. Here are situations where alternatives like Cuckoo filters, Quotient filters, or exact data structures are better.

LimitationAlternativeWhy Better
Need deletionCounting BF, Cuckoo filterCuckoo supports deletion natively with no memory overhead
Need to list elementsHash set, B-treeBloom filters don't store elements, just membership
Very low FP requiredCuckoo filter (less than 0.1%)Cuckoo achieves 95%+ load factor with lower FP
Small setExact hash setFor n under 1M, a hash set is often competitive in memory
Need dynamic sizingScalable Bloom filterCan grow without knowing n in advance
Need to count distinct elementsHyperLogLogHLL estimates cardinality, not membership

Cuckoo Filters: A Quick Comparison

Cuckoo filters are the most popular Bloom filter alternative. They use cuckoo hashing instead of bit arrays.

PropertyBloom FilterCuckoo Filter
Space per item~6.4 bits at 1% FP~7 bits at 1% FP
False positive rateTuneableTuneable
DeletionCounting variant onlyNative
Lookupsk memory accesses2 memory accesses
InsertionAlways O(k)May need relocation
Load factor~50% (bits set)Up to 95%
Max capacityFixed at creationFixed at creation

My rule of thumb: Use Bloom filters when you need space efficiency above all else, false positives are acceptable at the 0.1-1% level, and you don't need deletions. Use Cuckoo filters when you need lower false positive rates, deletion support, or better cache performance. For cardinality estimation, use HyperLogLog. For frequency estimation, use Count-Min Sketch.


Practical Implementation Guidance

Encoding and Serialization

When sending Bloom filters over the network (e.g., Chrome downloading a filter from Google), the bit array is usually compressed with a run-length encoding or LZ4 compression. A 2 MB Bloom filter typically compresses to 500-800 KB.

Thread Safety

Standard Bloom filter inserts are NOT thread-safe without synchronization. Two threads inserting concurrently can race on bit setting. Solutions:

  • Per-thread filters: Each thread has its own filter; merge periodically by OR-ing bit arrays
  • Atomic bit operations: Use compare-and-swap on the word containing the target bit
  • Striped locking: Partition the bit array into stripes with per-stripe locks

Monitoring and Observability

MetricHow to MeasureAction if Abnormal
False positive rateTrack query results where filter said "present" but element wasn't insertedIncrease m (memory) if too high
Fill ratioCount 1-bits / total bitsMore than 80% means FP rate is degrading
Insert throughputOperations per secondCheck hash function CPU overhead
Query latencyP99 of check timeCheck for cache misses on bit array
⚠️

The silent degradation trap: As a Bloom filter fills up, the false positive rate degrades silently—no errors are thrown, no alarms fire. Your system just starts doing more fallback checks. Always monitor the fill ratio and false positive rate directly. Set an alert when fill ratio exceeds 70% so you can plan to resize or rebuild the filter.


Summary

Bloom filters are the ultimate example of trading accuracy for efficiency. They solve the "have I seen this before" problem using pennies of memory where exact data structures would require dollars.

ConceptKey Takeaway
Core ideam-bit array with k hash functions
GuaranteeFalse negatives impossible; false positives possible
Memory~10 bits per element for 1% false positive rate
Optimal k(m/n) * ln(2)
Counting BFAdds deletion at 2-4x memory cost
Blocked BFCache-efficient variant
Production useChrome, Cassandra, Bitcoin, Bigtable, CDNs

The beauty of Bloom filters is that they force you to think probabilistically. In distributed systems, certainty is expensive. A 1% chance of being wrong, combined with a fallback check, is often 100x cheaper than being certain all the time. This is a mindset shift: embrace probability, and you can build systems that would be impossible with exact data structures alone.

In the next tutorial, we will explore the Write-Ahead Log (WAL) — the algorithm that guarantees your data survives crashes and powers the durability guarantees of PostgreSQL, Kafka, and every serious database.

"A Bloom filter is like a bouncer who definitely knows who isn't on the list but might accidentally let in a few people who aren't." — Unknown