A fast, scalable Python implementation of Bloom filters - space-efficient probabilistic data structures for set membership testing.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It can tell you with certainty that an element is not in the set, or that it might be in the set (with a controllable false positive probability). Bloom filters never produce false negatives.
Use cases:
- Web crawlers: Track visited URLs to avoid re-crawling
- Caching systems: Quick membership checks before expensive lookups
- Database optimization: Pre-filter queries to avoid unnecessary disk reads
- Distributed systems: Efficient set reconciliation
- Network routers: Fast packet filtering
- Spell checkers: Quick dictionary lookups
-
Two implementations:
BloomFilter: Fixed-capacity filter for known dataset sizesScalableBloomFilter: Automatically grows to accommodate more elements
-
High performance: Uses xxHash for 128-bit hashing - one of the fastest non-cryptographic hash functions available
-
Set operations: Union and intersection support for combining filters
-
Serialization: Efficient binary file format for saving/loading filters
-
Python 3.6+: Modern Python 3 implementation without legacy Python 2 baggage
-
Optimized space usage: Tightening ratio of 0.9 provides excellent space efficiency across a wide range of growth patterns
Install from PyPI:
pip install pybloom-liveOr install from source:
git clone https://github.com/joseph-fox/python-bloomfilter.git
cd python-bloomfilter
pip install .Verify installation:
pip show pybloom-live
# Should show version 4.0.0 or higherimport pybloom_live
# Create a Bloom Filter with capacity for 1000 elements
# and a 0.1% false positive rate
f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)
# Add elements
f.add("apple")
f.add("banana")
f.add("orange")
# Check membership
print("apple" in f) # True
print("grape" in f) # False
# The add() method returns True if the element might already exist
already_exists = f.add("apple") # Returns True
already_exists = f.add("mango") # Returns FalseThe ScalableBloomFilter automatically grows as you add more elements, maintaining the target error rate:
import pybloom_live
# Create a scalable filter with initial capacity of 100
sbf = pybloom_live.ScalableBloomFilter(
initial_capacity=100,
error_rate=0.001,
mode=pybloom_live.ScalableBloomFilter.LARGE_SET_GROWTH
)
# Add many elements - it scales automatically
for i in range(10000):
sbf.add(i)
print(f"Elements: {len(sbf)}") # ~10000
print(f"Capacity: {sbf.capacity}") # Automatically expanded
print(f"Filters: {len(sbf.filters)}") # Multiple internal filters
# Check membership
print(5000 in sbf) # True
print(10000 in sbf) # FalseGrowth modes:
SMALL_SET_GROWTH = 2: Slower growth, uses less memoryLARGE_SET_GROWTH = 4: Faster growth (default), better for large datasets
import pybloom_live
# Track visited URLs to avoid re-crawling
visited = pybloom_live.ScalableBloomFilter(
initial_capacity=10000,
error_rate=0.001
)
urls_to_crawl = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page1", # duplicate
"https://example.com/page3",
]
for url in urls_to_crawl:
if url in visited:
print(f"SKIP (already crawled): {url}")
else:
visited.add(url)
print(f"CRAWL: {url}")
# crawl_page(url)Combine filters using union and intersection:
import pybloom_live
# Create two filters with same capacity and error rate
f1 = pybloom_live.BloomFilter(capacity=100, error_rate=0.01)
f2 = pybloom_live.BloomFilter(capacity=100, error_rate=0.01)
# Add elements
for i in range(20):
f1.add(i)
for i in range(10, 30):
f2.add(i)
# Union - elements in either filter
f_union = f1 | f2 # or f1.union(f2)
print(0 in f_union) # True (in f1)
print(15 in f_union) # True (in both)
print(25 in f_union) # True (in f2)
# Intersection - elements likely in both filters
f_intersect = f1 & f2 # or f1.intersection(f2)
print(5 in f_intersect) # False (only in f1)
print(15 in f_intersect) # True (in both)
print(25 in f_intersect) # False (only in f2)Save and load filters from disk:
import pybloom_live
# Create and populate a filter
bf = pybloom_live.BloomFilter(capacity=10000, error_rate=0.001)
for i in range(1000):
bf.add(f"user_{i}")
# Save to file
with open('bloom_filter.bin', 'wb') as f:
bf.tofile(f)
# Load from file
with open('bloom_filter.bin', 'rb') as f:
loaded_bf = pybloom_live.BloomFilter.fromfile(f)
# Verify
print("user_500" in loaded_bf) # True
print("user_9999" in loaded_bf) # Falseimport pybloom_live
# Create a filter with 0.1% error rate
f = pybloom_live.BloomFilter(capacity=1000, error_rate=0.001)
# Add exactly capacity elements
for i in range(1000):
f.add(i)
# Test false positive rate
false_positives = 0
tests = 10000
for i in range(1000, 1000 + tests):
if i in f: # These were never added
false_positives += 1
actual_error_rate = false_positives / tests
print(f"Requested error rate: {f.error_rate}")
print(f"Actual error rate: {actual_error_rate:.4f}")
# Actual rate should be close to 0.001Constructor:
BloomFilter(capacity, error_rate=0.001)capacity: Maximum number of elementserror_rate: False positive probability (0 < error_rate < 1)
Methods:
add(key, skip_check=False): Add element; returnsTrueif element might already exist__contains__(key): Check membership usinginoperator__len__(): Return count of elements addedcopy(): Create a copy of the filterunion(other)/__or__: Combine with another filter (bitwise OR)intersection(other)/__and__: Intersect with another filter (bitwise AND)tofile(f): Serialize to file objectfromfile(f, n=-1): Deserialize from file (class method)
Properties:
capacity: Maximum capacityerror_rate: Target false positive ratecount: Number of elements addednum_bits: Total bits in filternum_slices: Number of hash functions
Constructor:
ScalableBloomFilter(initial_capacity=100, error_rate=0.001, mode=LARGE_SET_GROWTH)initial_capacity: Starting capacityerror_rate: Target false positive probabilitymode: Growth mode (SMALL_SET_GROWTH=2orLARGE_SET_GROWTH=4)
Methods:
add(key): Add element to filter__contains__(key): Check membership__len__(): Return total count across all internal filtersunion(other): Combine with another scalable filtertofile(f)/fromfile(f): Serialization support
Properties:
capacity: Total capacity (sum of all internal filters)count: Total elements addederror_rate: Target error ratefilters: List of internalBloomFilterinstances
The library automatically selects the optimal hash function based on the number of bits required:
| Bits Required | Hash Function | Type |
|---|---|---|
| ≤ 128 | xxh3_128 | Non-cryptographic (fastest) |
| 129-160 | SHA-1 | Cryptographic |
| 161-256 | SHA-256 | Cryptographic |
| 257-384 | SHA-384 | Cryptographic |
| > 384 | SHA-512 | Cryptographic |
This implementation follows the Scalable Bloom Filter algorithm from:
P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters, (GLOBECOM 2007), IEEE, 2007.
Key optimizations:
- Tightening ratio of 0.9: Provides better average space usage across a wide range of growth patterns
- Exponential growth: New filters are 2x or 4x larger than the previous one
- Error rate tightening: Each new filter has a tighter error probability to maintain overall target rate
- Optimal bit calculation: Uses
M ≈ n × |ln(P)| / (ln(2)²)where n is capacity and P is error rate
Version 4.0.0 introduces significant performance improvements but is not backward compatible with earlier versions:
- Hash function change: Replaced MD5 with xxHash (xxh3_128) for 128-bit hashes
- Performance: Significantly faster hash computation (see benchmarks)
- Incompatibility: Files saved with versions < 4.0 cannot be loaded with version 4.x
Migration: If you have serialized filters from earlier versions, you must regenerate them using version 4.0+.
Typical performance characteristics (results may vary by system):
- Add operation: ~500k-1M elements/second
- Lookup operation: ~700k-2M checks/second
- Memory usage: ~1.2 bits per element (with default 0.1% error rate)
- Python 3.6+
- Python 3.7+
- Python 3.8+
- Python 3.9+
- Python 3.10+
- Python 3.11+
- Python 3.12+
Note: Python 2 support has been dropped in version 4.0+. For Python 2 compatibility, use an earlier version.
We follow the git branching model described by Vincent Driessen.
Running tests:
pip install pytest
pytest pybloom_live/test_pybloom.pyMIT License - see LICENSE.txt for details.
- Original
pybloomby Jay Baird - Maintained and enhanced by the
pybloom_livecontributors - Uses xxHash for high-performance hashing