-
-
Notifications
You must be signed in to change notification settings - Fork 91
Description
Why ArcadeDB Needed a Hash Index
ArcadeDB's existing LSM-Tree index is optimized for range scans and sequential access. However, the most common database operation — point lookups (primary key access, JOIN resolution, edge traversal) —
doesn't need ordering. LSM pays for it anyway: multiple sorted levels, compaction overhead, and 2-3 page reads with binary searches per level.
A hash index achieves O(1) lookups with 1-2 page reads, purpose-built for equality queries.
Benchmark Results (10M records)
┌───────────────────────────┬─────────────┬─────────────┬─────────────┐
│ Operation │ LSM Tree │ Hash Index │ Improvement │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Bulk Insert │ 18.6K/s │ 43.8K/s │ 2.4x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Point Lookup │ 16.5K ops/s │ 125K ops/s │ 7.6x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ SQL Lookup (WHERE id = ?) │ 13.5K ops/s │ 38.5K ops/s │ 2.8x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Update by Key │ 5.8K ops/s │ 13.2K ops/s │ 2.3x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Delete + Re-insert │ 3.9K ops/s │ 4.7K ops/s │ 1.2x faster │
└───────────────────────────┴─────────────┴─────────────┴─────────────┘
Design: Extendible Hashing
The implementation uses extendible hashing — a proven disk-oriented algorithm with 40+ years of production history:
- No compaction needed — unlike LSM, there are no background merge operations, no write amplification, no throughput degradation at scale
- Constant-time lookups — hash key → read directory → read bucket page (2 page reads, both cacheable)
- Local splits — when a bucket is full, only that bucket splits; no global reorganization
- Space efficient — no tombstones, no duplicate entries across levels
When to Use Which
┌───────────────────────────────────────┬───────────────────┐
│ Use Case │ Recommended Index │
├───────────────────────────────────────┼───────────────────┤
│ WHERE id = ?, JOINs, edge traversal │ Hash │
├───────────────────────────────────────┼───────────────────┤
│ WHERE age > 30, ORDER BY, range scans │ LSM Tree │
└───────────────────────────────────────┴───────────────────┘
The two indexes are complementary. Hash is the better default for primary keys and any field accessed by equality, while LSM remains the right choice for range queries and ordered access.