Skip to content

Feature: support hash index for faster lookup when no range queries are needed #3527

@lvca

Description

@lvca

Why ArcadeDB Needed a Hash Index

ArcadeDB's existing LSM-Tree index is optimized for range scans and sequential access. However, the most common database operation — point lookups (primary key access, JOIN resolution, edge traversal) —
doesn't need ordering. LSM pays for it anyway: multiple sorted levels, compaction overhead, and 2-3 page reads with binary searches per level.

A hash index achieves O(1) lookups with 1-2 page reads, purpose-built for equality queries.

Benchmark Results (10M records)

┌───────────────────────────┬─────────────┬─────────────┬─────────────┐
│         Operation         │  LSM Tree   │ Hash Index  │ Improvement │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Bulk Insert               │ 18.6K/s     │ 43.8K/s     │ 2.4x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Point Lookup              │ 16.5K ops/s │ 125K ops/s  │ 7.6x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ SQL Lookup (WHERE id = ?) │ 13.5K ops/s │ 38.5K ops/s │ 2.8x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Update by Key             │ 5.8K ops/s  │ 13.2K ops/s │ 2.3x faster │
├───────────────────────────┼─────────────┼─────────────┼─────────────┤
│ Delete + Re-insert        │ 3.9K ops/s  │ 4.7K ops/s  │ 1.2x faster │
└───────────────────────────┴─────────────┴─────────────┴─────────────┘

Design: Extendible Hashing

The implementation uses extendible hashing — a proven disk-oriented algorithm with 40+ years of production history:

  • No compaction needed — unlike LSM, there are no background merge operations, no write amplification, no throughput degradation at scale
  • Constant-time lookups — hash key → read directory → read bucket page (2 page reads, both cacheable)
  • Local splits — when a bucket is full, only that bucket splits; no global reorganization
  • Space efficient — no tombstones, no duplicate entries across levels

When to Use Which

  ┌───────────────────────────────────────┬───────────────────┐
  │               Use Case                │ Recommended Index │
  ├───────────────────────────────────────┼───────────────────┤
  │ WHERE id = ?, JOINs, edge traversal   │ Hash              │
  ├───────────────────────────────────────┼───────────────────┤
  │ WHERE age > 30, ORDER BY, range scans │ LSM Tree          │
  └───────────────────────────────────────┴───────────────────┘

The two indexes are complementary. Hash is the better default for primary keys and any field accessed by equality, while LSM remains the right choice for range queries and ordered access.

Metadata

Metadata

Assignees

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions