
Lucene103 blocktree TrieReader regression: 4x slower seekExact for _id lookups vs lucene90 FST #15820

@Vikasht34

Description

Component: core/codecs

The lucene103 blocktree codec replaced the in-memory FST term index with
an on-disk TrieReader. This causes a significant performance regression
for workloads that perform high-frequency seekExact() calls on the _id
field during document indexing.

Environment

  • OpenSearch 3.3 (Lucene 10.x with lucene103 codec) vs OpenSearch 2.19 (Lucene 9.12.0 with lucene90 codec)
  • JDK: Amazon Corretto 21.0.8
  • Workload: 32 KNN indices, 6 shards each, mixed ingest+query (50/50),
    bulk indexing with explicit _id (UUID), ~400 segments per index at
    refresh_interval=1s

Problem

Every indexed document with an explicit _id triggers
PerThreadIDVersionAndSeqNoLookup.getDocID(), which calls
SegmentTermsEnum.seekExact(BytesRef) on every segment to check for
version conflicts. With ~400 segments per index, each document therefore
requires ~400 seekExact calls.
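The per-segment fan-out can be illustrated with a small stdlib-only model. This is not Lucene code; PerSegmentLookupModel, seekExact, and lookup are hypothetical stand-ins for SegmentTermsEnum.seekExact and PerThreadIDVersionAndSeqNoLookup.getDocID:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Minimal model of why per-document _id lookups scale with segment count:
// the version-conflict check must probe segments until a match is found,
// so a miss (a brand-new _id) costs one seekExact per segment.
public class PerSegmentLookupModel {
    static int probes = 0;

    // Stand-in for SegmentTermsEnum.seekExact on one segment's terms index.
    static boolean seekExact(Set<String> segmentIds, String id) {
        probes++;
        return segmentIds.contains(id);
    }

    // Stand-in for the per-document lookup across all segments of an index.
    static boolean lookup(List<Set<String>> segments, String id) {
        for (Set<String> seg : segments) {
            if (seekExact(seg, id)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int numSegments = 400;
        List<Set<String>> segments = new ArrayList<>();
        for (int i = 0; i < numSegments; i++) segments.add(new HashSet<>());

        // A fresh UUID exists in no segment: worst case, all 400 are probed.
        lookup(segments, UUID.randomUUID().toString());
        System.out.println("probes for one new _id: " + probes); // 400
    }
}
```

Under these assumptions, indexing one new document costs as many seekExact calls as there are live segments, which is why segment count dominates the regression.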

In lucene90, seekExact navigates an in-memory FST (heap-resident).
In lucene103, seekExact navigates a TrieReader via memory-mapped file
reads, where each read triggers MemorySessionImpl.checkValidStateRaw(),
the Panama Foreign Memory API check that the backing memory session is
still alive.

JFR Evidence

Write thread profiling (JFR ExecutionSample) shows:

lucene10.3.1: 10.0% of write thread time in the seekExact path
DataInput.readVLong()
SegmentTermsEnumFrame.loadBlock()
SegmentTermsEnum.lambda$prepareSeekExact$1(BytesRef)
SegmentTermsEnum.seekExact(BytesRef)
PerThreadIDVersionAndSeqNoLookup.getDocID()

lucene9.12: 2.6% of write thread time in the seekExact path
FST$Arc$BitTable.isBitSet()
FST.findTargetArc()
SegmentTermsEnum.seekExact(BytesRef)
PerThreadIDVersionAndSeqNoLookup.getDocID()

Additionally, 6.6% of write thread time is spent in
MemorySessionImpl.checkValidStateRaw() on memory-mapped reads triggered
by the TrieReader navigation.

Combined: 16.6% of write thread time vs 2.6% on lucene90, a ~6.4x
regression for this code path.
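The regression factor follows directly from the JFR percentages above; a quick arithmetic check using only the numbers reported in this issue:

```java
public class RegressionMath {
    public static void main(String[] args) {
        // lucene103: seekExact path (10.0%) plus Panama liveness checks (6.6%)
        double lucene103Pct = 10.0 + 6.6;   // 16.6% of write thread time
        double lucene90Pct  = 2.6;          // FST-based seekExact path
        double factor = lucene103Pct / lucene90Pct;
        System.out.printf("%.1f%% vs %.1f%% = %.1fx regression%n",
                lucene103Pct, lucene90Pct, factor);
        // prints "16.6% vs 2.6% = 6.4x regression"
    }
}
```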

Impact

At 256,000 seekExact calls/sec (32 TPS × 20 docs/bulk × 400 segments),
this overhead causes:

  • 1.9x per-document indexing latency (577µs vs 303µs)
  • Search thread saturation under mixed workload (queries slow down due
    to CPU contention)
  • Ingestion stalls at 297k docs/tenant vs 600k+ on lucene90

Increasing refresh_interval from 1s to 30s (reducing segments from ~400
to ~13) mitigates the issue by cutting seekExact call volume ~30x,
pushing the stall point from 297k to 497k docs/tenant.
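The call-rate arithmetic behind both figures can be checked directly; this sketch uses only the numbers reported in this issue (32 TPS, 20 docs/bulk, ~400 vs ~13 segments):

```java
public class SeekRateMath {
    public static void main(String[] args) {
        int tps = 32, docsPerBulk = 20;

        // refresh_interval=1s: ~400 segments per index
        int callsAt1s = tps * docsPerBulk * 400;   // 256,000 seekExact/sec
        // refresh_interval=30s: ~13 segments per index
        int callsAt30s = tps * docsPerBulk * 13;   // 8,320 seekExact/sec

        System.out.println("1s refresh:  " + callsAt1s + " seekExact/sec");
        System.out.println("30s refresh: " + callsAt30s + " seekExact/sec");
        System.out.println("reduction:   ~" + (callsAt1s / callsAt30s) + "x"); // ~30x
    }
}
```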

Root Cause

Two compounding factors:

  1. TrieReader replaces in-memory FST with on-disk trie navigation.
    The FST was loaded into Java heap at segment open time — navigation
    was pure CPU (BitTable.isBitSet). The TrieReader reads from
    memory-mapped files, adding I/O indirection.

  2. Each memory-mapped read triggers checkValidStateRaw() — the Panama
    Foreign Memory API check that verifies the Arena (memory session) is
    still open. This check runs on every read from the mmapped file.
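A minimal stdlib sketch of the difference between the two read paths. ValidityCheckModel and its methods are hypothetical stand-ins, not the actual Panama or Lucene implementation; the point is only that the mmap path pays a per-read liveness check that the heap path does not:

```java
// Heap-resident FST reads are plain array accesses; the memory-mapped
// TrieReader path must confirm the backing session is still alive before
// every read, analogous to MemorySessionImpl.checkValidStateRaw().
public class ValidityCheckModel {
    static long checks = 0;

    static final byte[] data = new byte[1024];
    static volatile boolean sessionOpen = true; // stands in for Arena liveness

    // Heap path: direct array read, no per-read validation.
    static byte readHeap(int pos) {
        return data[pos];
    }

    // Mmap-style path: validate liveness on every single read.
    static byte readMapped(int pos) {
        if (!sessionOpen) throw new IllegalStateException("session closed");
        checks++; // one liveness check per read
        return data[pos];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) readMapped(i % data.length);
        System.out.println("liveness checks for 1000 reads: " + checks); // 1000
    }
}
```

In this model the check is a cheap flag test; the JFR data above suggests the real checkValidStateRaw() cost becomes material only at hundreds of thousands of reads per second.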

The _id field is special: it is looked up via seekExact on every single
document indexed. It has a random access pattern (UUIDs) that does not
benefit from the TrieReader's sequential access optimizations.
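Why random UUIDs are a worst case can be sketched by comparing the shared prefixes of consecutive lookup keys (AccessPatternSketch is a hypothetical illustration, not Lucene code):

```java
import java.util.UUID;

// Sequential ids share long prefixes between consecutive lookups, so trie
// navigation revisits the same warm nodes; random UUIDs share almost no
// prefix, so every seekExact walks a cold path from the trie root.
public class AccessPatternSketch {
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        // Sequential ids: long shared prefix between consecutive keys.
        System.out.println(commonPrefix("doc-0000123", "doc-0000124")); // 10
        // Random UUIDs: first hex chars match with probability 1/16, so
        // the shared prefix is almost always 0-1 characters.
        String u1 = UUID.randomUUID().toString();
        String u2 = UUID.randomUUID().toString();
        System.out.println(commonPrefix(u1, u2));
    }
}
```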

How to Reproduce

  1. Create an index with many small segments (refresh_interval=1s,
    continuous ingestion)
  2. Bulk index documents with explicit _id (UUIDs)
  3. Profile write threads with JFR
  4. Compare seekExact time between lucene90 and lucene103 codecs

