Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled by herley-shaori · Pull Request #15825 · apache/lucene

herley-shaori · 2026-03-14T22:53:56Z

Summary

CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This breaks phrase queries when index-time and search-time analyzers use different outputUnigrams settings — a common optimization pattern for CJK search.
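
To make the failure mode concrete, here is a minimal sketch of exact phrase matching over token positions. It is a simplified model, not Lucene's PhraseQuery: the class and method names are hypothetical, each term is assumed to occur at a single position, and the positions are the ones listed in the example under "Root cause" (index built with outputUnigrams=true, query analyzed with outputUnigrams=false). The CJK characters are written as Unicode escapes: U+4E00, U+4E8C, and U+4E09.

```java
import java.util.List;
import java.util.Map;

// Simplified model of exact (slop 0) phrase matching; not Lucene's PhraseQuery.
final class PhraseMismatchSketch {

  /**
   * Returns true if every query term sits at the same offset from the first
   * query term in the index as it does in the query. Assumes one position per term.
   */
  static boolean phraseMatches(
      Map<String, Integer> indexedPos, List<String> terms, List<Integer> queryPos) {
    Integer base = indexedPos.get(terms.get(0));
    if (base == null) {
      return false;
    }
    for (int i = 1; i < terms.size(); i++) {
      Integer p = indexedPos.get(terms.get(i));
      int wanted = queryPos.get(i) - queryPos.get(0);
      if (p == null || p - base != wanted) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String one = "\u4e00", two = "\u4e8c", three = "\u4e09";
    // Index side, outputUnigrams=true: positions taken from the example.
    Map<String, Integer> indexed = Map.of(one, 0, one + two, 0, two, 1, three, 2);
    List<String> query = List.of(one + two, three);
    // Query positions without the fix (0, 1): the phrase does not line up.
    System.out.println(phraseMatches(indexed, query, List.of(0, 1))); // false
    // Query positions with the fix (0, 2): the phrase lines up.
    System.out.println(phraseMatches(indexed, query, List.of(0, 2))); // true
  }
}
```

In this model the only difference between the two calls is the position assigned to the token after the word break, which is the value this PR changes.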

Root cause

In flushBigram(), when outputUnigrams=false, bigrams are emitted with the default positionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to the outputUnigrams=true case.

Example with input "一二、三":

outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
outputUnigrams=false: 一二(pos0) 三(pos1) ← should be pos2
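
The two position sequences above can be reproduced with a small stand-alone model. This is a sketch under stated assumptions, not the filter's code: each argument is one run of adjacent CJK characters (the tokenizer is assumed to have already dropped the punctuation between runs), every character is assumed to fit in a single Java char, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of token positions; each String argument is one run of adjacent CJK characters.
final class BigramPositionDemo {

  /** Models outputUnigrams=true: every character gets its own position. */
  static List<String> withUnigrams(String... segments) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String seg : segments) {
      for (int i = 0; i < seg.length(); i++) {
        out.add(seg.substring(i, i + 1) + "(pos" + pos + ")");
        if (i + 1 < seg.length()) {
          // The bigram shares the position of its first character.
          out.add(seg.substring(i, i + 2) + "(pos" + pos + ")");
        }
        pos++;
      }
    }
    return out;
  }

  /** Models outputUnigrams=false before the fix: every emitted token advances by 1. */
  static List<String> bigramsOnlyBeforeFix(String... segments) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String seg : segments) {
      if (seg.length() == 1) {
        // A lone character has no neighbor, so it is emitted as a unigram.
        out.add(seg + "(pos" + pos + ")");
        pos++;
        continue;
      }
      for (int i = 0; i + 1 < seg.length(); i++) {
        out.add(seg.substring(i, i + 2) + "(pos" + pos + ")");
        pos++;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // U+4E00 U+4E8C is the two-character run, U+4E09 the character after the break.
    System.out.println(withUnigrams("\u4e00\u4e8c", "\u4e09"));
    System.out.println(bigramsOnlyBeforeFix("\u4e00\u4e8c", "\u4e09"));
  }
}
```

A run of n characters consumes n positions in the first method but only n-1 in the second, which is where the off-by-one after each word break comes from.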

Fix

Following the principle suggested by @rmuir — outputUnigrams=false should behave as if unigrams were emitted, then later removed — this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.

Two new fields in CJKBigramFilter:

hadBigrams: set true when a bigram is flushed in no-unigram mode
deferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)

The deferred increment is applied in flushBigram(), flushUnigram(), and the non-CJK passthrough path in incrementToken().
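
The deferred-increment idea can be sketched in the same toy model. The field names hadBigrams and deferredPosInc come from this PR, but how they are updated here is an assumption made for illustration; this is not the code in CJKBigramFilter.java. The non-CJK passthrough and end-of-stream paths are not modeled, each argument is one run of adjacent CJK characters, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the deferred extra increment; not the actual CJKBigramFilter code.
final class DeferredIncrementSketch {

  /** Models outputUnigrams=false with the extra +1 carried over each segment boundary. */
  static List<String> bigramsOnlyAfterFix(String... segments) {
    List<String> out = new ArrayList<>();
    int pos = -1;               // absolute position of the previously emitted token
    boolean hadBigrams = false; // did the current segment emit at least one bigram?
    int deferredPosInc = 0;     // extra increment owed to the next emitted token
    for (String seg : segments) {
      // Segment boundary: a segment that produced bigrams used one position too few.
      if (hadBigrams) {
        deferredPosInc += 1;
        hadBigrams = false;
      }
      if (seg.length() == 1) {
        // Lone character: emitted as a unigram, paying off any deferred increment.
        pos += 1 + deferredPosInc;
        deferredPosInc = 0;
        out.add(seg + "(pos" + pos + ")");
        continue;
      }
      for (int i = 0; i + 1 < seg.length(); i++) {
        pos += 1 + deferredPosInc;
        deferredPosInc = 0;
        out.add(seg.substring(i, i + 2) + "(pos" + pos + ")");
        hadBigrams = true;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Same input as the example: a two-character run, a break, then a lone character.
    System.out.println(bigramsOnlyAfterFix("\u4e00\u4e8c", "\u4e09"));
  }
}
```

For the example input this places the lone character at pos2, the same position it has with outputUnigrams=true. Bigram positions inside a segment are unchanged; only the first token after a boundary receives the extra increment.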

Changes

CJKBigramFilter.java: Added position tracking logic across CJK segment boundaries
TestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updated testHanOnly expected positions
TestWithCJKBigramFilter.java (ICU): Updated expected positions in testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset
CHANGES.txt: Added bug fix entry

Test plan

All 15 CJKBigramFilter tests pass (including 3 new tests)
All 12 ICU TestWithCJKBigramFilter tests pass
Code formatting verified via ./gradlew tidy
testBigramPositionsConsistentAcrossWordBreak — reproduces exact scenario from issue
testBigramPositionsMultipleSegments — verifies across multiple CJK segments with breaks
testBigramPositionsBeforeNonCJK — verifies CJK bigram followed by non-CJK text

…ams disabled (apache#15812)

When outputUnigrams=false, CJKBigramFilter produced different token positions compared to outputUnigrams=true. A bigram spans two character positions but only advanced the position counter by 1. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens were assigned incorrect positions, breaking phrase queries in combined unigram+bigram indexing strategies.

The fix tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary. This ensures outputUnigrams=false behaves "as if unigrams were emitted then removed", keeping positions aligned across both settings.

Example: "一二、三"
Before: 一二(pos0) 三(pos1) (wrong, positions don't match)
After: 一二(pos0) 三(pos2) (correct, matches outputUnigrams=true)

rmuir · 2026-03-14T23:03:06Z

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

github-actions bot added the module:analysis label Mar 14, 2026

github-actions bot added this to the 11.0.0 milestone Mar 14, 2026

herley-shaori wants to merge 1 commit into apache:main from herley-shaori:fix/15812-cjk-bigram-position-inconsistency
