Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled #15825

Open
herley-shaori wants to merge 1 commit into apache:main from herley-shaori:fix/15812-cjk-bigram-position-inconsistency

@herley-shaori

Summary

Fixes #15812

CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This breaks phrase queries when index-time and search-time analyzers use different outputUnigrams settings — a common optimization pattern for CJK search.
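To make the phrase-query failure concrete, here is a minimal sketch of a positional match, assuming the query side carries its analyzed positions into the phrase (Lucene's `PhraseQuery.Builder.add(Term, int)` allows this, but whether a given query parser does so is an assumption here). `PhrasePositionSketch` and `matchesPhrase` are hypothetical names for illustration and are not part of Lucene or of this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PhrasePositionSketch {
  /**
   * Returns true if there is a start position p such that every query term i
   * is indexed at p + (positions[i] - positions[0]). This models an exact
   * (slop 0) phrase match over a toy positional index.
   */
  static boolean matchesPhrase(
      Map<String, Set<Integer>> index, List<String> terms, List<Integer> positions) {
    if (terms.isEmpty()) {
      return false;
    }
    for (int start : index.getOrDefault(terms.get(0), Set.of())) {
      boolean allFound = true;
      for (int i = 1; i < terms.size(); i++) {
        int expected = start + positions.get(i) - positions.get(0);
        if (!index.getOrDefault(terms.get(i), Set.of()).contains(expected)) {
          allFound = false;
          break;
        }
      }
      if (allFound) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Index side: "一二、三" analyzed with outputUnigrams=true.
    Map<String, Set<Integer>> index =
        Map.of("一", Set.of(0), "一二", Set.of(0), "二", Set.of(1), "三", Set.of(2));
    // Query side with outputUnigrams=false: 三 at pos1 (before) vs pos2 (after).
    System.out.println(matchesPhrase(index, List.of("一二", "三"), List.of(0, 1)));
    System.out.println(matchesPhrase(index, List.of("一二", "三"), List.of(0, 2)));
  }
}
```

With the index built from the outputUnigrams=true positions, the query positions produced before the fix find no match, while the positions produced after the fix do.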

Root cause

In flushBigram(), when outputUnigrams=false, bigrams are emitted with the default positionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to the outputUnigrams=true case.

Example with input "一二、三":

outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
outputUnigrams=false: 一二(pos0) 三(pos1) ← should be pos2
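The positions above can be reproduced with a small standalone model. This is only a sketch and not the filter's code: it assumes the input has already been split into CJK segments at word breaks, treats each code point as one character position, and ignores offsets and non-CJK tokens. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramPositionModel {
  /** outputUnigrams=true: each character advances the position; a bigram shares its first character's position. */
  static List<String> withUnigrams(List<String> segments) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String segment : segments) {
      int[] cps = segment.codePoints().toArray();
      for (int i = 0; i < cps.length; i++) {
        out.add(new String(cps, i, 1) + "@" + pos);
        if (i + 1 < cps.length) {
          out.add(new String(cps, i, 2) + "@" + pos);
        }
        pos++;
      }
    }
    return out;
  }

  /** outputUnigrams=false before the fix: every emitted token advances the position by exactly 1. */
  static List<String> bigramsOnlyBeforeFix(List<String> segments) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String segment : segments) {
      int[] cps = segment.codePoints().toArray();
      if (cps.length == 1) {
        // A lone character has no neighbor to pair with, so it is emitted as a unigram.
        out.add(new String(cps, 0, 1) + "@" + pos++);
      }
      for (int i = 0; i + 1 < cps.length; i++) {
        out.add(new String(cps, i, 2) + "@" + pos++);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> segments = List.of("一二", "三"); // "一二、三" split at the punctuation
    System.out.println(withUnigrams(segments));
    System.out.println(bigramsOnlyBeforeFix(segments));
  }
}
```

In this model the second segment starts at position 2 when unigrams are emitted but at position 1 when they are not, which is the off-by-one described in the root cause.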

Fix

Following the principle suggested by @rmuir (outputUnigrams=false should behave as if unigrams were emitted, then later removed), this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.

Two new fields in CJKBigramFilter:

  • hadBigrams: set true when a bigram is flushed in no-unigram mode
  • deferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)

The deferred increment is applied in flushBigram(), flushUnigram(), and the non-CJK passthrough path in incrementToken().
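The deferral can be sketched outside of Lucene as follows. This is a simplified model under stated assumptions, not the code in this PR: input is pre-split into CJK segments, non-CJK tokens and the end-of-stream increment are not modeled, and only the names `hadBigrams` and `deferredPosInc` come from the description above. The second method is a hypothetical reference that computes positions "as if unigrams were emitted, then removed", so the two can be compared.

```java
import java.util.ArrayList;
import java.util.List;

final class DeferredIncrementSketch {
  /** Bigram-only output where each segment that emitted a bigram owes +1 to the next token. */
  static List<String> bigramsOnlyWithDeferredInc(List<String> segments) {
    List<String> out = new ArrayList<>();
    int pos = -1;           // position = running sum of position increments - 1
    int deferredPosInc = 0; // extra increment owed to the next emitted token
    for (String segment : segments) {
      int[] cps = segment.codePoints().toArray();
      if (cps.length == 0) {
        continue;
      }
      boolean hadBigrams = false;
      int width = cps.length == 1 ? 1 : 2; // a lone character stays a unigram
      for (int i = 0; i + width <= cps.length; i++) {
        pos += 1 + deferredPosInc; // apply the deferred increment, then clear it
        deferredPosInc = 0;
        out.add(new String(cps, i, width) + "@" + pos);
        hadBigrams |= width == 2;
      }
      if (hadBigrams) {
        deferredPosInc++; // segment boundary: the last character's position was never consumed
      }
    }
    return out;
  }

  /** Reference: walk every character position, but keep only bigrams and lone unigrams. */
  static List<String> unigramsEmittedThenRemoved(List<String> segments) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (String segment : segments) {
      int[] cps = segment.codePoints().toArray();
      for (int i = 0; i < cps.length; i++, pos++) {
        if (cps.length == 1) {
          out.add(new String(cps, i, 1) + "@" + pos);
        } else if (i + 1 < cps.length) {
          out.add(new String(cps, i, 2) + "@" + pos);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> segments = List.of("一二", "三");
    System.out.println(bigramsOnlyWithDeferredInc(segments));
    System.out.println(unigramsEmittedThenRemoved(segments));
  }
}
```

For the segments of "一二、三", both methods place 一二 at position 0 and 三 at position 2, matching the outputUnigrams=true positions from the example.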

Changes

  • CJKBigramFilter.java: Added position tracking logic across CJK segment boundaries
  • TestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updated testHanOnly expected positions
  • TestWithCJKBigramFilter.java (ICU): Updated expected positions in testJa2, testMix, testMix2, testReusableTokenStream, and testFinalOffset
  • CHANGES.txt: Added bug fix entry

Test plan

  • All 15 CJKBigramFilter tests pass (including 3 new tests)
  • All 12 ICU TestWithCJKBigramFilter tests pass
  • Code formatting verified via ./gradlew tidy
  • testBigramPositionsConsistentAcrossWordBreak — reproduces exact scenario from issue
  • testBigramPositionsMultipleSegments — verifies across multiple CJK segments with breaks
  • testBigramPositionsBeforeNonCJK — verifies CJK bigram followed by non-CJK text

…ams disabled (apache#15812)

When outputUnigrams=false, CJKBigramFilter produced different token
positions compared to outputUnigrams=true. A bigram spans two character
positions but only advanced the position counter by 1. After a word
break (punctuation, whitespace, or non-CJK text), subsequent tokens
were assigned incorrect positions, breaking phrase queries in combined
unigram+bigram indexing strategies.

The fix tracks whether bigrams were emitted from the current CJK
segment and defers an extra position increment (+1) to apply to the
first token after a segment boundary. This ensures outputUnigrams=false
behaves "as if unigrams were emitted then removed", keeping positions
aligned across both settings.

Example: "一二、三"
  Before: 一二(pos0) 三(pos1) — wrong, positions don't match
  After:  一二(pos0) 三(pos2) — correct, matches outputUnigrams=true
github-actions bot added this to the 11.0.0 milestone on Mar 14, 2026
@rmuir
Member

rmuir commented Mar 14, 2026

Looks great! I think CJKAnalyzer may use this filter, and now its position increments will have changed. Can you glance at the failing tests?

Development

Successfully merging this pull request may close these issues.

CJKBigramFilter produces inconsistent token positions with outputUnigrams enabled vs disabled
