Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled#15825
Open
herley-shaori wants to merge 1 commit intoapache:mainfrom
Open
Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled#15825herley-shaori wants to merge 1 commit intoapache:mainfrom
herley-shaori wants to merge 1 commit intoapache:mainfrom
Conversation
…ams disabled (apache#15812) When outputUnigrams=false, CJKBigramFilter produced different token positions compared to outputUnigrams=true. A bigram spans two character positions but only advanced the position counter by 1. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens were assigned incorrect positions, breaking phrase queries in combined unigram+bigram indexing strategies. The fix tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary. This ensures outputUnigrams=false behaves "as if unigrams were emitted then removed", keeping positions aligned across both settings. Example: "一二、三" Before: 一二(pos0) 三(pos1) — wrong, positions don't match After: 一二(pos0) 三(pos2) — correct, matches outputUnigrams=true
Member
|
Looks great! I think CJKAnalyzer may use this filter, and now it's position increments will have changed. Can you glance at the failing tests? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #15812
CJKBigramFilterproduces different token positions for the same input depending on whetheroutputUnigramsistrueorfalse. This breaks phrase queries when index-time and search-time analyzers use differentoutputUnigramssettings — a common optimization pattern for CJK search.Root cause
In
flushBigram(), whenoutputUnigrams=false, bigrams are emitted with the defaultpositionIncrement=1, but a bigram conceptually spans two character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to theoutputUnigrams=truecase.Example with input
"一二、三":Fix
Following the principle suggested by @rmuir —
outputUnigrams=falseshould behave as if unigrams were emitted, then later removed — this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (+1) to apply to the first token after a segment boundary.Two new fields in
CJKBigramFilter:hadBigrams: settruewhen a bigram is flushed in no-unigram modedeferredPosInc: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)The deferred increment is applied in
flushBigram(),flushUnigram(), and the non-CJK passthrough path inincrementToken().Changes
CJKBigramFilter.java: Added position tracking logic across CJK segment boundariesTestCJKBigramFilter.java: Added 3 new test cases reproducing the bug; updatedtestHanOnlyexpected positionsTestWithCJKBigramFilter.java(ICU): Updated expected positions intestJa2,testMix,testMix2,testReusableTokenStream, andtestFinalOffsetCHANGES.txt: Added bug fix entryTest plan
./gradlew tidytestBigramPositionsConsistentAcrossWordBreak— reproduces exact scenario from issuetestBigramPositionsMultipleSegments— verifies across multiple CJK segments with breakstestBigramPositionsBeforeNonCJK— verifies CJK bigram followed by non-CJK text