Description
CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This causes phrase query mismatches when index-time and search-time analyzers use different outputUnigrams settings.
Steps to reproduce
Using the _analyze API (tested on ES 9.3.1 / Lucene 10.2.1):
With outputUnigrams: true:
POST /_analyze
{
"tokenizer": "standard",
"filter": [{ "type": "cjk_bigram", "output_unigrams": true }],
"text": "一二、三"
}
Result — 三 is at position 2:
一 position: 0 <SINGLE>
一二 position: 0 <DOUBLE> (positionLength: 2)
二 position: 1 <SINGLE>
三 position: 2 <SINGLE>
With outputUnigrams: false:
POST /_analyze
{
"tokenizer": "standard",
"filter": [{ "type": "cjk_bigram", "output_unigrams": false }],
"text": "一二、三"
}
Result — 三 is at position 1:
一二 position: 0 <DOUBLE>
三 position: 1 <SINGLE>
Expected behavior
三 should be at the same effective position regardless of the outputUnigrams setting. With outputUnigrams: false, 三 should be at position 2 (equivalently, the bigram 一二 should account for the two character positions it spans).
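The expectation can be written as a check over the two token streams reported above. The following is a minimal Python sketch, not part of Lucene or Elasticsearch: `position_mismatches` is a hypothetical helper, and the token lists are transcribed by hand from the _analyze results in this report.

```python
def position_mismatches(stream_a, stream_b):
    """Return {term: (pos_a, pos_b)} for terms that appear in both
    streams but at different positions.

    Assumes each term occurs at most once per stream, which holds
    for the short example text in this report.
    """
    pos_a = dict(stream_a)
    pos_b = dict(stream_b)
    return {
        term: (pos_a[term], pos_b[term])
        for term in pos_a
        if term in pos_b and pos_a[term] != pos_b[term]
    }


# (term, position) pairs copied from the _analyze output above.
with_unigrams = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]
without_unigrams = [("一二", 0), ("三", 1)]

mismatches = position_mismatches(with_unigrams, without_unigrams)
print(mismatches)  # {'三': (2, 1)}
```

Under the expected behavior, `mismatches` would be empty for this input.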
Analysis
The issue is in flushBigram(). When outputUnigrams=false, bigrams are emitted with the default positionIncrement=1 (from clearAttributes()), but a bigram conceptually spans two character positions. After a word break (e.g. punctuation 、), a subsequent lone CJK character gets a position that differs from the outputUnigrams=true case because the bigram only advanced the position counter by 1 instead of 2.
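The position arithmetic can be sketched in a few lines of Python. This is an illustration only, not Lucene code: a token's position is modeled as the running sum of position increments, starting from -1. The increments below are inferred from the positions reported by _analyze, and the last stream is a hypothetical one in which the token after the word break carries an increment of 2; it is not a proposed patch.

```python
def absolute_positions(tokens):
    """Turn (term, position_increment) pairs into (term, position) pairs."""
    position = -1  # a first token with increment 1 then lands on position 0
    result = []
    for term, increment in tokens:
        position += increment
        result.append((term, position))
    return result


# Increments inferred from the reported positions (an assumption,
# not values read out of Lucene).
unigrams_and_bigrams = [("一", 1), ("一二", 0), ("二", 1), ("三", 1)]
bigrams_only = [("一二", 1), ("三", 1)]
# Hypothetical stream: 三 skips the position that 二 would have occupied.
bigrams_with_gap = [("一二", 1), ("三", 2)]

print(absolute_positions(unigrams_and_bigrams)[-1])  # ('三', 2)
print(absolute_positions(bigrams_only)[-1])          # ('三', 1)
print(absolute_positions(bigrams_with_gap)[-1])      # ('三', 2)
```

In this model the bigram-only stream is one position short after the word break, which matches the difference observed in the reproduction.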
This manifests specifically when both of the following hold:
- A CJK segment is followed by a word break (punctuation, whitespace, non-CJK text)
- The preceding CJK segment has an even number of characters (so it is fully consumed by bigrams with no trailing unigram)
Impact
This breaks phrase search when using a combined unigram+bigram indexing strategy with bigram-only search queries, which is a common optimization pattern for CJK search. The workaround is to enable outputUnigrams on both index and search sides, at the cost of generating redundant unigrams at search time.
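The phrase mismatch can be illustrated with a simplified model of exact phrase matching. The sketch below is an assumption-laden toy, not Lucene's PhraseQuery: `phrase_matches` is a hypothetical helper that requires every query term to appear in the document at the same relative offset as in the query, with no slop. The positions are the ones reported in the reproduction steps.

```python
def phrase_matches(indexed, query):
    """Exact phrase match (no slop) over positional postings.

    indexed: dict mapping term -> set of positions in one document.
    query:   list of (term, position) pairs from the search analyzer.
    """
    first_term, first_pos = query[0]
    for start in indexed.get(first_term, set()):
        if all(start + (pos - first_pos) in indexed.get(term, set())
               for term, pos in query):
            return True
    return False


# Index side: output_unigrams true, positions as reported above.
indexed = {"一": {0}, "一二": {0}, "二": {1}, "三": {2}}

# Search side, bigram-only: 三 is expected one position after 一二.
bigram_query = [("一二", 0), ("三", 1)]
# Search side with unigrams enabled (the workaround).
unigram_query = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]

print(phrase_matches(indexed, bigram_query))   # False
print(phrase_matches(indexed, unigram_query))  # True
```

The bigram-only query looks for 三 at position 1, while the index holds it at position 2, so the phrase fails to match in this model even though the text is identical.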
Version
Confirmed on Lucene 10.2.1 (Elasticsearch 9.3.1). Also present on the Lucene main branch as of 2026-03-10.