CJKBigramFilter produces inconsistent token positions with outputUnigrams enabled vs disabled #15812

@amatoba

Description

CJKBigramFilter produces different token positions for the same input depending on whether outputUnigrams is true or false. This causes phrase query mismatches when index-time and search-time analyzers use different outputUnigrams settings.

Steps to reproduce

Using the _analyze API (tested on ES 9.3.1 / Lucene 10.2.1):

With outputUnigrams: true:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "cjk_bigram", "output_unigrams": true }],
  "text": "一二、三"
}

Result (三 is at position 2):

一       position: 0  <SINGLE>
一二     position: 0  <DOUBLE> (positionLength: 2)
二       position: 1  <SINGLE>
三       position: 2  <SINGLE>

With outputUnigrams: false:

POST /_analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "cjk_bigram", "output_unigrams": false }],
  "text": "一二、三"
}

Result (三 is at position 1):

一二     position: 0  <DOUBLE>
三       position: 1  <SINGLE>
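
For readers less familiar with how positions are reported: a token's position is the running sum of the positionIncrement values in the stream, starting from -1. The Python sketch below replays the two results above under that rule. The increments are inferred from the reported positions rather than read from the filter's source, and `positions_from_increments` is a hypothetical helper name.

```python
def positions_from_increments(tokens):
    """Turn (term, position_increment) pairs into (term, position) pairs."""
    position = -1  # a first token with increment 1 lands on position 0
    out = []
    for term, increment in tokens:
        position += increment
        out.append((term, position))
    return out

# Increments inferred from the _analyze output above (an assumption).
with_unigrams = [("一", 1), ("一二", 0), ("二", 1), ("三", 1)]
bigrams_only = [("一二", 1), ("三", 1)]

print(positions_from_increments(with_unigrams))  # 三 ends up at position 2
print(positions_from_increments(bigrams_only))   # 三 ends up at position 1
```

Under this reading, the only way for 三 to reach position 2 in the bigram-only stream is for some token before or at 三 to carry an increment larger than 1.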

Expected behavior

三 should be at the same effective position regardless of the outputUnigrams setting. With outputUnigrams: false, 三 should be at position 2 (or equivalently, the bigram 一二 should account for occupying two character positions).
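
This expectation can be phrased as a check: every term that both analyzers emit should land on the same position. The following is a minimal sketch of such a check in Python, fed with the positions copied from the two results in this report. `position_mismatches` is a hypothetical helper, and it assumes each term occurs at most once in the input.

```python
def position_mismatches(tokens_a, tokens_b):
    """Return {term: (pos_a, pos_b)} for shared terms whose positions differ.

    Assumes every term appears at most once per token list.
    """
    pos_a = dict(tokens_a)
    pos_b = dict(tokens_b)
    return {
        term: (pos_a[term], pos_b[term])
        for term in pos_a
        if term in pos_b and pos_a[term] != pos_b[term]
    }

# Positions copied from the two _analyze results in this report.
with_unigrams = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]
bigrams_only = [("一二", 0), ("三", 1)]

print(position_mismatches(with_unigrams, bigrams_only))  # {'三': (2, 1)}
```

An empty result would mean the two settings agree on every shared term.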

Analysis

The issue is in flushBigram(). When outputUnigrams=false, bigrams are emitted with the default positionIncrement=1 (from clearAttributes()), but a bigram conceptually spans two character positions. After a word break (e.g. the punctuation 、), a subsequent lone CJK character gets a position that differs from the outputUnigrams=true case because the bigram only advanced the position counter by 1 instead of 2.

This specifically manifests when:

  1. A CJK segment is followed by a word break (punctuation, whitespace, non-CJK text)
  2. The preceding CJK segment has an even number of characters (so it is fully consumed by bigrams with no trailing unigram)
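
The position bookkeeping described above can be sketched with a simplified Python model. This is not the filter's Java code: `emit` is a hypothetical function, it takes CJK runs that are already split at word breaks, it assumes a word break adds no position gap (consistent with the output reported above), and it assumes bigrams overlap within a run.

```python
def emit(segments, output_unigrams):
    """Model the token stream for CJK runs already split at word breaks.

    Returns (term, position) pairs. Every token advances the position by 1,
    except a bigram emitted right after its leading unigram (increment 0).
    """
    position = -1
    tokens = []
    for run in segments:
        for i, ch in enumerate(run):
            has_bigram = i + 1 < len(run)
            if output_unigrams or len(run) == 1:
                position += 1
                tokens.append((ch, position))
                if has_bigram:
                    # Bigram shares the position of its first character.
                    tokens.append((run[i:i + 2], position))
            elif has_bigram:
                position += 1
                tokens.append((run[i:i + 2], position))
    return tokens

segments = ["一二", "三"]  # "一二、三" split at the punctuation
print(emit(segments, output_unigrams=True))
print(emit(segments, output_unigrams=False))
```

In this model the run 一二 occupies positions 0 and 1 when unigrams are emitted but only position 0 when they are not, so 三 starts at 2 in one case and at 1 in the other.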

Impact

This breaks phrase search when using a combined unigram+bigram indexing strategy with bigram-only search queries, which is a common optimization pattern for CJK search. The workaround is to enable outputUnigrams on both index and search sides, at the cost of generating redundant unigrams at search time.
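
To illustrate why the phrase match fails, here is a minimal Python sketch of an exact (slop 0) phrase match over (term, position) pairs. It is a simplified model: `phrase_matches` is a hypothetical helper, it ignores positionLength, and it does not reproduce how Elasticsearch or Lucene actually build phrase queries from a token stream.

```python
def phrase_matches(index_tokens, query_tokens):
    """Exact phrase match: all query terms must keep their relative positions.

    Both arguments are non-empty lists of (term, position) pairs.
    """
    postings = {}
    for term, pos in index_tokens:
        postings.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    for start in postings.get(first_term, ()):
        offset = start - first_pos
        if all(pos + offset in postings.get(term, ()) for term, pos in query_tokens):
            return True
    return False

# Index side: unigrams + bigrams. Positions copied from this report.
indexed = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]

# Search side: bigram-only versus the workaround (unigrams on both sides).
query_bigrams_only = [("一二", 0), ("三", 1)]
query_with_unigrams = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]

print(phrase_matches(indexed, query_bigrams_only))   # False
print(phrase_matches(indexed, query_with_unigrams))  # True
```

The bigram-only query asks for 三 one position after 一二, while the index stores it two positions after, which is the mismatch described above.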

Version

Confirmed on Lucene 10.2.1 (Elasticsearch 9.3.1). Also present on Lucene main branch as of 2026-03-10.
