CJKBigramFilter produces inconsistent token positions with outputUnigrams enabled vs disabled #15812

### Description

`CJKBigramFilter` produces different token positions for the same input depending on whether `outputUnigrams` is `true` or `false`. This causes phrase query mismatches when index-time and search-time analyzers use different `outputUnigrams` settings.

### Steps to reproduce

Using the `_analyze` API (tested on ES 9.3.1 / Lucene 10.2.1):

**With `outputUnigrams: true`:**
```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "cjk_bigram", "output_unigrams": true }],
  "text": "一二、三"
}
```

Result — `三` is at **position 2**:
```
一       position: 0  <SINGLE>
一二     position: 0  <DOUBLE> (positionLength: 2)
二       position: 1  <SINGLE>
三       position: 2  <SINGLE>
```

**With `outputUnigrams: false`:**
```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [{ "type": "cjk_bigram", "output_unigrams": false }],
  "text": "一二、三"
}
```

Result — `三` is at **position 1**:
```
一二     position: 0  <DOUBLE>
三       position: 1  <SINGLE>
```

### Expected behavior

`三` should be at the same effective position regardless of the `outputUnigrams` setting. With `outputUnigrams: false`, `三` should be at position 2 (or equivalently, the bigram `一二` should account for occupying two character positions).

### Analysis

The issue is in `flushBigram()`. When `outputUnigrams=false`, bigrams are emitted with the default `positionIncrement=1` (from `clearAttributes()`), but a bigram conceptually spans two character positions. After a word break (e.g. punctuation `、`), a subsequent lone CJK character gets a position that differs from the `outputUnigrams=true` case because the bigram only advanced the position counter by 1 instead of 2.

This specifically manifests when:
1. A CJK segment is followed by a word break (punctuation, whitespace, non-CJK text)
2. The preceding CJK segment has two or more characters, so with `outputUnigrams=false` it is emitted only as overlapping bigrams (one fewer token than it has characters) with no trailing unigram
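
The position arithmetic can be reproduced with a small stand-alone model. This is only a sketch, not the Lucene implementation: `cjk_bigram_positions` and its `segments` argument are hypothetical names, the input is assumed to be already split at word breaks, and word breaks are assumed to add no position gap (as in the `_analyze` output above).

```python
def cjk_bigram_positions(segments, output_unigrams):
    """Return (term, position) pairs for pre-split runs of CJK characters.

    Simplified model (an assumption, not Lucene code): bigrams overlap,
    and a word break between segments adds no position gap.
    """
    tokens = []
    pos = -1  # position before the first token
    for seg in segments:
        if len(seg) == 1:
            # A lone character is always emitted as a unigram.
            pos += 1
            tokens.append((seg, pos))
            continue
        for i, ch in enumerate(seg):
            has_next = i + 1 < len(seg)
            if output_unigrams:
                pos += 1
                tokens.append((ch, pos))
                if has_next:
                    # Bigram stacked on its first unigram (increment 0).
                    tokens.append((seg[i:i + 2], pos))
            elif has_next:
                # Bigram-only mode: each bigram advances by exactly 1.
                pos += 1
                tokens.append((seg[i:i + 2], pos))
    return tokens


with_unigrams = cjk_bigram_positions(["一二", "三"], output_unigrams=True)
bigrams_only = cjk_bigram_positions(["一二", "三"], output_unigrams=False)
print(with_unigrams)  # 三 ends up at position 2
print(bigrams_only)   # 三 ends up at position 1
```

In this model the bigram-only stream consumes one position fewer than the unigram stream for the multi-character segment, which is the offset seen for `三` in the reproduction.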

### Impact

This breaks phrase search when using a combined unigram+bigram indexing strategy with bigram-only search queries, which is a common optimization pattern for CJK search. The workaround is to enable `outputUnigrams` on both index and search sides, at the cost of generating redundant unigrams at search time.
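
To illustrate why the phrase query fails, the sketch below runs a simplified exact-phrase check (slop 0) over the token positions reported in the two `_analyze` outputs above. `phrase_matches` is a hypothetical helper, not a Lucene or Elasticsearch API, and real phrase queries handle stacked tokens in more elaborate ways; the model only compares relative positions.

```python
from collections import defaultdict


def phrase_matches(index_tokens, query_tokens):
    """Simplified exact-phrase check: every query term must occur in the
    index at the same relative offset it has in the query."""
    postings = defaultdict(set)
    for term, pos in index_tokens:
        postings[term].add(pos)
    first_term, first_pos = query_tokens[0]
    for start in postings[first_term]:
        if all(start + (pos - first_pos) in postings[term]
               for term, pos in query_tokens):
            return True
    return False


# Positions copied from the reproduction above.
indexed = [("一", 0), ("一二", 0), ("二", 1), ("三", 2)]  # unigrams + bigrams
bigram_query = [("一二", 0), ("三", 1)]                   # bigram-only analysis
unigram_query = indexed                                   # workaround: same analyzer

print(phrase_matches(indexed, bigram_query))   # False: 三 expected at 1, indexed at 2
print(phrase_matches(indexed, unigram_query))  # True
```

The second call corresponds to the workaround of enabling `outputUnigrams` on both sides: the positions agree, so the phrase matches.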

### Version

Confirmed on Lucene 10.2.1 (Elasticsearch 9.3.1). Also present on Lucene `main` branch as of 2026-03-10.
