## Bug Description
`LineBasedTokenChunker.chunk_text()` enters an infinite loop when processing text that contains a long unbreakable token sequence preceded by a space. The process consumes 100% CPU indefinitely.
## Root Cause
In `split_by_token_limit()` (`line_chunker.py`), the `prefer_word_boundary` logic can snap `best_idx` back to 0, producing an empty head and returning the tail unchanged:

```python
if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index >= 0:
        best_idx = last_space_index  # ← can snap to 0
```

When the text is e.g. `" aaaa...200×a...aaa"` (a leading space followed by an unbreakable blob):
- Binary search correctly finds `best_idx ≈ 120` (fits within the token limit)
- `rfind(" ")` finds the space at index 0
- `best_idx` becomes 0 → `head = ""`, `tail = text` (unchanged)
- Back in `chunk_text`, `take` is empty, `remaining` is unchanged → infinite loop at line 118
The guard at line 208 (`if best_idx is None or best_idx <= 0: return "", text`) then fires, which feeds back into the `while True` loop in `chunk_text` with no progress.
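The failure mode can be demonstrated in isolation. The sketch below is a hypothetical, simplified model of the snap-back step (`snap_to_word_boundary` is an illustrative name, not the actual docling-core function): given a `best_idx` from the binary search, it snaps back to the last space, which for a leading-space input is index 0.

```python
def snap_to_word_boundary(text: str, best_idx: int) -> tuple[str, str]:
    """Split text at best_idx, preferring the last space before it
    (simplified model of the buggy logic, not the real implementation)."""
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index >= 0:      # buggy: accepts index 0
        best_idx = last_space_index
    return text[:best_idx], text[best_idx:]

text = " " + "a" * 200             # leading space, then an unbreakable blob
head, tail = snap_to_word_boundary(text, 120)
print(repr(head))                  # '' — empty head
print(tail == text)                # True — caller sees zero progress
```

Because `head` is empty and `tail` equals the input, the caller's loop invokes the split again with identical arguments, forever.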
## Reproduction
```python
from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer.from_pretrained('openai-community/gpt2', max_tokens=30)
chunker = LineBasedTokenChunker(tokenizer=tokenizer)

# This hangs forever at 100% CPU
long_word = 'a' * 200
result = chunker.chunk_text(lines=['Header ' + long_word + ' Footer\n'])
```

## Suggested Fix
The word-boundary snap-back should not produce an empty or zero-progress head. For example:
```python
if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index > 0:  # changed from >= 0 to > 0
        best_idx = last_space_index
```

Or, more robustly, only apply the snap-back if the resulting head is non-empty and makes meaningful progress.
Additionally, `chunk_text` should detect zero-progress iterations (where `take` is empty) and force character-level splitting as a fallback, to prevent infinite loops regardless of `split_by_token_limit` behavior.
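Both defenses can be sketched together. This is an illustrative standalone function (`split_with_progress` is an assumed name, not the actual docling-core API): it rejects a snap-back to index 0 and, as a last resort, forces a character-level split so the caller always makes progress.

```python
def split_with_progress(text: str, best_idx: int,
                        prefer_word_boundary: bool = True) -> tuple[str, str]:
    """Split text at best_idx, guaranteeing a non-empty head
    (sketch of the suggested fix, not the real implementation)."""
    if prefer_word_boundary:
        last_space_index = text[:best_idx].rfind(" ")
        if last_space_index > 0:   # changed from >= 0: never snap to index 0
            best_idx = last_space_index
    if best_idx <= 0:              # zero-progress guard: force a char-level split
        best_idx = max(1, len(text) // 2)
    return text[:best_idx], text[best_idx:]

head, tail = split_with_progress(" " + "a" * 200, 120)
assert head and head + tail == " " + "a" * 200   # non-empty head, no data lost
```

With this version, the pathological input from the reproduction splits at the binary-search index instead of snapping to 0, so the loop in `chunk_text` terminates.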
## Environment
- docling-core version: 2.67.0
- Python: 3.13 (Apple Silicon / macOS)
- Tokenizer: HuggingFace GPT-2
## Impact
In production, this bug caused two zombie Python processes consuming 200% CPU for 8+ hours, starving other pipeline processes of compute resources.