## Bug Description
`LineBasedTokenChunker.chunk_text()` enters an infinite loop when processing text that contains a long unbreakable token sequence preceded by a space. The process consumes 100% CPU indefinitely.
## Root Cause
In `split_by_token_limit()` (`line_chunker.py`), the `prefer_word_boundary` logic can snap `best_idx` back to 0, producing an empty head and returning the tail unchanged:

```python
if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index >= 0:
        best_idx = last_space_index  # ← can snap to 0
```

When the text is e.g. `" aaaa...200×a...aaa"` (a leading space followed by an unbreakable blob):
- Binary search correctly finds `best_idx ≈ 120` (fits within the token limit)
- `rfind(" ")` finds the space at index 0
- `best_idx` becomes 0 → `head = ""`, `tail = text` (unchanged)
- Back in `chunk_text`, `take` is empty, `remaining` is unchanged → infinite loop at line 118
The guard at line 208 (`if best_idx is None or best_idx <= 0: return "", text`) then fires, which feeds back into the `while True` loop in `chunk_text` with no progress.
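The failure mode can be demonstrated in isolation. The sketch below is a hypothetical, simplified model of the snap-back step (`snap_to_word_boundary` is an illustrative name, not the actual docling-core function): given a `best_idx` from the binary search, it snaps back to the last space, which for a leading-space input is index 0.

```python
def snap_to_word_boundary(text: str, best_idx: int) -> tuple[str, str]:
    """Split text at best_idx, preferring the last space before it
    (simplified model of the buggy logic, not the real implementation)."""
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index >= 0:      # buggy: accepts index 0
        best_idx = last_space_index
    return text[:best_idx], text[best_idx:]

text = " " + "a" * 200             # leading space, then an unbreakable blob
head, tail = snap_to_word_boundary(text, 120)
print(repr(head))                  # '' — empty head
print(tail == text)                # True — caller sees zero progress
```

Because `head` is empty and `tail` equals the input, the caller's loop invokes the split again with identical arguments, forever.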
## Reproduction
```python
from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer.from_pretrained('openai-community/gpt2', max_tokens=30)
chunker = LineBasedTokenChunker(tokenizer=tokenizer)

# This hangs forever at 100% CPU
long_word = 'a' * 200
result = chunker.chunk_text(lines=['Header ' + long_word + ' Footer\n'])
```

## Suggested Fix
The word-boundary snap-back should not produce an empty or zero-progress head. For example:
```python
if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index > 0:  # changed from >= 0 to > 0
        best_idx = last_space_index
```

Or, more robustly, only apply the snap-back if the resulting head is non-empty and makes meaningful progress.
Additionally, `chunk_text` should detect zero-progress iterations (where `take` is empty) and force character-level splitting as a fallback, to prevent infinite loops regardless of `split_by_token_limit` behavior.
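Both defenses can be sketched together. This is an illustrative standalone function (`split_with_progress` is an assumed name, not the actual docling-core API): it rejects a snap-back to index 0 and, as a last resort, forces a character-level split so the caller always makes progress.

```python
def split_with_progress(text: str, best_idx: int,
                        prefer_word_boundary: bool = True) -> tuple[str, str]:
    """Split text at best_idx, guaranteeing a non-empty head
    (sketch of the suggested fix, not the real implementation)."""
    if prefer_word_boundary:
        last_space_index = text[:best_idx].rfind(" ")
        if last_space_index > 0:   # changed from >= 0: never snap to index 0
            best_idx = last_space_index
    if best_idx <= 0:              # zero-progress guard: force a char-level split
        best_idx = max(1, len(text) // 2)
    return text[:best_idx], text[best_idx:]

head, tail = split_with_progress(" " + "a" * 200, 120)
assert head and head + tail == " " + "a" * 200   # non-empty head, no data lost
```

With this version, the pathological input from the reproduction splits at the binary-search index instead of snapping to 0, so the loop in `chunk_text` terminates.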
## Environment
- docling-core version: 2.67.0
- Python: 3.13 (Apple Silicon / macOS)
- Tokenizer: HuggingFace GPT-2
## Impact
In production, this bug caused two zombie Python processes consuming 200% CPU for 8+ hours, starving other pipeline processes of compute resources.