LineBasedTokenChunker.chunk_text infinite loop when split_by_token_limit snaps to word boundary producing empty head #531

@Steve-Allison

Description

Bug Description

LineBasedTokenChunker.chunk_text() enters an infinite loop when processing text containing a long unbreakable token sequence preceded by a space. The process consumes 100% CPU indefinitely.

Root Cause

In split_by_token_limit() (line_chunker.py), the prefer_word_boundary logic can snap best_idx back to 0, producing an empty head and returning the tail unchanged:

if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index >= 0:
        best_idx = last_space_index  # ← can snap to 0

When the text is, e.g., a leading space followed by an unbreakable 200-character blob (" aaa…aaa"):

  1. Binary search correctly finds best_idx ≈ 120 (fits within token limit)
  2. rfind(" ") finds the space at index 0
  3. best_idx becomes 0 → head = "", tail = text (unchanged)
  4. Back in chunk_text, take is empty, remaining is unchanged → infinite loop at line 118

The guard at line 208 (if best_idx is None or best_idx <= 0: return "", text) then fires, returning an empty head and the unmodified tail, which feeds straight back into the while True loop in chunk_text with no progress.
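The zero-progress cycle can be reproduced in isolation with a toy sketch (illustrative names and a character-count stand-in for the tokenizer, not docling-core's actual internals):

```python
# Toy reproduction of the snap-to-zero behaviour (illustrative names,
# not docling-core's actual implementation).

def toy_split(text: str, max_tokens: int, prefer_word_boundary: bool = True):
    """Split text into (head, tail), where head should fit in max_tokens."""
    best_idx = min(max_tokens, len(text))  # stand-in for the binary search
    if prefer_word_boundary:
        last_space_index = text[:best_idx].rfind(" ")
        if last_space_index >= 0:          # bug: >= 0 allows snapping to 0
            best_idx = last_space_index
    if best_idx <= 0:                      # guard returns an empty head
        return "", text
    return text[:best_idx], text[best_idx:]

text = " " + "a" * 200                     # leading space, then unbreakable blob
head, tail = toy_split(text, max_tokens=120)
# head is '' and tail is the full input, so a caller that loops on the
# tail until it is empty never makes progress.
print(repr(head), len(tail))
```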

Reproduction

from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer.from_pretrained('openai-community/gpt2', max_tokens=30)
chunker = LineBasedTokenChunker(tokenizer=tokenizer)

# This hangs forever at 100% CPU
long_word = 'a' * 200
result = chunker.chunk_text(lines=['Header ' + long_word + ' Footer\n'])

Suggested Fix

The word-boundary snap-back should not produce an empty or zero-progress head. For example:

if prefer_word_boundary:
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index > 0:  # changed from >= 0 to > 0
        best_idx = last_space_index

Or more robustly, only apply the snap-back if the resulting head is non-empty and makes meaningful progress.
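A standalone sketch of the corrected snap-back (toy stand-ins for the library internals; a character-count limit replaces the real tokenizer):

```python
# Sketch of the fixed snap-back logic (toy stand-ins, not docling-core's code).

def split_with_boundary(text: str, max_tokens: int) -> tuple:
    best_idx = min(max_tokens, len(text))  # stand-in for the binary search
    last_space_index = text[:best_idx].rfind(" ")
    if last_space_index > 0:               # > 0: never snap back to an empty head
        best_idx = last_space_index
    return text[:best_idx], text[best_idx:]

# The pathological input now makes progress instead of looping: the space at
# index 0 is ignored, so the head keeps the 120 characters the search found.
head, tail = split_with_boundary(" " + "a" * 200, max_tokens=120)
assert head and len(tail) < 201
```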

Additionally, chunk_text should detect zero-progress iterations (where take is empty) and force character-level splitting as a fallback, to prevent infinite loops regardless of split_by_token_limit behavior.
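The caller-side fallback could look like the following hypothetical sketch, where split_fn stands in for split_by_token_limit and a character count stands in for the tokenizer:

```python
# Hypothetical zero-progress guard for the chunking loop
# (sketch only; split_fn stands in for split_by_token_limit).

def chunk_with_fallback(text: str, max_tokens: int, split_fn) -> list:
    chunks = []
    remaining = text
    while remaining:
        take, rest = split_fn(remaining, max_tokens)
        if not take:                       # zero progress: force a character-level cut
            take, rest = remaining[:max_tokens], remaining[max_tokens:]
        chunks.append(take)
        remaining = rest
    return chunks

# Even a split_fn that always returns an empty head now terminates,
# because every iteration is guaranteed to consume at least one character.
broken = lambda text, n: ("", text)
print(chunk_with_fallback(" " + "a" * 200, 30, broken))
```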

Environment

  • docling-core version: 2.67.0
  • Python: 3.13 (Apple Silicon / macOS)
  • Tokenizer: HuggingFace GPT-2

Impact

In production, this bug caused two zombie Python processes consuming 200% CPU for 8+ hours, starving other pipeline processes of compute resources.
