
feat: table aware chunking#527

Merged
PeterStaar-IBM merged 19 commits into docling-project:main from odelliab:table_aware_chunking
Mar 4, 2026

Conversation

@odelliab
Contributor

Summary
This PR adds table header duplication functionality to the HybridChunker and introduces a new LineBasedTokenChunker for improved table chunking. When tables are split across multiple chunks, the table headers are now automatically repeated in each chunk to maintain context.

Changes Made
Core Features

  1. Table Header Duplication: Added a duplicate_table_header parameter to HybridChunker (default: True)
     • When enabled, table headers are repeated in each chunk when a table is split
     • Uses the new LineBasedTokenChunker for table-specific chunking logic
  2. New LineBasedTokenChunker: Created a specialized chunker for line-based text splitting
     • Supports prefix text that is prepended to each chunk (e.g., table headers)
     • Handles token-aware splitting with word-boundary preservation
     • Includes comprehensive validation and error handling
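
The core idea (pack lines into chunks under a token budget, prepending a repeated prefix such as a table header to each chunk) can be sketched roughly as follows. This is a minimal illustration, not the actual LineBasedTokenChunker API: the function names and the whitespace-based token count are assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def chunk_lines(lines, max_tokens, prefix=""):
    """Pack lines into chunks under max_tokens; each chunk starts with `prefix`."""
    budget = max_tokens - count_tokens(prefix)
    chunks, current, used = [], [], 0
    for line in lines:
        n = count_tokens(line)
        # Flush the current chunk when the next line would exceed the budget.
        if current and used + n > budget:
            chunks.append("\n".join(([prefix] if prefix else []) + current))
            current, used = [], 0
        current.append(line)
        used += n
    if current:
        chunks.append("\n".join(([prefix] if prefix else []) + current))
    return chunks
```

Note that the real chunker additionally splits over-long individual lines at word boundaries, which this sketch omits.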

Implementation Details

  • Modified HybridChunker._split_using_plain_text() to accept a doc_serializer parameter (as header identification depends on the serializer)
  • Added new segment() method to handle table vs. non-table content differently
  • Integrated get_header_and_body_lines() from table serializers to extract headers

New Files
docling_core/transforms/chunker/line_chunker.py: New line-based chunker implementation

test/test_line_chunker.py: Comprehensive test suite (308 lines) covering:

  • Prefix handling
  • Long line splitting
  • Token limit enforcement
  • Word boundary preservation
  • Document chunking
  • Edge cases (empty documents, long content, etc.)

Testing
✅ Added comprehensive unit tests for LineBasedTokenChunker (15+ test cases)
✅ Added integration tests for table header duplication in test_hybrid_chunker.py
✅ Verified table headers are repeated when tables span multiple chunks
✅ Confirmed all body lines from tables appear in the chunked output
✅ Tested with real document data (2408.09869v3 with 996 texts, 5 tables, 13 pictures)
✅ Generated 84 chunks with 27 containing tables

Additional Notes
This enhancement significantly improves the usability of chunked table data by ensuring each chunk maintains the necessary context (table headers) for standalone interpretation. This is particularly valuable for:

  • RAG (Retrieval-Augmented Generation) applications
  • Embedding generation where each chunk needs to be self-contained
  • Search and retrieval systems that work with individual chunks

Currently, get_header_and_body_lines() has a naive default implementation and an implementation for the markdown table serializer. Implementations for other serializers (e.g., HTML) should be added.
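
For the markdown case, the header/body split can be illustrated with a standalone sketch (a hypothetical free function; the real get_header_and_body_lines() is a method on the serializer classes): a markdown table's header is its first row plus the |---| delimiter row, and everything after that is body.

```python
def get_header_and_body_lines(md_table: str):
    """Split a markdown table into (header_lines, body_lines). Illustrative only."""
    lines = [ln for ln in md_table.splitlines() if ln.strip()]
    # The second row of a markdown table is the delimiter row, made of |, -, :, spaces.
    if len(lines) >= 2 and set(lines[1].replace("|", "").strip()) <= set("-: "):
        return lines[:2], lines[2:]
    # Naive fallback: no recognizable header row, so treat everything as body.
    return [], lines
```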

The implementation is backward compatible: the duplicate_table_header parameter defaults to True but can be set to False to restore the old behavior.

@github-actions
Contributor

github-actions bot commented Feb 25, 2026

DCO Check Passed

Thanks @odelliab, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Feb 25, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 5cc61d9
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 91b43f9
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 5d17bda
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: a50392e
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: e589429
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 30c72a9
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 510e949
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 6c3a8f7
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 0642a07

Signed-off-by: odelliab <[email protected]>
@dosubot

dosubot bot commented Feb 25, 2026

Documentation Updates

1 document(s) were updated by changes in this PR:

How can I find page numbers and bounding box information for content in a chunk produced by the hybrid chunker, and what is the structure of the doc_items list within a chunk?
@@ -15,6 +15,24 @@
 
 **Why multiple doc_items per chunk?** The hybrid chunker combines consecutive document elements (doc_items) into a chunk until a token limit is reached. For example, if three paragraphs fit within the token limit, all three will be included in the chunk's `doc_items` list. This allows you to trace which original document elements contributed to each chunk.
 
+**Table-aware chunking:** When a table is too large to fit in a single chunk, the HybridChunker can automatically split it across multiple chunks while preserving context:
+
+- The `repeat_table_header` parameter (default: `True`) controls whether table headers are automatically repeated when tables are split
+- With header repetition enabled, each chunk containing part of the table will include the table header rows, ensuring that each chunk maintains context about what the columns represent
+- This behavior can be disabled by setting `repeat_table_header=False` when initializing the chunker
+- Table header repetition is currently supported for markdown-serialized tables
+
+**Example of HybridChunker with table header repetition:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Default behavior: table headers are repeated
+chunker = HybridChunker(max_tokens=512)
+
+# Disable header repetition if needed
+chunker = HybridChunker(max_tokens=512, repeat_table_header=False)
+```
+
 **Limitations:**
 - Line numbers are not available.
 - DOCX files do not provide page or bounding box metadata; convert to PDF if you need this data.


@odelliab odelliab changed the title Table aware chunking feat: Table aware chunking Feb 25, 2026
@odelliab odelliab changed the title feat: Table aware chunking feat:\ Table aware chunking Feb 25, 2026
@codecov

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 95.90164% with 5 lines in your changes missing coverage. Please review.

Files with missing lines:
  • docling_core/transforms/serializer/base.py: patch 25.00%, 3 lines missing ⚠️
  • docling_core/transforms/chunker/line_chunker.py: patch 98.87%, 1 line missing ⚠️
  • docling_core/transforms/serializer/markdown.py: patch 87.50%, 1 line missing ⚠️


@ceberam
Member

ceberam commented Feb 26, 2026

Thanks @odelliab for creating this PR. While we're reviewing it, please make sure all the commits are signed off (use `git commit -s` when creating commits). You can check the DCO notes for remediation options:
https://github.com/docling-project/docling-core/pull/527/checks?check_run_id=65006046327

@ceberam ceberam self-requested a review February 26, 2026 16:14
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 9b9ef09
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: b3699e3
I, odelliab <[email protected]>, hereby add my Signed-off-by to this commit: 9d393df

Signed-off-by: odelliab <[email protected]>

@ceberam ceberam left a comment


Thanks @odelliab for your contribution with a feature that was requested by several users!

Overall, I was wondering if the LineBasedTokenChunker is really necessary beyond the context of this PR (Table aware chunking)? The LineBasedTokenChunker is only used for its chunk_text() method, while the original intent of the Docling chunker (i.e., to chunk a DoclingDocument object by implementing the function chunk(self, dl_doc: DLDocument, **kwargs: Any)) is never used, even though an implementation is provided.
Do you have use cases where you need to create one chunk per line?
It's just an open question to understand if we really need to create a chunker to support the functionality within chunk_text().

I have also added other technical comments and suggestions.

@ceberam ceberam changed the title feat:\ Table aware chunking feat: table aware chunking Mar 3, 2026
odelliab and others added 4 commits March 3, 2026 21:24
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: odelliab <[email protected]>
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: odelliab <[email protected]>
Signed-off-by: odelliab <[email protected]>
Signed-off-by: odelliab <[email protected]>
@odelliab
Contributor Author

odelliab commented Mar 3, 2026

@ceberam , thanks for your review.
I fixed the PR according to your suggestions.
Regarding the general comment with respect to LineBasedTokenChunker: I believe it has merit beyond its usage for table items. We, as well as clients, have used it for several benchmarks. In a way, it is similar to a page chunker, but it respects a given token limit.

Move the method get_default_tokenizer to module huggingface.py to avoid circular dependencies
and avoid an unconventional import in a method

Signed-off-by: Cesar Berrospi Ramis <[email protected]>

@ceberam ceberam left a comment


🏆
@odelliab FYI, I've just pushed an extra commit to refactor the default tokenizer and avoid the import within the function (I gave you a wrong hint in my previous message).

@ceberam ceberam requested review from dolfim-ibm and vagenas March 4, 2026 10:28