feat: table aware chunking #527
Conversation
✅ DCO Check Passed. Thanks @odelliab, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit. This rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewers for test updates. This rule succeeded. When test data is updated, we require two reviewers.
Documentation Updates: 1 document was updated by changes in this PR: "How can I find page numbers and bounding box information for content in a chunk produced by the hybrid chunker, and what is the structure of the doc_items list within a chunk?"
**Why multiple doc_items per chunk?** The hybrid chunker combines consecutive document elements (doc_items) into a chunk until a token limit is reached. For example, if three paragraphs fit within the token limit, all three will be included in the chunk's `doc_items` list. This allows you to trace which original document elements contributed to each chunk.
+**Table-aware chunking:** When a table is too large to fit in a single chunk, the HybridChunker can automatically split it across multiple chunks while preserving context:
+
+- The `repeat_table_header` parameter (default: `True`) controls whether table headers are automatically repeated when tables are split
+- With header repetition enabled, each chunk containing part of the table will include the table header rows, ensuring that each chunk maintains context about what the columns represent
+- This behavior can be disabled by setting `repeat_table_header=False` when initializing the chunker
+- Table header repetition is currently supported for markdown-serialized tables
+
+**Example of HybridChunker with table header repetition:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Default behavior: table headers are repeated
+chunker = HybridChunker(max_tokens=512)
+
+# Disable header repetition if needed
+chunker = HybridChunker(max_tokens=512, repeat_table_header=False)
+```
+
**Limitations:**
- Line numbers are not available.
- DOCX files do not provide page or bounding box metadata; convert to PDF if you need this data.
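The header-repetition behavior described in the documentation above can be illustrated with a minimal, self-contained sketch. This is not the docling-core implementation; `split_markdown_table` and `max_lines` are hypothetical names, and the split is by row count rather than by tokens, purely to show how the header travels with each chunk:

```python
def split_markdown_table(table_md: str, max_lines: int, repeat_table_header: bool = True):
    """Split a markdown table into chunks of at most `max_lines` body rows,
    optionally prepending the header rows to every chunk (hypothetical sketch)."""
    lines = table_md.strip().splitlines()
    # In a markdown table, the first two lines are the column names and
    # the `---` separator; everything after that is body rows.
    header, body = lines[:2], lines[2:]
    chunks = []
    for i in range(0, len(body), max_lines):
        part = body[i:i + max_lines]
        if repeat_table_header:
            part = header + part  # each chunk keeps the column context
        chunks.append("\n".join(part))
    return chunks

table = "\n".join([
    "| name | value |",
    "| --- | --- |",
    "| a | 1 |",
    "| b | 2 |",
    "| c | 3 |",
])
chunks = split_markdown_table(table, max_lines=2)
```

With `repeat_table_header=False`, only the first chunk would contain the column names, which is the old behavior the PR makes optional.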
Thanks @odelliab for creating this PR. While we're reviewing it, please make sure all the commits are signed off.
ceberam left a comment:
Thanks @odelliab for your contribution with a feature that was requested by several users!
Overall, I was wondering if the LineBasedTokenChunker is really necessary beyond the context of this PR (Table aware chunking)? The LineBasedTokenChunker is only used for its chunk_text() method, while the original intent of the Docling chunker (i.e., to chunk a DoclingDocument object by implementing the function chunk(self, dl_doc: DLDocument, **kwargs: Any)) is never used, even though an implementation is provided.
Do you have use cases where you need to create one chunk per line?
It's just an open question to understand whether we really need a dedicated chunker to support the functionality within chunk_text().
I have also added other technical comments and suggestions.
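For readers following this discussion, the kind of `chunk_text()` behavior being debated can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the PR's LineBasedTokenChunker: the real chunker uses a tokenizer, while here token counts are approximated by whitespace-separated words:

```python
def chunk_text(text: str, max_tokens: int):
    """Pack whole lines into chunks without exceeding a token budget
    (hypothetical sketch; word count stands in for real tokenization)."""
    chunks, current, current_tokens = [], [], 0
    for line in text.splitlines():
        n = len(line.split())  # stand-in token count for this line
        if current and current_tokens + n > max_tokens:
            # Budget exceeded: emit the accumulated lines as one chunk.
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks

parts = chunk_text("one two\nthree four\nfive six seven", max_tokens=4)
```

The reviewer's point is that only this text-level packing is exercised by the PR, not the document-level `chunk(self, dl_doc, ...)` entry point.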
Co-authored-by: Cesar Berrospi Ramis <[email protected]> Signed-off-by: odelliab <[email protected]>
|
@ceberam, thanks for your review.
Move the method get_default_tokenizer to module huggingface.py to avoid circular dependencies and avoid an unconventional import in a method Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Summary
This PR adds table header duplication functionality to the HybridChunker and introduces a new LineBasedTokenChunker for improved table chunking. When tables are split across multiple chunks, the table headers are now automatically repeated in each chunk to maintain context.
Changes Made
Core Features
Implementation Details
New Files
docling_core/transforms/chunker/line_chunker.py: New line-based chunker implementation
test/test_line_chunker.py: Comprehensive test suite (308 lines) covering:
Testing
✅ Added comprehensive unit tests for LineBasedTokenChunker (15+ test cases)
✅ Added integration tests for table header duplication in test_hybrid_chunker.py
✅ Verified table headers are repeated when tables span multiple chunks
✅ Confirmed all body lines from tables appear in the chunked output
✅ Tested with real document data (2408.09869v3 with 996 texts, 5 tables, 13 pictures)
✅ Generated 84 chunks with 27 containing tables
Additional Notes
This enhancement significantly improves the usability of chunked table data by ensuring each chunk maintains the necessary context (table headers) for standalone interpretation. This is particularly valuable for:
Currently, get_header_and_body_lines() has a naive default implementation and an implementation for the markdown table serializer. We should add implementations for other serializers (e.g., HTML).
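The two strategies mentioned above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function names mirror the description, but the signatures and the separator regex are assumptions. The naive default treats every line as body; the markdown-aware variant recognizes the header row plus its `---` separator line:

```python
import re

def get_header_and_body_lines_naive(lines):
    # Naive default: no header detection; every line is body.
    return [], list(lines)

# Matches a markdown table separator row such as `| --- | :---: |`.
_MD_SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")

def get_header_and_body_lines_markdown(lines):
    # Markdown table: the header row is the line right above the separator.
    if len(lines) >= 2 and _MD_SEPARATOR.match(lines[1]):
        return list(lines[:2]), list(lines[2:])
    return [], list(lines)

rows = ["| a | b |", "| --- | --- |", "| 1 | 2 |"]
header, body = get_header_and_body_lines_markdown(rows)
```

An HTML-aware variant would instead look for `<thead>`/`<th>` markup, which is why per-serializer implementations are needed.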
The implementation is backward compatible: the repeat_table_header parameter defaults to True but can be set to False if the old behavior is desired.