fix(html): fix broken document tree and quadratic complexity in rich table cells#3025
fix(html): fix broken document tree and quadratic complexity in rich table cells#3025ivan-liminary wants to merge 5 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @ivan-liminary, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
e31164d to
54a7f97
Compare
|
Thanks @ivan-liminary for spotting those bugs and tackling the issue of the low performance in parsing HTML rich tables. |
|
Thanks @ceberam for looking into this — I pushed a fix to address the failing regression test in this PR. Note: this PR depends on the companion core fix, so ideal merge order is: |
31e806e to
c8d97f2
Compare
@ivan-liminary Your PR docling-project/docling-core#525 on docling-core has been merged and the fix is available with the new release 2.67.1. |
c8d97f2 to
1cf4bd3
Compare
|
Thanks @ceberam! Everything is updated and ready for review — rebased on latest main, bumped docling-core to 2.67.1 (with #525 merged), regenerated fixtures, and trimmed test docstrings per your feedback on the other PR. FYI you can reproduce the original bug with |
ceberam
left a comment
There was a problem hiding this comment.
@ivan-liminary Could you please check the tests/test_backend_msword.py test?
I think that pinning the latest docling-core requires updating the ground truth of other backend processors that verify the markdown serialization.
…h table cells
Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:
Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:
- group_cell_elements() reparented the text items inside the InlineGroup
(because their individual refs WERE in added_refs), moving them from
body → outer_group_element.
- The InlineGroup itself remained in body.children still pointing to
those same text items as its .children.
- Result: two nodes (InlineGroup and outer_group_element) claimed the
same child items, with contradictory .parent pointers. This broken
tree caused double-serialization of text items in export_to_markdown().
Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.
Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:
for img_tag in tag("img"):
im_ref2 = self._emit_image(tag, doc)
Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).
Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").
Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.
Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.
Signed-off-by: Ivan Traus <[email protected]>
Signed-off-by: Ivan Traus <[email protected]>
The Bug 3 fix (adding th/td to trailing-space tags in get_text()) affects the XBRL backend which internally uses HTMLDocumentBackend. Regenerate the mlac-20251231 fixture to match the corrected text extraction. Signed-off-by: Ivan Traus <[email protected]>
…m tests Update uv.lock to pull in the merged nested-table flattening fix (docling-core#525). Regenerate markdown fixtures that now show flattened text instead of invalid embedded table syntax. Trim verbose test docstrings and remove narrating comments. Signed-off-by: Ivan Traus <[email protected]> Signed-off-by: Ivan Traus <[email protected]>
Add Generator[RefItem | None, None, None] return type and Google-style Yields section to _use_inline_group. Regenerate docx ground truth fixtures affected by docling-core 2.67.1 nested-table flattening. Signed-off-by: Ivan Traus <[email protected]>
1cf4bd3 to
2f6a320
Compare
|
thanks @ceberam, I addressed both comments: added return type annotation and Google-style Yields: section to _use_inline_group, and regenerated all docx ground truth fixtures for docling-core 2.67.1. All tests pass locally (HTML, XBRL, msword). |
Fixes #3024
Summary
Three bugs in
html_backend.py's handling of<td>/<th>elements that contain block-level content (nested tables, lists, paragraphs):Bug 1 — Orphaned
InlineGroups in table cell content_build_table_cellopened anInlineGroupfor every cell but only closed it when returning aRichTableCell. When a plainTableCellwas returned instead, the openInlineGroupwas abandoned in the document body, causing orphaned items between tables.Fix: always close the
InlineGroupbefore the return-path decision.Bug 2 — Quadratic
PictureItemconstruction_build_table_cellcalled_get_direct_picture_items()once per cell to check for pictures, but that helper iterated the entire document item list — making the HTML backend O(cells × document_items). On large pages this caused a significant slowdown.Fix: replace the helper call with a direct
isinstancecheck on the locally constructed cell items — O(items_in_cell), no document scan.Bug 3 — Missing space separator in
get_text()for nested table cells_extract_text_recursivelydid not append a trailing space after<th>/<td>elements, so sibling-cell text was concatenated without any separator (e.g."cell Acell B"instead of"cell A cell B").Fix: add
"th"and"td"to the tag set that appends a trailing space (alongside"p"and"li", which already did so).