fix(html): fix broken document tree and quadratic complexity in rich table cells by ivan-liminary · Pull Request #3025 · docling-project/docling

ivan-liminary · 2026-02-23T17:52:05Z

Summary

Three bugs in html_backend.py's handling of <td>/<th> elements that contain block-level content (nested tables, lists, paragraphs):

Bug 1 — Orphaned InlineGroups in table cell content
_build_table_cell opened an InlineGroup for every cell but only closed it when returning a RichTableCell. When a plain TableCell was returned instead, the open InlineGroup was abandoned in the document body, causing orphaned items between tables.

Fix: always close the InlineGroup before the return-path decision.

Bug 2 — Quadratic PictureItem construction
_build_table_cell called _get_direct_picture_items() once per cell to check for pictures, but that helper iterated the entire document item list — making the HTML backend O(cells × document_items). On large pages this caused a significant slowdown.

Fix: replace the helper call with a direct isinstance check on the locally constructed cell items — O(items_in_cell), no document scan.
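
To make the complexity difference concrete, here is a minimal sketch rather than the actual docling code. `TextItem` and `PictureItem` are hypothetical stand-in dataclasses, and both function names are invented for illustration. The first function has the old shape (one pass over every document item per cell); the second inspects only the items built for the cell.

```python
from dataclasses import dataclass


@dataclass
class TextItem:  # hypothetical stand-in, not the docling-core class
    text: str


@dataclass
class PictureItem:  # hypothetical stand-in, not the docling-core class
    uri: str


def has_picture_by_document_scan(cell_items, document_items):
    """Old shape: walk the whole document once for every cell."""
    cell_ids = {id(item) for item in cell_items}
    return any(
        isinstance(item, PictureItem) and id(item) in cell_ids
        for item in document_items
    )


def has_picture_local(cell_items):
    """New shape: look only at the items constructed for this cell."""
    return any(isinstance(item, PictureItem) for item in cell_items)


cells = [
    [TextItem("a")],
    [TextItem("b"), PictureItem("img.png")],
    [TextItem("c")],
]
document_items = [item for cell in cells for item in cell]

old = [has_picture_by_document_scan(cell, document_items) for cell in cells]
new = [has_picture_local(cell) for cell in cells]
```

Both lists agree, but the first approach costs one document pass per cell, while the second is bounded by the size of each cell.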

Bug 3 — Missing space separator in get_text() for nested table cells
_extract_text_recursively did not append a trailing space after <th>/<td> elements, so sibling-cell text was concatenated without any separator (e.g. "cell Acell B" instead of "cell A cell B").

Fix: add "th" and "td" to the tag set that appends a trailing space (alongside "p" and "li", which already did so).
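
The effect of the tag set can be illustrated with the standard library alone. This is a sketch under assumptions: the real `_extract_text_recursively` walks a BeautifulSoup tree, whereas the hypothetical `TextExtractor` below uses `html.parser` and only mimics the "append a space after these closing tags" rule.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, adding a space after the closing tags in space_after."""

    def __init__(self, space_after):
        super().__init__()
        self.space_after = frozenset(space_after)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.space_after:
            self.parts.append(" ")


def extract_text(html, space_after):
    parser = TextExtractor(space_after)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


html = "<table><tr><th>Type</th><th>Sound</th></tr></table>"
before = extract_text(html, {"p", "li"})              # cells run together
after = extract_text(html, {"p", "li", "th", "td"})   # cells are separated
```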

Note: This PR is one half of a complete fix. export_to_markdown() will still hang on documents with nested rich table cells until the companion fix in docling-project/docling-core#525 is also merged.

github-actions · 2026-02-23T17:52:29Z

✅ DCO Check Passed

Thanks @ivan-liminary, all your commits are properly signed off. 🎉

mergify · 2026-02-23T17:52:40Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
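
For contributors who want to check a title locally, the pattern above can be tried with Python's `re` module. This is only a sketch: it assumes Mergify's `~=` operator behaves like an unanchored regular-expression search, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


pr_title = (
    "fix(html): fix broken document tree and quadratic complexity "
    "in rich table cells"
)
```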

dosubot · 2026-02-23T17:56:35Z

Related Documentation

Checked 17 published document(s) in 1 knowledge base(s). No updates required.


codecov · 2026-02-23T19:45:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


ceberam · 2026-02-26T12:55:41Z

Thanks @ivan-liminary for spotting those bugs and tackling the issue of the low performance in parsing HTML rich tables.
Could you please check the regression tests? tests/test_backend_html.py::test_e2e_html_conversions is failing with the test file html_rich_table_cells.html.

ivan-liminary · 2026-02-26T15:52:14Z

Thanks @ceberam for looking into this — I pushed a fix to address the failing regression test in this PR.

Note: this PR depends on the companion core fix, so ideal merge order is:
(1) docling-project/docling-core#524, then (2) this PR #3025 (with fixture refresh/rebase if needed).

ceberam · 2026-03-05T09:15:43Z

> Thanks @ceberam for looking into this — I pushed a fix to address the failing regression test in this PR.
>
> Note: this PR depends on the companion core fix, so ideal merge order is: (1) docling-project/docling-core#524, then (2) this PR #3025 (with fixture refresh/rebase if needed).

@ivan-liminary Your PR docling-project/docling-core#525 on docling-core has been merged and the fix is available with the new release 2.67.1.
Feel free to use it in this PR and let us know when it is ready for the next review.

ivan-liminary · 2026-03-05T20:48:18Z

Thanks @ceberam! Everything is updated and ready for review — rebased on latest main, bumped docling-core to 2.67.1 (with #525 merged), regenerated fixtures, and trimmed test docstrings per your feedback on the other PR.

FYI you can reproduce the original bug with uv run docling https://en.wikipedia.org/wiki/Dinosaur — it hangs indefinitely without these fixes. With both PRs applied it completes in ~1 second.
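
Because the failure mode is a hang rather than an error, a reproduction script benefits from a timeout. The helper below is a hypothetical sketch (`run_with_timeout` is not part of docling); it wraps any command with `subprocess.run` and reports whether the command finished in time.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False when the command was killed after timeout_s seconds.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            check=False,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


# Harmless demonstration using the current interpreter instead of docling.
quick_finished, quick_elapsed = run_with_timeout([sys.executable, "-c", "pass"], 30)

# For the case described above, the command would be:
#   ["uv", "run", "docling", "https://en.wikipedia.org/wiki/Dinosaur"]
# with a timeout chosen to taste, for example 60 seconds.
```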

ceberam

@ivan-liminary Could you please check the tests/test_backend_msword.py test?
I think that pinning the latest docling-core requires updating the ground truth of other backend processors that verify the markdown serialization.

docling/backend/html_backend.py

…h table cells

Three related bugs in the HTML backend when processing table cells that contain rich content (RichTableCell), as found on Wikipedia pages with large reference, taxobox, or classification tables:

Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added as a child of the current parent via doc.add_group(), but its RefItem was never appended to added_refs / provs_in_cell. This meant:
- group_cell_elements() reparented the text items inside the InlineGroup (because their individual refs WERE in added_refs), moving them from body → outer_group_element.
- The InlineGroup itself remained in body.children still pointing to those same text items as its .children.
- Result: two nodes (InlineGroup and outer_group_element) claimed the same child items, with contradictory .parent pointers.

This broken tree caused double-serialization of text items in export_to_markdown().

Fix: make _use_inline_group() yield the RefItem of the created group. Callers (_flush_buffer, _handle_block, _handle_list) now track the InlineGroup ref instead of individual leaf refs when a group was created. group_cell_elements() then reparents the whole InlineGroup (with its children intact) rather than orphaning it.

Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already walked the entire table subtree (including nested tables) and emitted PictureItems for every <img>, there was an additional outer loop:

    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)

Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant <img> elements (including those in nested tables), this loop processed every image in the entire subtree again. A table nested N levels deep caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).

Fix: remove the outer loop. Images are already handled by parse_table_data() -> _use_table_cell_context() -> _walk() -> _emit_image().

Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which only appended a trailing space for <p> and <li> tags. When a table cell contained a nested <table>, adjacent <th> or <td> elements without whitespace NavigableString nodes between them were concatenated directly (e.g. "TypeSound" instead of "Type Sound").

Fix: add "th" and "td" to the trailing-space tag set so that the text content of each cell is separated by a space.

Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with rich table cell support.

Signed-off-by: Ivan Traus <[email protected]>

Signed-off-by: Ivan Traus <[email protected]>

The Bug 3 fix (adding th/td to trailing-space tags in get_text()) affects the XBRL backend which internally uses HTMLDocumentBackend. Regenerate the mlac-20251231 fixture to match the corrected text extraction. Signed-off-by: Ivan Traus <[email protected]>

…m tests Update uv.lock to pull in the merged nested-table flattening fix (docling-core#525). Regenerate markdown fixtures that now show flattened text instead of invalid embedded table syntax. Trim verbose test docstrings and remove narrating comments. Signed-off-by: Ivan Traus <[email protected]> Signed-off-by: Ivan Traus <[email protected]>

Add Generator[RefItem | None, None, None] return type and Google-style Yields section to _use_inline_group. Regenerate docx ground truth fixtures affected by docling-core 2.67.1 nested-table flattening. Signed-off-by: Ivan Traus <[email protected]>
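
The annotation and docstring style mentioned in this commit might look roughly like the sketch below. It is not the code from html_backend.py: `use_inline_group` is a simplified hypothetical, plain strings stand in for `RefItem`, and `Optional[str]` is used in place of `RefItem | None`.

```python
from collections.abc import Generator
from contextlib import contextmanager
from typing import Optional


@contextmanager
def use_inline_group(
    doc: list[str], needed: bool
) -> Generator[Optional[str], None, None]:
    """Open an inline group for the enclosed block when one is needed.

    Args:
        doc: Toy document, modelled here as a list of group references.
        needed: Whether the enclosed content requires an inline group.

    Yields:
        The reference of the created group, or None if no group was created.
    """
    if not needed:
        yield None
        return
    ref = f"#/groups/{len(doc)}"
    doc.append(ref)
    yield ref


doc: list[str] = []
tracked: list[str] = []
with use_inline_group(doc, needed=True) as group_ref:
    leaf_refs = ["#/texts/0", "#/texts/1"]
# Track the group when one exists, otherwise fall back to the leaves.
tracked.extend([group_ref] if group_ref is not None else leaf_refs)
```

Yielding the reference lets the caller decide what to track, which is the behaviour the Bug 1 fix relies on.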

ivan-liminary · 2026-03-06T18:11:42Z

Thanks @ceberam, I addressed both comments: added a return type annotation and a Google-style Yields: section to _use_inline_group, and regenerated all docx ground truth fixtures for docling-core 2.67.1. All tests pass locally (HTML, XBRL, msword).

ivan-liminary force-pushed the fix/html-rich-table-cells branch 3 times, most recently from e31164d to 54a7f97, February 23, 2026 21:09

PeterStaar-IBM requested review from ceberam and maxmnemonic February 26, 2026 12:37

ceberam added the bug and html labels Feb 26, 2026

ivan-liminary force-pushed the fix/html-rich-table-cells branch from 31e806e to c8d97f2 Compare March 2, 2026 07:14

ivan-liminary force-pushed the fix/html-rich-table-cells branch from c8d97f2 to 1cf4bd3 Compare March 5, 2026 20:43

ceberam requested changes Mar 6, 2026


ivan-liminary added 5 commits March 6, 2026 10:01

test(html): align markdown fixtures with current docling-core behavior

38a3c65

Signed-off-by: Ivan Traus <[email protected]>

ivan-liminary force-pushed the fix/html-rich-table-cells branch from 1cf4bd3 to 2f6a320 Compare March 6, 2026 18:06

ivan-liminary wants to merge 5 commits into docling-project:main from ivan-liminary:fix/html-rich-table-cells
