fix(html): fix broken document tree and quadratic complexity in rich table cells #3025

Open
ivan-liminary wants to merge 5 commits into docling-project:main from ivan-liminary:fix/html-rich-table-cells

Conversation

@ivan-liminary

Fixes #3024

Summary

Three bugs in html_backend.py's handling of <td>/<th> elements that contain block-level content (nested tables, lists, paragraphs):

Bug 1 — Orphaned InlineGroups in table cell content
_build_table_cell opened an InlineGroup for every cell but only closed it when returning a RichTableCell. When a plain TableCell was returned instead, the open InlineGroup was abandoned in the document body, causing orphaned items between tables.

Fix: always close the InlineGroup before the return-path decision.
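A minimal sketch of the "always close before deciding" pattern, not the actual docling code. The names `inline_group`, `build_table_cell`, and the string item kinds are hypothetical stand-ins chosen for illustration; the point is only that a context manager closes the group on every exit path.

```python
from contextlib import contextmanager


@contextmanager
def inline_group(open_groups, name):
    """Hypothetical stand-in for opening an InlineGroup in the document."""
    open_groups.append(name)
    try:
        yield name
    finally:
        # Runs on every exit path, so the group can never be left open.
        open_groups.pop()


def build_table_cell(open_groups, item_kinds):
    """Return the kind of cell to emit for a list of item kinds (illustrative)."""
    with inline_group(open_groups, "cell"):
        is_rich = any(kind != "text" for kind in item_kinds)
    # The group is already closed here, before the return path is chosen.
    if is_rich:
        return "RichTableCell"
    return "TableCell"
```

With this shape, the plain `TableCell` path and the `RichTableCell` path both leave `open_groups` empty.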

Bug 2 — Quadratic PictureItem construction
_build_table_cell called _get_direct_picture_items() once per cell to check for pictures, but that helper iterated the entire document item list — making the HTML backend O(cells × document_items). On large pages this caused a significant slowdown.

Fix: replace the helper call with a direct isinstance check on the locally constructed cell items — O(items_in_cell), no document scan.
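The local check described above can be sketched as follows. `TextItem` and `PictureItem` here are simplified stand-in classes defined only for this example, and `cell_has_picture` is a hypothetical name; the real backend works with docling-core item types.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Stand-in for a text item built while walking one cell."""
    text: str


@dataclass
class PictureItem:
    """Stand-in for a picture item built while walking one cell."""
    uri: str


def cell_has_picture(cell_items):
    """Inspect only the items built for this cell: O(len(cell_items))."""
    return any(isinstance(item, PictureItem) for item in cell_items)
```

Because the check only looks at the list built for the current cell, its cost no longer depends on how many items the rest of the document holds.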

Bug 3 — Missing space separator in get_text() for nested table cells
_extract_text_recursively did not append a trailing space after <th>/<td> elements, so sibling-cell text was concatenated without any separator (e.g. "cell Acell B" instead of "cell A cell B").

Fix: add "th" and "td" to the tag set that appends a trailing space (alongside "p" and "li", which already did so).
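The effect of the tag set can be illustrated with a small standard-library extractor. This is not `_extract_text_recursively` itself (which operates on a BeautifulSoup tree); `TextExtractor`, `extract_text`, and `SPACE_AFTER` are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser

# Tags whose closing appends a trailing space; "th" and "td" are the additions.
SPACE_AFTER = frozenset({"p", "li", "th", "td"})


class TextExtractor(HTMLParser):
    """Collect text nodes, adding a space after selected closing tags."""

    def __init__(self, space_after):
        super().__init__()
        self.space_after = space_after
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.space_after:
            self.parts.append(" ")


def extract_text(html, space_after=SPACE_AFTER):
    parser = TextExtractor(space_after)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


row = "<table><tr><td>cell A</td><td>cell B</td></tr></table>"
print(extract_text(row, frozenset({"p", "li"})))  # old tag set: cell Acell B
print(extract_text(row))                          # new tag set: cell A cell B
```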


Note: This PR is one half of a complete fix. export_to_markdown() will still hang on documents with nested rich table cells until the companion fix in docling-project/docling-core#525 is also merged.

@github-actions
Contributor

github-actions bot commented Feb 23, 2026

DCO Check Passed

Thanks @ivan-liminary, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Feb 23, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
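For checking a title locally before pushing, the pattern above can be tried with Python's `re` module. This is a rough sketch: it assumes the rule behaves like an unanchored regular-expression search, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title):
    """Return True if the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional(
    "fix(html): fix broken document tree and quadratic complexity in rich table cells"
))  # True
print(is_conventional("Fix html tables"))  # False
```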

@dosubot

dosubot bot commented Feb 23, 2026

Related Documentation

Checked 17 published document(s) in 1 knowledge base(s). No updates required.


@codecov

codecov bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


ivan-liminary force-pushed the fix/html-rich-table-cells branch 3 times, most recently from e31164d to 54a7f97, on February 23, 2026 21:09

@ceberam
Member

ceberam commented Feb 26, 2026

Thanks @ivan-liminary for spotting those bugs and tackling the slow parsing of HTML rich tables.
Could you please check the regression tests? tests/test_backend_html.py::test_e2e_html_conversions is failing with the test file html_rich_table_cells.html.

ceberam added the bug (Something isn't working) and html (issue related to html backend) labels on Feb 26, 2026

@ivan-liminary
Author

ivan-liminary commented Feb 26, 2026

Thanks @ceberam for looking into this — I pushed a fix to address the failing regression test in this PR.

Note: this PR depends on the companion core fix, so ideal merge order is:
(1) docling-project/docling-core#524, then (2) this PR #3025 (with fixture refresh/rebase if needed).

ivan-liminary force-pushed the fix/html-rich-table-cells branch from 31e806e to c8d97f2 on March 2, 2026 07:14

@ceberam
Member

ceberam commented Mar 5, 2026

@ivan-liminary Your PR docling-project/docling-core#525 on docling-core has been merged and the fix is available with the new release 2.67.1.
Feel free to use it in this PR and let us know when it is ready for the next review.

ivan-liminary force-pushed the fix/html-rich-table-cells branch from c8d97f2 to 1cf4bd3 on March 5, 2026 20:43

@ivan-liminary
Author

Thanks @ceberam! Everything is updated and ready for review — rebased on latest main, bumped docling-core to 2.67.1 (with #525 merged), regenerated fixtures, and trimmed test docstrings per your feedback on the other PR.

FYI you can reproduce the original bug with uv run docling https://en.wikipedia.org/wiki/Dinosaur — it hangs indefinitely without these fixes. With both PRs applied it completes in ~1 second.

@ceberam
Member

ceberam left a comment

@ivan-liminary Could you please check the tests/test_backend_msword.py test?
I think that pinning the latest docling-core requires updating the ground truth of other backend processors that verify the markdown serialization.

…h table cells

Three related bugs in the HTML backend when processing table cells that
contain rich content (RichTableCell), as found on Wikipedia pages with
large reference, taxobox, or classification tables:

Bug 1 — orphaned InlineGroups causing broken parent/child relationships
------------------------------------------------------------------------
When _use_inline_group() created an InlineGroup node (for paragraphs
containing multiple hyperlinks, e.g. "text <a> and <a>"), it was added
as a child of the current parent via doc.add_group(), but its RefItem was
never appended to added_refs / provs_in_cell. This meant:

  - group_cell_elements() reparented the text items inside the InlineGroup
    (because their individual refs WERE in added_refs), moving them from
    body → outer_group_element.
  - The InlineGroup itself remained in body.children still pointing to
    those same text items as its .children.
  - Result: two nodes (InlineGroup and outer_group_element) claimed the
    same child items, with contradictory .parent pointers. This broken
    tree caused double-serialization of text items in export_to_markdown().

Fix: make _use_inline_group() yield the RefItem of the created group.
Callers (_flush_buffer, _handle_block, _handle_list) now track the
InlineGroup ref instead of individual leaf refs when a group was created.
group_cell_elements() then reparents the whole InlineGroup (with its
children intact) rather than orphaning it.
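The difference between reparenting the leaves and reparenting the whole group can be shown on a toy tree. This is a simplified model under stated assumptions, not docling-core's data structures: `Node`, `add_child`, `move_from_body`, and `is_consistent` are hypothetical names, and `move_from_body` assumes (as the buggy path did) that every tracked item is a direct child of body.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; avoids recursive equality checks
class Node:
    name: str
    parent: Node | None = None
    children: list[Node] = field(default_factory=list)


def add_child(parent, child):
    child.parent = parent
    parent.children.append(child)


def move_from_body(body, node, new_parent):
    """Reparent a tracked node, assuming it currently hangs off body."""
    if node in body.children:
        body.children.remove(node)
    add_child(new_parent, node)


def is_consistent(root):
    """Every node listed as a child must point back to that parent."""
    stack = [root]
    while stack:
        current = stack.pop()
        for child in current.children:
            if child.parent is not current:
                return False
            stack.append(child)
    return True


def build():
    body, group, cell = Node("body"), Node("inline_group"), Node("cell_group")
    add_child(body, group)
    add_child(body, cell)
    leaves = [Node("text"), Node("link")]
    for leaf in leaves:
        add_child(group, leaf)
    return body, group, leaves, cell


# Tracking leaf refs: the group keeps stale children, so the tree breaks.
body, group, leaves, cell = build()
for leaf in leaves:
    move_from_body(body, leaf, cell)
print(is_consistent(body))  # False

# Tracking the group ref: the group moves with its children intact.
body, group, leaves, cell = build()
move_from_body(body, group, cell)
print(is_consistent(body))  # True
```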

Bug 2 — quadratic PictureItem creation from stray outer image loop
-------------------------------------------------------------------
In _handle_block() for <table> tags, after parse_table_data() had already
walked the entire table subtree (including nested tables) and emitted
PictureItems for every <img>, there was an additional outer loop:

    for img_tag in tag("img"):
        im_ref2 = self._emit_image(tag, doc)

Because BeautifulSoup's .find_all("img") on a tag finds ALL descendant
<img> elements (including those in nested tables), this loop processed
every image in the entire subtree again. A table nested N levels deep
caused N*(N+1)/2 duplicate PictureItems per image (quadratic growth).

Fix: remove the outer loop. Images are already handled by parse_table_data()
-> _use_table_cell_context() -> _walk() -> _emit_image().
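One way to arrive at the N*(N+1)/2 figure is a toy count. The model below is an assumption made for illustration, not a trace of the backend: it supposes N linearly nested tables, each holding exactly one <img> directly, with the outer loop running once per table over all of that table's descendant images. `count_emissions` is a hypothetical name.

```python
def count_emissions(depth, with_outer_loop):
    """Count picture emissions for `depth` linearly nested tables (toy model)."""
    # The regular walk emits each of the `depth` images exactly once.
    emitted = depth
    if with_outer_loop:
        # The table at level k (1 = outermost) sees its own image plus the
        # images of the depth - k tables nested inside it.
        for level in range(1, depth + 1):
            emitted += depth - level + 1
    return emitted


for n in (1, 2, 4, 8):
    extra = count_emissions(n, True) - count_emissions(n, False)
    print(n, extra, n * (n + 1) // 2)
```

Under these assumptions the number of extra items equals N*(N+1)/2, and removing the loop brings the total back to one item per image.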

Bug 3 — missing space separator between nested table cell text
--------------------------------------------------------------
HTMLDocumentBackend.get_text() uses _extract_text_recursively(), which
only appended a trailing space for <p> and <li> tags. When a table cell
contained a nested <table>, adjacent <th> or <td> elements without
whitespace NavigableString nodes between them were concatenated directly
(e.g. "TypeSound" instead of "Type Sound").

Fix: add "th" and "td" to the trailing-space tag set so that the text
content of each cell is separated by a space.

Bug 1 and Bug 2 were introduced in docling v2.55.0 (commit c803abe) with
rich table cell support.

Signed-off-by: Ivan Traus <[email protected]>

The Bug 3 fix (adding th/td to trailing-space tags in get_text())
affects the XBRL backend which internally uses HTMLDocumentBackend.
Regenerate the mlac-20251231 fixture to match the corrected text
extraction.

Signed-off-by: Ivan Traus <[email protected]>

…m tests

Update uv.lock to pull in the merged nested-table flattening fix
(docling-core#525). Regenerate markdown fixtures that now show flattened
text instead of invalid embedded table syntax. Trim verbose test
docstrings and remove narrating comments.

Signed-off-by: Ivan Traus <[email protected]>

Add Generator[RefItem | None, None, None] return type and Google-style
Yields section to _use_inline_group. Regenerate docx ground truth
fixtures affected by docling-core 2.67.1 nested-table flattening.

Signed-off-by: Ivan Traus <[email protected]>
ivan-liminary force-pushed the fix/html-rich-table-cells branch from 1cf4bd3 to 2f6a320 on March 6, 2026 18:06

@ivan-liminary
Author

Thanks @ceberam, I addressed both comments: I added a return type annotation and a Google-style Yields: section to _use_inline_group, and regenerated all docx ground truth fixtures for docling-core 2.67.1. All tests pass locally (HTML, XBRL, msword).
