fix: add parse timeout to legacy LaTeX documents by adityasasidhar · Pull Request #3019 · docling-project/docling

adityasasidhar · 2026-02-21T12:30:11Z

Issue resolved by this Pull Request:

Resolves #2972

Description of changes:

This PR introduces a parse_timeout option for the LaTeX backend to prevent pylatexenc from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. By running the LatexWalker in a daemon thread, we can stop waiting for the parse once the timeout expires and fall back to raw text extraction rather than letting the entire application hang.
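The daemon-thread approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `parse_or_fallback` and its `parse` callable are hypothetical names, and the real backend wraps pylatexenc's LatexWalker rather than an arbitrary function.

```python
import logging
import threading
from typing import Callable, Dict, Optional

_log = logging.getLogger(__name__)


def parse_or_fallback(
    source: str,
    parse: Callable[[str], str],  # hypothetical stand-in for the structured parser
    timeout: Optional[float] = 30.0,
) -> str:
    """Return parse(source), or the raw source if parsing exceeds `timeout`."""
    if timeout is None:
        # Timeout disabled: parse directly in the calling thread.
        return parse(source)

    outcome: Dict[str, object] = {}

    def worker() -> None:
        try:
            outcome["result"] = parse(source)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    # daemon=True so a stuck parser cannot keep the interpreter alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # The thread is not killed; we only stop waiting for it.
        _log.warning("LaTeX parse exceeded %.1fs, falling back to raw text", timeout)
        return source
    if "error" in outcome:
        raise outcome["error"]  # type: ignore[misc]
    return outcome["result"]  # type: ignore[return-value]
```

One design caveat of this pattern: Python threads cannot be forcibly stopped, so a timed-out parse keeps consuming CPU in the background until it finishes or the process exits.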

The default timeout is set to 30s (this does seem like a lot; I will set a better upper limit once testing on larger files and more complex references in images is lined up).

Includes validation tests utilizing the exact problematic files raised in #2972.

Those files are:

https://arxiv.org/abs/hep-th/0005057
https://arxiv.org/abs/math/0106220
https://arxiv.org/abs/quant-ph/9802040

Expanded macros and definitions will be added in subsequent pull requests (please check whether the hanging issue is resolved).

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

…but I'll have to go over the average parse time of large files and decide on an upper limit and an average limit; the next commit needs to ignore individual nodes instead of the whole file. Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

github-actions · 2026-02-21T12:30:22Z

✅ DCO Check Passed

Thanks @adityasasidhar, all your commits are properly signed off. 🎉

mergify · 2026-02-21T12:30:45Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

dosubot · 2026-02-21T12:31:36Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

View Suggested Changes

@@ -55,6 +55,32 @@
 
 ---
 
+### LaTeX
+- **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
+- **Key Options** (`LatexBackendOptions`):
+    - `parse_timeout` (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to `None` to disable the timeout. This prevents `pylatexenc` from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. If parsing exceeds this timeout, the conversion will fall back to raw text extraction rather than structured parsing. A warning will be logged when a timeout occurs.
+- **Processing**:
+    - Parses LaTeX source using `pylatexenc` to extract structured content (sections, equations, tables, etc.)
+    - Pre-processes custom macros (e.g., `\be`/`\ee` shortcuts for equations)
+    - Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
+- **Notes**: The `parse_timeout` option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
+
+```python
+from docling.datamodel.backend_options import LatexBackendOptions
+
+# Increase timeout to 60 seconds
+latex_options = LatexBackendOptions(
+    parse_timeout=60.0
+)
+
+# Or disable timeout entirely
+latex_options = LatexBackendOptions(
+    parse_timeout=None
+)
+```
+
+---
+
 #### Additional Notes
 - Only PDF supports image resolution adjustment (`images_scale`). The default PDF backend is now `docling_parse`.
 - DOCX header/footer export is only available via Python API.


Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

codecov · 2026-02-22T05:48:44Z

Codecov Report

❌ Patch coverage is 72.97297% with 30 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/latex_backend.py	72.72%	30 Missing ⚠️


…ata, math envs, bugfixes

Features:
- Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments
- Proof environment with conditional QED ◻ symbol
- \paragraph and \subparagraph as headings (levels 4, 5)
- \author, \date, \title extracted from preamble
- \href preserves URL as [text](url)
- \renewcommand and \providecommand macro extraction
- dmath/dgroup/darray/subequations math environments
- \input cycle detection with depth limit of 10
- quote/quotation/verse environment handling

Bugfixes:
- Fixed UnboundLocalError in _extract_custom_macros
- Fixed _extract_verbatim_content regex stealing content
- Fixed is_valid() rejecting preamble-only fragments
- Removed unused deepcopy import
- Unified recursion depth limits to 10

Tests:
- 7 new tests, 1 updated, ground-truth regenerated

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

…unction now accepts \documentstyle too and added some essential and primitive layout passes Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar · 2026-02-24T03:51:52Z

Hello @vku-ibm @PeterStaar-IBM,

Here are the full details of all the changes:

Added a default 30-second timeout for each paper
In addition, there is a test in test/test_backend_latex.py that feeds in a hard-to-convert .tex file to force the backend to skip over the file (I would like to assure you that this is intentional for now; the backend will be adapted to ignore only parts of the file rather than the whole file)
Preamble metadata extraction (\title, \author, \date)
Additional math envs (dmath, dgroup, darray, subequations)
\paragraph/\subparagraph → heading levels 4/5
\href → Markdown links
\renewcommand/\providecommand in macro extraction
\input cycle detection (depth limit 10); basically, overall stability improvements inspired by pandoc and the latex2markdown library
Added 22 commands to the skip list (\newtheorem, \pagestyle, \tableofcontents, \clearpage, \protect, etc.)
Added 8 layout commands to pass-through (\noindent, \par, \smallskip, \medskip, \bigskip, \vfill, \vskip, \hskip)
Changed the is_valid() function multiple times in this iteration while trying to settle on a more widely accepted notion of a valid .tex file
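
For readers unfamiliar with the \input cycle detection item above, here is a rough sketch of the general idea. It is an illustration under assumptions, not the backend's implementation: `expand_inputs`, `INPUT_RE`, and `MAX_DEPTH` are hypothetical names, only the brace form `\input{name}` is handled, and skipped files are simply replaced by an empty string.

```python
import re
from pathlib import Path
from typing import Optional, Set

INPUT_RE = re.compile(r"\\input\{([^}]+)\}")
MAX_DEPTH = 10  # depth limit mentioned in the list above


def expand_inputs(path: Path, depth: int = 0, active: Optional[Set[Path]] = None) -> str:
    """Inline \\input{...} files, skipping cycles and overly deep nesting."""
    active = set() if active is None else active
    resolved = path.resolve()
    # Skip files that are too deep, already being expanded (a cycle), or missing.
    if depth > MAX_DEPTH or resolved in active or not resolved.is_file():
        return ""
    active.add(resolved)
    try:
        text = resolved.read_text(encoding="utf-8", errors="replace")

        def replace(match: "re.Match[str]") -> str:
            child = resolved.parent / match.group(1).strip()
            if child.suffix == "":
                child = child.with_suffix(".tex")  # \input{b} refers to b.tex
            return expand_inputs(child, depth + 1, active)

        return INPUT_RE.sub(replace, text)
    finally:
        # Allow the same file to be included again from a sibling branch.
        active.discard(resolved)
```

Tracking only the files on the current include chain (rather than every file seen so far) lets a document include the same helper file twice while still breaking true cycles.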

Please point out any changes that are required.

vku-ibm · 2026-02-24T12:43:52Z

@adityasasidhar early results are good, everything was converted without breaking.

There were two errors that showed up in the logs. These are not blockers, but they may be informative or provide some additional examples:

2263 processing: [....]/latex-files-mass/chao-dyn9906017.gz
converting: [....]/latex-files-mass/temp/step_0906.tex
Error: /undefined in c
Operand stack:
   --nostringval--   f0   |______?
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1933   1   3   %oparray_pop   1932   1   3   %oparray_pop   1931   1   3   %oparray_pop   --nostringval--   1917   1   3   %oparray_pop   1787   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--
Dictionary stack:
   --dict:748/1123(ro)(G)--   --dict:0/20(G)--   --dict:90/200(L)--   --dict:100/300(L)--
Current allocation mode is local
Current file position is 14100
GPL Ghostscript 10.04.0: Unrecoverable error, exit code 1

---------

2307 processing: [...]/latex-files-mass/cond-mat0005001.gz
converting: [...]/latex-files-mass/temp/Les_Houches.tex
Error: /undefined in Adobe_level2_AI5
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1933   1   3   %oparray_pop   1932   1   3   %oparray_pop   1931   1   3   %oparray_pop   --nostringval--   1917   1   3   %oparray_pop   1787   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--
Dictionary stack:
   --dict:748/1123(ro)(G)--   --dict:0/20(G)--   --dict:93/200(L)--
Current allocation mode is local
Current file position is 13960
GPL Ghostscript 10.04.0: Unrecoverable error, exit code 1
Error: /undefined in AltsysDict
....
{the rest is cut for brevity}

adityasasidhar · 2026-02-24T13:04:16Z


yo @vku-ibm,

That sounds good!! This is the expected behaviour (giggles)

Let me try explaining the logs you got:

The LaTeX backend hits \includegraphics{figure.eps} -> finds the file on disk -> falls into the else branch (not a .pdf) -> calls Image.open() on the .eps file -> Pillow internally invokes Ghostscript -> Ghostscript crashes on the malformed PostScript -> the exception handler catches it and logs "Could not load image" -> the backend continues without crashing

The best part is that the document still gets a caption placeholder (Image: figure.eps), just without the actual image.

No content is lost, no crash.
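
The flow described above can be sketched as follows. This is a simplified illustration, not the backend's actual code: `load_figure` and `open_with_pillow` are hypothetical names, and the `opener` parameter exists only so the fallback path can be exercised without Pillow or Ghostscript installed.

```python
import logging
from pathlib import Path
from typing import Callable, Optional, Tuple

_log = logging.getLogger(__name__)


def open_with_pillow(path: str) -> object:
    from PIL import Image  # imported lazily so the sketch loads without Pillow

    image = Image.open(path)
    image.load()  # force decoding now, so decoder errors surface here
    return image


def load_figure(
    path: str,
    opener: Callable[[str], object] = open_with_pillow,
) -> Tuple[Optional[object], str]:
    """Return (image or None, caption); a failed load never raises."""
    caption = f"Image: {Path(path).name}"
    try:
        return opener(path), caption
    except Exception as exc:
        # A decoder failure (for example Ghostscript on a bad .eps) lands here.
        _log.warning("Could not load image %s: %s", path, exc)
        return None, caption
```

For example, `load_figure("figure.eps", opener=some_failing_opener)` yields `None` for the image together with the caption placeholder, so the surrounding document structure is preserved.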

This could also mean there is a small package mismatch or a missing package on the system the test was run on, maybe an older Ghostscript package, since Ghostscript is an external dependency that our Python Pillow library uses.

I'll research a bit and add supporting code for such cases.

Thank you for being patient.

dolfim-ibm

lgtm

…rmatted parsing test which caused hanging on Python 3.10 during the CI/CD testing pipeline Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar · 2026-03-01T07:03:41Z

Hello @vku-ibm @dolfim-ibm @PeterStaar-IBM,

Apologies for dismissing the review. The new commit got rid of a test that was causing a hang on Python 3.10 during the CI pipeline; I narrowed it down to one test (specifically, the test that checked whether the parse-timeout fallback is working).

I removed it and added some tests to improve the code coverage.

I would really appreciate a re-review when you get a chance.

Thank you for your patience!!

PeterStaar-IBM · 2026-03-01T08:13:45Z


Nice!

PeterStaar-IBM

great!

adityasasidhar mentioned this pull request Feb 21, 2026

Latex backend gets stuck on some documents #2972

Open

fix: bypass mypy attr-defined for parse_timeout in late options

58e72d0

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

PeterStaar-IBM requested review from PeterStaar-IBM, cau-git and dolfim-ibm February 23, 2026 07:09

adityasasidhar added 2 commits February 24, 2026 00:02

Added some more dangerous macros to the ignore list, the is_valid() f…

aee5fbd

…unction now accepts \documentstyle too and added some essential and primitive layout passes Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar force-pushed the main branch from aee5fbd to 58e72d0 Compare February 23, 2026 19:06

removed the restrictive nature of the is_valid() function

e93ef19

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar force-pushed the main branch from a92a93c to e93ef19 Compare February 24, 2026 03:36

dolfim-ibm previously approved these changes Feb 27, 2026


adityasasidhar added 2 commits February 28, 2026 10:56

Merge branch 'docling-project:main' into main

3bbf70c

added test coverage to the added features and got rid of the time fo…

1e49579

…rmatted parsing test which caused hanging on Python 3.10 during the CI/CD testing pipeline Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

adityasasidhar dismissed dolfim-ibm’s stale review via 1e49579 March 1, 2026 06:55

adityasasidhar requested a review from dolfim-ibm March 1, 2026 07:05

dolfim-ibm approved these changes Mar 1, 2026


adityasasidhar added 3 commits March 4, 2026 14:41

Merge branch 'docling-project:main' into main

6272214

Merge branch 'docling-project:main' into main

0fc6d08

Merge branch 'docling-project:main' into main

5740baf

PeterStaar-IBM approved these changes Mar 7, 2026


Merge branch 'docling-project:main' into main

e56fb63

adityasasidhar wants to merge 11 commits into docling-project:main from adityasasidhar:main


4 participants
