fix: add parse timeout to legacy LaTeX documents#3019

Open
adityasasidhar wants to merge 11 commits intodocling-project:mainfrom
adityasasidhar:main

Conversation

@adityasasidhar
Contributor

Issue resolved by this Pull Request:

Resolves #2972

Description of changes:

This PR introduces a parse_timeout option for the LaTeX backend to prevent pylatexenc from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. By running the LatexWalker in a daemon thread, we can stop waiting on the parse operation once the timeout expires and fall back to raw text extraction rather than letting the entire application hang.

The default timeout is set to 30 s. (This does seem like a lot; I will set a better upper limit once tests on larger files and more complex image references are lined up.)
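
The daemon-thread timeout pattern described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `run_with_timeout` and `parse_or_fallback` are hypothetical names, and `parse` stands in for whatever call drives pylatexenc. Python cannot forcibly stop a running thread, so the sketch assumes the worker is simply abandoned on timeout and, being a daemon, does not keep the process alive.

```python
import threading
from typing import Callable, Dict, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_with_timeout(func: Callable[[], T], timeout: Optional[float]) -> Tuple[bool, Optional[T]]:
    """Run func in a daemon thread and return (finished, result).

    Hypothetical helper. With timeout=None the call runs synchronously,
    mirroring the "disable the timeout" setting.
    """
    if timeout is None:
        return True, func()

    box: Dict[str, object] = {}

    def target() -> None:
        # Store either the result or the exception so the caller can see it.
        try:
            box["result"] = func()
        except Exception as exc:
            box["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)  # wait at most `timeout` seconds

    if worker.is_alive():
        # Timed out: the thread is not killed, only abandoned.
        return False, None
    if "error" in box:
        raise box["error"]  # re-raise parse errors in the calling thread
    return True, box["result"]


def parse_or_fallback(text: str, parse: Callable[[str], str], timeout: Optional[float] = 30.0) -> str:
    """Return the parsed output, or the raw text if parsing times out."""
    finished, result = run_with_timeout(lambda: parse(text), timeout)
    return result if finished else text
```

One consequence of this design is that a timed-out parse keeps consuming CPU in the background until it finishes or the process exits, which may matter when converting many documents in one process.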

Includes validation tests utilizing the exact problematic files raised in #2972.

Those files are:

https://arxiv.org/abs/hep-th/0005057
https://arxiv.org/abs/math/0106220
https://arxiv.org/abs/quant-ph/9802040

Expanded macros and definitions will be added in subsequent pull requests. (Please check whether the hanging issue is resolved.)

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

…but I'll have to go over the average parse time of large files and decide on an upper limit and an average limit; the next commit needs to ignore individual nodes instead of the whole file

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
@github-actions
Contributor

github-actions bot commented Feb 21, 2026

DCO Check Passed

Thanks @adityasasidhar, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Feb 21, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@dosubot

dosubot bot commented Feb 21, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -55,6 +55,32 @@
 
 ---
 
+### LaTeX
+- **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
+- **Key Options** (`LatexBackendOptions`):
+    - `parse_timeout` (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to `None` to disable the timeout. This prevents `pylatexenc` from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macroscopic environments. If parsing exceeds this timeout, the conversion will fall back to raw text extraction rather than structured parsing. A warning will be logged when a timeout occurs.
+- **Processing**:
+    - Parses LaTeX source using `pylatexenc` to extract structured content (sections, equations, tables, etc.)
+    - Pre-processes custom macros (e.g., `\be`/`\ee` shortcuts for equations)
+    - Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
+- **Notes**: The `parse_timeout` option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
+
+```python
+from docling.datamodel.backend_options import LatexBackendOptions
+
+# Increase timeout to 60 seconds
+latex_options = LatexBackendOptions(
+    parse_timeout=60.0
+)
+
+# Or disable timeout entirely
+latex_options = LatexBackendOptions(
+    parse_timeout=None
+)
+```
+
+---
+
 #### Additional Notes
 - Only PDF supports image resolution adjustment (`images_scale`). The default PDF backend is now `docling_parse`.
 - DOCX header/footer export is only available via Python API.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
@codecov

codecov bot commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 72.97297% with 30 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/latex_backend.py | 72.72% | 30 Missing ⚠️ |

…ata, math envs, bugfixes

Features:
- Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments
- Proof environment with conditional QED ◻ symbol
- \paragraph and \subparagraph as headings (levels 4, 5)
- \author, \date, \title extracted from preamble
- \href preserves URL as [text](url)
- \renewcommand and \providecommand macro extraction
- dmath/dgroup/darray/subequations math environments
- \input cycle detection with depth limit of 10
- quote/quotation/verse environment handling

Bugfixes:
- Fixed UnboundLocalError in _extract_custom_macros
- Fixed _extract_verbatim_content regex stealing content
- Fixed is_valid() rejecting preamble-only fragments
- Removed unused deepcopy import
- Unified recursion depth limits to 10

Tests:
- 7 new tests, 1 updated, ground-truth regenerated

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
…unction now accepts \documentstyle too and added some essential and primitive layout passes

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
@adityasasidhar
Contributor Author

adityasasidhar commented Feb 24, 2026

Hello @vku-ibm @PeterStaar-IBM ,

Here are the full details of all the changes:

  1. Added a default 30-second timeout for each paper.
  2. Added a test in test/test_backend_latex.py that feeds the hard-to-convert .tex file to the backend and forces it to skip the file. (I would like to assure you that this is intentional for now; the backend will be adapted to ignore only parts of a file rather than the whole file.)
  3. Preamble metadata extraction (\title, \author, \date).
  4. Additional math envs (dmath, dgroup, darray, subequations).
    \paragraph/\subparagraph → heading levels 4/5
    \href → Markdown links
    \renewcommand/\providecommand in macro extraction
    \input cycle detection (depth limit 10)
    Overall, these are stability improvements inspired by pandoc and the latex2mardown library.
  5. Added 22 commands to the skip list (\newtheorem, \pagestyle, \tableofcontents, \clearpage, \protect, etc.).
    Added 8 layout commands to pass-through (\noindent, \par, \smallskip, \medskip, \bigskip, \vfill, \vskip, \hskip).
  6. Changed the is_valid() function several times in this iteration, trying to settle on a better definition of a valid .tex file.
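
To illustrate the \input cycle detection with a depth limit of 10 mentioned in the list above, here is a rough sketch. It is not the backend's implementation: `expand_inputs`, `INPUT_RE`, and the in-memory `files` mapping are hypothetical stand-ins (the real backend reads files from disk), and the regex only handles the simple `\input{name}` form.

```python
import re
from typing import FrozenSet, Mapping

# Matches the simple form \input{name}; other forms are ignored in this sketch.
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")
MAX_DEPTH = 10  # depth limit taken from the PR description


def expand_inputs(
    name: str,
    files: Mapping[str, str],
    depth: int = 0,
    active: FrozenSet[str] = frozenset(),
) -> str:
    """Recursively inline \\input{...} references.

    A reference is dropped (replaced by an empty string) if it is too deep,
    if it points at a file already being expanded (a cycle), or if the file
    is unknown.
    """
    if depth > MAX_DEPTH or name in active or name not in files:
        return ""
    # Track only the files on the current expansion path, so the same file
    # may still be included twice from unrelated places.
    active = active | {name}

    def replace(match: "re.Match[str]") -> str:
        return expand_inputs(match.group(1), files, depth + 1, active)

    return INPUT_RE.sub(replace, files[name])
```

For example, with `files = {"main": "A\\input{sec}B", "sec": "S\\input{main}"}`, expanding `"main"` inlines `sec` once and drops the back-reference to `main` instead of recursing forever.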

Please point out any changes that are required.

@vku-ibm
Member

vku-ibm commented Feb 24, 2026

@adityasasidhar early results are good, everything was converted without breaking.

There were two errors that showed up in the logs. These are not blockers, but maybe they are informative or provide some additional examples:

2263 processing: [....]/latex-files-mass/chao-dyn9906017.gz
converting: [....]/latex-files-mass/temp/step_0906.tex
Error: /undefined in c
Operand stack:
   --nostringval--   f0   |______?
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1933   1   3   %oparray_pop   1932   1   3   %oparray_pop   1931   1   3   %oparray_pop   --nostringval--   1917   1   3   %oparray_pop   1787   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--
Dictionary stack:
   --dict:748/1123(ro)(G)--   --dict:0/20(G)--   --dict:90/200(L)--   --dict:100/300(L)--
Current allocation mode is local
Current file position is 14100
GPL Ghostscript 10.04.0: Unrecoverable error, exit code 1

---------

2307 processing: [...]/latex-files-mass/cond-mat0005001.gz
converting: [...]/latex-files-mass/temp/Les_Houches.tex
Error: /undefined in Adobe_level2_AI5
Operand stack:

Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1933   1   3   %oparray_pop   1932   1   3   %oparray_pop   1931   1   3   %oparray_pop   --nostringval--   1917   1   3   %oparray_pop   1787   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--
Dictionary stack:
   --dict:748/1123(ro)(G)--   --dict:0/20(G)--   --dict:93/200(L)--
Current allocation mode is local
Current file position is 13960
GPL Ghostscript 10.04.0: Unrecoverable error, exit code 1
Error: /undefined in AltsysDict
....
{the rest is cut for brevity}

@adityasasidhar
Contributor Author

adityasasidhar commented Feb 24, 2026

yo @vku-ibm ,

That sounds good!! This is the expected behaviour ( giggles )

Let me try explaining the logs you got:

The LaTeX backend hits \includegraphics{figure.eps} → finds the file on disk → falls into the else branch (not a .pdf) → calls Image.open() on the .eps file → Pillow internally invokes Ghostscript → Ghostscript crashes on the malformed PostScript → the exception handler catches it and logs "Could not load image" → the backend continues without crashing.

The best part is that the document still gets a caption placeholder (Image: figure.eps), just without the actual image.

No content is lost, no crash.
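
The fallback flow described above can be sketched like this. It is only an illustration under assumptions: `load_figure` is a hypothetical name, and the `opener` argument stands in for Pillow's `Image.open` so that the example does not depend on Pillow or Ghostscript being installed.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Optional, Tuple

_log = logging.getLogger(__name__)


def load_figure(path: str, opener: Callable[[str], Any]) -> Tuple[str, Optional[Any]]:
    """Return (caption, image); image is None if the file cannot be loaded.

    The caption placeholder is always produced, so the document keeps a
    trace of the figure even when decoding fails.
    """
    caption = f"Image: {Path(path).name}"
    try:
        image = opener(path)
    except Exception as exc:
        # Broad catch on purpose in this sketch: any decoder failure
        # (for example Ghostscript choking on malformed PostScript)
        # is logged and the conversion carries on.
        _log.warning("Could not load image %s: %s", path, exc)
        return caption, None
    return caption, image
```

With a failing opener the caller still gets the `Image: figure.eps` placeholder and can continue converting the rest of the document.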

This could also mean there is a small package mismatch or a missing package on the system the test was run on, maybe an older Ghostscript version, since Ghostscript is an external dependency that the Python Pillow library uses.

I'll research a bit and add supporting code for such cases.

Thank you for being patient.

dolfim-ibm
dolfim-ibm previously approved these changes Feb 27, 2026
Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

…rmatted parsing test which caused hanging on Python 3.10 during the CI/CD testing pipeline

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
@adityasasidhar
Contributor Author

adityasasidhar commented Mar 1, 2026

Hello @vku-ibm @dolfim-ibm @PeterStaar-IBM,

Apologies for dismissing the review. The new commit removes a test that was causing a hang on Python 3.10 during the CI pipeline; I narrowed it down to the test that checked whether the parse-timeout fallback works.

I got rid of it and added some tests to improve the code coverage.

I would really appreciate a re-review when you get a chance.

Thank you for your patience!!

@PeterStaar-IBM
Member

Nice!

Member

@PeterStaar-IBM PeterStaar-IBM left a comment

great!

Development

Successfully merging this pull request may close these issues.

Latex backend gets stuck on some documents

4 participants