fix: add parse timeout to legacy LaTeX documents #3019
adityasasidhar wants to merge 11 commits into docling-project:main
…but I'll have to go over the average parse time of large files and decide on an upper limit and an average limit; the next commit needs to skip individual nodes instead of the whole file
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
✅ DCO Check Passed
Thanks @adityasasidhar, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers.
Related Documentation
1 document(s) may need updating based on files changed in this PR:
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
View Suggested Changes
@@ -55,6 +55,32 @@
---
+### LaTeX
+- **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
+- **Key Options** (`LatexBackendOptions`):
+ - `parse_timeout` (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to `None` to disable the timeout. This prevents `pylatexenc` from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. If parsing exceeds this timeout, the conversion falls back to raw text extraction rather than structured parsing, and a warning is logged.
+- **Processing**:
+ - Parses LaTeX source using `pylatexenc` to extract structured content (sections, equations, tables, etc.)
+ - Pre-processes custom macros (e.g., `\be`/`\ee` shortcuts for equations)
+ - Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
+- **Notes**: The `parse_timeout` option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
+
+```python
+from docling.datamodel.backend_options import LatexBackendOptions
+
+# Increase timeout to 60 seconds
+latex_options = LatexBackendOptions(
+    parse_timeout=60.0,
+)
+
+# Or disable timeout entirely
+latex_options = LatexBackendOptions(
+    parse_timeout=None,
+)
+```
+
+---
+
#### Additional Notes
- Only PDF supports image resolution adjustment (`images_scale`). The default PDF backend is now `docling_parse`.
- DOCX header/footer export is only available via Python API.
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Codecov Report: ❌ Patch coverage is
…ata, math envs, bugfixes

Features:
- Theorem/proof/lemma/corollary/definition/remark/example/conjecture environments
- Proof environment with conditional QED ◻ symbol
- \paragraph and \subparagraph as headings (levels 4, 5)
- \author, \date, \title extracted from preamble
- \href preserves URL as [text](url)
- \renewcommand and \providecommand macro extraction
- dmath/dgroup/darray/subequations math environments
- \input cycle detection with depth limit of 10
- quote/quotation/verse environment handling

Bugfixes:
- Fixed UnboundLocalError in _extract_custom_macros
- Fixed _extract_verbatim_content regex stealing content
- Fixed is_valid() rejecting preamble-only fragments
- Removed unused deepcopy import
- Unified recursion depth limits to 10

Tests:
- 7 new tests, 1 updated, ground-truth regenerated

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
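For reviewers who want a concrete picture of the "\input cycle detection with depth limit of 10" item, here is a minimal sketch of the general technique. It is not the code from this PR: `expand_inputs`, `MAX_INPUT_DEPTH`, and the regex are hypothetical names chosen for illustration, and the sketch only handles the plain `\input{name}` form.

```python
import re
from pathlib import Path
from typing import FrozenSet

MAX_INPUT_DEPTH = 10  # mirrors the limit named in the commit message
_INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def expand_inputs(
    path: Path, depth: int = 0, active: FrozenSet[Path] = frozenset()
) -> str:
    """Inline \\input{...} files, dropping includes that cycle or nest too deep."""
    resolved = path.resolve()
    # `active` holds the files on the current include chain; seeing one of
    # them again means the includes form a cycle.
    if depth > MAX_INPUT_DEPTH or resolved in active:
        return ""
    if not resolved.is_file():
        return ""
    text = resolved.read_text(encoding="utf-8", errors="replace")
    chain = active | {resolved}

    def replace(match: "re.Match[str]") -> str:
        target = resolved.parent / match.group(1).strip()
        if target.suffix == "":
            # LaTeX appends .tex when the extension is omitted
            target = target.with_suffix(".tex")
        return expand_inputs(target, depth + 1, chain)

    return _INPUT_RE.sub(replace, text)
```

Tracking the current include chain instead of a global "seen" set means the same file can still be included twice from different places; only a true loop is cut.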
…unction now accepts \documentstyle too, and added some essential, primitive layout passes
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Hello @vku-ibm @PeterStaar-IBM, here are the full details of all the changes:
Please highlight any changes that are required.
@adityasasidhar early results are good: everything was converted without breaking. There were two errors that showed up in the logs. These are not blockers, but maybe they are informative or provide some additional examples:
Yo @vku-ibm, that sounds good!! This is the expected behaviour (giggles). Let me try explaining the logs you got:
LaTeX backend hits \includegraphics{figure.eps} → finds the file on disk → falls into the else branch (not a .pdf) → calls Image.open() on the .eps file → Pillow internally invokes Ghostscript → Ghostscript crashes on the malformed PostScript → the exception handler catches it and logs "Could not load image" → the backend continues without crashing.
The best part is that the document still gets a caption placeholder (Image: figure.eps), just without the actual image. No content is lost, no crash. This could also mean there is a small package mismatch or a missing package on the system the test was run on, maybe an older Ghostscript install, since Ghostscript is an external dependency that our Python Pillow library uses. I'll research a bit and add supporting code for such cases. Thank you for being patient.
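The flow described in this comment can be sketched roughly as follows. This is an illustration only, not the backend's actual code: `load_figure` is a hypothetical helper, and `open_image` is a stand-in for whatever loader the backend calls (the comment mentions Pillow's `Image.open`), injected here so the sketch runs without Pillow or Ghostscript installed.

```python
import logging
from pathlib import Path
from typing import Callable, Optional, Tuple

_log = logging.getLogger(__name__)


def load_figure(
    path: Path, open_image: Callable[[Path], object]
) -> Tuple[Optional[object], str]:
    """Return (image or None, caption placeholder) for an includegraphics target."""
    # The placeholder is built first so it survives a failed load.
    caption = f"Image: {path.name}"
    try:
        image = open_image(path)
    except Exception as exc:
        # Any loader failure is logged and swallowed, so one broken figure
        # does not abort the whole conversion.
        _log.warning("Could not load image %s: %s", path, exc)
        return None, caption
    return image, caption
```

The design choice is that a failed image costs only the picture: the caption placeholder is still emitted and the rest of the document converts normally.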
…rmatted parsing test which caused a hang on Python 3.10 during the CI/CD testing pipeline
Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
Hello @vku-ibm @dolfim-ibm @PeterStaar-IBM, apologies for dismissing the review. The new commit got rid of a test that was causing the hang on Python 3.10 during the CI pipeline. I narrowed it down to one test (specifically, the test that checked whether the parse-timeout fallback works), removed it, and added some tests to improve the code coverage. I would really appreciate a re-review when you get a chance. Thank you for your patience!!
Nice! |
Issue resolved by this Pull Request:
Resolves #2972
Description of changes:
This PR introduces a `parse_timeout` option for the LaTeX backend to prevent `pylatexenc` from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. By running the `LatexWalker` in a daemon thread, we can abandon a stalled parse and fall back to raw text extraction rather than letting the entire application hang.
The default timeout has been set to 30s (this does seem like a lot; I will set a better upper limit once testing on larger files and more complex image references lines up).
Includes validation tests using the exact problematic files raised in #2972. Those files are:
https://arxiv.org/abs/hep-th/0005057
https://arxiv.org/abs/math/0106220
https://arxiv.org/abs/quant-ph/9802040
Expanded macros and definitions will be added in subsequent pull requests (please check whether the hanging issue is resolved).
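As a rough illustration of the daemon-thread timeout described above, here is a minimal sketch of the general pattern. It is not the code in this PR: `run_with_timeout` and `parse_or_fallback` are hypothetical names, and `parse` stands in for the `pylatexenc` call. Note that Python cannot forcibly stop a thread, so in this sketch a timed-out worker is abandoned rather than interrupted; marking it as a daemon only ensures it does not block interpreter exit.

```python
import logging
import threading
from typing import Any, Callable, Optional, Tuple

_log = logging.getLogger(__name__)


def run_with_timeout(
    fn: Callable[[], Any], timeout: Optional[float]
) -> Tuple[bool, Any]:
    """Return (True, result) if fn finished in time, else (False, None)."""
    if timeout is None:
        return True, fn()  # timeout disabled: run on the calling thread

    results: list = []
    errors: list = []

    def worker() -> None:
        try:
            results.append(fn())
        except Exception as exc:  # hand the error back to the caller
            errors.append(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # Still running: stop waiting and let the caller fall back.
        _log.warning("Parse exceeded %.1fs; falling back to raw text", timeout)
        return False, None
    if errors:
        raise errors[0]
    return True, results[0]


def parse_or_fallback(
    source: str, parse: Callable[[str], Any], timeout: Optional[float] = 30.0
) -> Any:
    """Return the structured parse, or the raw source text on timeout."""
    finished, nodes = run_with_timeout(lambda: parse(source), timeout)
    return nodes if finished else source
```

One consequence of this pattern is that an abandoned worker keeps consuming CPU until it finishes or the process exits, which is one reason the choice of default timeout matters for batch conversions.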
Checklist: