Releases: krrome/docling-hierarchical-pdf
Fix bug and improve detection of incorrect headers
Fix minor bugs
Fix minor bugs
Fix: Source DocumentStreams + Error handling
This release makes it possible to pass the path of the source file, or a stream, to ResultPostprocessor so that pymupdf can read it for metadata extraction.
If you run into PDFFileNotFoundException, the source you passed to DocumentConverter().convert(source=source) was either a str or a DocumentStream, and the Docling conversion result unfortunately no longer holds a valid reference to the source file. The postprocessor therefore needs your help: if source was a string, pass source=source when instantiating ResultPostprocessor. Full example:
from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor
source = "my_file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result, source=source).process()
# ...
If you used a DocumentStream object as the source, you unfortunately have to pass one of the following as the source argument to ResultPostprocessor: a valid Path to the PDF, a new, open BytesIO stream, or a new DocumentStream object. The reason is that docling closes the source stream when it has finished with it, so that stream can no longer be read.
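For the stream case, a minimal sketch might look like the following. It assumes the PDF bytes are available in memory and that DocumentStream can be imported from docling.datamodel.base_models. The helper names fresh_stream and convert_with_hierarchy are hypothetical and not part of either library.

```python
from io import BytesIO


def fresh_stream(pdf_bytes: bytes) -> BytesIO:
    """Return a new, open BytesIO over the same bytes on every call."""
    return BytesIO(pdf_bytes)


def convert_with_hierarchy(pdf_bytes: bytes, name: str = "my_file.pdf"):
    """Convert PDF bytes with docling, then postprocess using a second stream.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so fresh_stream stays usable without docling installed.
    # Assumption: DocumentStream is importable from this module path.
    from docling.datamodel.base_models import DocumentStream
    from docling.document_converter import DocumentConverter
    from hierarchical.postprocessor import ResultPostprocessor

    converter = DocumentConverter()
    # docling closes this first stream once the conversion has finished.
    first = DocumentStream(name=name, stream=fresh_stream(pdf_bytes))
    result = converter.convert(first)

    # Give the postprocessor its own open stream over the same bytes.
    # The postprocessor modifies result.document in place.
    ResultPostprocessor(result, source=fresh_stream(pdf_bytes)).process()
    return result.document
```

Keeping the raw bytes around and wrapping them twice avoids reopening or rewinding a stream that docling has already closed.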
Use also PDF-metadata ToC
New in this release:
- use pymupdf to read the ToC from the PDF (if it exists in the PDF metadata)
- correct header levels and hierarchy based on the ToC
- best-effort attempt to:
  - convert texts and list items to headers if they were parsed incorrectly and appear in the ToC
  - convert headers to text items if they were parsed incorrectly and do not appear in the ToC
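The idea behind these points can be sketched as follows. This is an illustration only, not the package's actual implementation: it assumes exact matching on whitespace- and case-normalized titles, and the names read_toc, normalize and reconcile_with_toc are hypothetical. The only PyMuPDF call used is Document.get_toc(), which returns entries of the form [level, title, page].

```python
from typing import List, Optional, Tuple


def read_toc(pdf_path: str) -> List[Tuple[int, str, int]]:
    """Return (level, title, page) entries from the PDF outline, or [] if none."""
    # Imported lazily; older PyMuPDF versions use `import fitz` instead.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [(level, title, page) for level, title, page, *_ in doc.get_toc()]


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so small parsing differences still match."""
    return " ".join(text.split()).casefold()


def reconcile_with_toc(
    items: List[Tuple[str, bool]],
    toc: List[Tuple[int, str, int]],
) -> List[Tuple[str, Optional[int]]]:
    """Map (text, parsed_as_header) pairs to (text, header_level or None).

    Items whose text appears in the ToC become headers at the ToC level,
    even if they were parsed as plain text. Items parsed as headers that
    do not appear in the ToC are demoted to text (level None).
    """
    levels = {}
    for level, title, _page in toc:
        # Keep the first occurrence if a title is repeated in the ToC.
        levels.setdefault(normalize(title), level)
    return [(text, levels.get(normalize(text))) for text, _is_header in items]


toc = [(1, "Introduction", 1), (2, "Background", 2)]
items = [
    ("Introduction", True),      # header confirmed by the ToC
    ("background", False),       # text promoted to a level-2 header
    ("Page 3", True),            # spurious header demoted to text
    ("Some paragraph.", False),  # ordinary text stays text
]
reconciled = reconcile_with_toc(items, toc)
```

In this simplified version the ToC alone decides what is a header. A real implementation would likely need fuzzier matching and page information to handle repeated or slightly altered titles.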
v0.0.2
Initial minimal release
Initial minimal release to give access to hierarchy inference.