Releases: krrome/docling-hierarchical-pdf

Fix bug and improve detection of incorrect headers

19 Feb 06:54

Fix minor bugs

06 Jan 06:18

Fix minor bugs

20 Oct 18:25

Fixes

  • #7 ValueError when parsing arabic numbering
  • #8 Don't try to process headings if none were detected

Fix: Source DocumentStreams + Error handling

13 Oct 19:08

Makes it possible to pass the file path of the source file, or a stream, to ResultPostprocessor so that pymupdf can read it for metadata extraction.

If you run into a PDFFileNotFoundException, the source you passed to DocumentConverter().convert(source=source) was of type str or DocumentStream, in which case the Docling conversion result no longer holds a valid reference to the source file. The postprocessor therefore needs your help: if source was a string, pass source=source when instantiating ResultPostprocessor. Full example:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result, source=source).process()
# ...

If you used a DocumentStream object as the source, you have to pass ResultPostprocessor either a valid Path to the PDF or a new, open BytesIO stream or DocumentStream object as its source argument. The reason is that docling closes the source stream when it is finished, so that stream can no longer be read.
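
The closed-stream behaviour can be illustrated with the standard library alone. This is a minimal sketch, not code from the package: the placeholder bytes and the variable names are assumptions, and the explicit close() call merely simulates what docling does once conversion has finished.

```python
from io import BytesIO

# Placeholder bytes standing in for the contents of a real PDF (assumption).
pdf_bytes = b"%PDF-1.7 placeholder content"

# The stream that would be handed to the converter, e.g. wrapped in a DocumentStream.
conversion_stream = BytesIO(pdf_bytes)
conversion_stream.close()  # simulates the stream being closed after conversion

try:
    conversion_stream.read()
    reusable = True
except ValueError:
    # Reading from a closed BytesIO raises ValueError.
    reusable = False

# Because the raw bytes were kept, a fresh, open stream can be created.
postprocessing_stream = BytesIO(pdf_bytes)
header = postprocessing_stream.read(5)
postprocessing_stream.seek(0)  # rewind so the next reader starts at the beginning
```

Keeping the raw bytes (or the file path) around is what makes this work; the fresh stream is the kind of object that would then be passed as the source argument to ResultPostprocessor.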

Use also PDF-metadata ToC

06 Oct 19:03

New in this release:

  • use pymupdf to read ToC from pdf (if it exists in the pdf metadata)
  • correct header levels and hierarchy based on this
  • best effort attempt to:
    • convert texts and list items to headers if they were parsed incorrectly and appear in the ToC
    • convert headers to text items if they were parsed incorrectly and do not appear in the ToC
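
The idea behind the ToC-based correction can be sketched as follows. This is an illustrative sketch only, not the package's actual implementation: the function names, the (kind, text) item representation and the title-matching rule are all hypothetical. The ToC entries use the [level, title, page] shape that pymupdf's Document.get_toc() returns.

```python
def normalize(text):
    """Collapse whitespace and ignore case so titles compare loosely."""
    return " ".join(text.split()).casefold()


def apply_toc(items, toc):
    """Relabel parsed items using ToC entries of the form [level, title, page].

    items is a list of (kind, text) pairs; this representation is hypothetical.
    Returns a list of (kind, text, level) triples.
    """
    levels = {normalize(title): level for level, title, _page in toc}
    result = []
    for kind, text in items:
        level = levels.get(normalize(text))
        if level is not None:
            # The text appears in the ToC: treat it as a header at the ToC level.
            result.append(("header", text, level))
        elif kind == "header":
            # Parsed as a header but absent from the ToC: demote to plain text.
            result.append(("text", text, None))
        else:
            # Everything else is left unchanged.
            result.append((kind, text, None))
    return result


toc = [[1, "Introduction", 1], [2, "Background", 2]]
items = [
    ("header", "Introduction"),
    ("list_item", "Background"),
    ("header", "Page 3 of 10"),
    ("text", "Some paragraph."),
]
relabelled = apply_toc(items, toc)
```

Exact title matching after normalization is the simplest possible rule; a best-effort approach on real documents would likely need something more tolerant.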

v0.0.2

26 Sep 08:27

Fixes in documentation etc.

Initial minimal release

25 Sep 04:56

Pre-release

Initial minimal release to give access to hierarchy inference.