Releases · krrome/docling-hierarchical-pdf

If you run into the PDFFileNotFoundException, the source you passed to DocumentConverter().convert(source=source) was either a str or a DocumentStream, so the Docling conversion result unfortunately no longer holds a valid reference to the source file. The postprocessor therefore needs your help: if source was a string, pass source=source when instantiating ResultPostprocessor. Full example:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result, source=source).process()
# ...

If you used a DocumentStream object as source, you unfortunately have to pass one of the following as the source argument to ResultPostprocessor: a valid Path to the PDF, a new, open BytesIO stream, or a new DocumentStream object. The reason is that docling closes the source stream when it is finished, so that stream can no longer be read.
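The stream case can be sketched as follows. This is a minimal sketch, not the project's own example: it assumes the PDF bytes are kept in memory so that a second, still-open BytesIO can be handed to ResultPostprocessor. The helper names new_stream and convert_with_hierarchy are hypothetical, and the DocumentStream import path is the one used by docling's base models at the time of writing.

```python
from io import BytesIO


def new_stream(pdf_bytes: bytes) -> BytesIO:
    """Return a fresh, open in-memory stream over the same PDF bytes (hypothetical helper)."""
    return BytesIO(pdf_bytes)


def convert_with_hierarchy(pdf_bytes: bytes, name: str = "my_file.pdf"):
    """Convert a PDF held in memory, then post-process it with a second stream.

    Imports are local so the helper above stays usable without docling installed.
    """
    from docling.datamodel.base_models import DocumentStream
    from docling.document_converter import DocumentConverter
    from hierarchical.postprocessor import ResultPostprocessor

    converter = DocumentConverter()
    # docling closes this first stream once the conversion is finished.
    result = converter.convert(DocumentStream(name=name, stream=new_stream(pdf_bytes)))
    # Hand the postprocessor a new, open stream; it modifies result.document in place.
    ResultPostprocessor(result, source=new_stream(pdf_bytes)).process()
    return result
```

Keeping the raw bytes around is the design choice here: each consumer gets its own stream, so a stream closed by one step never affects the next.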

06 Oct 19:03

krrome

v0.1.0

ca33194

Use also PDF-metadata ToC

New in this release:

use pymupdf to read ToC from pdf (if it exists in the pdf metadata)
correct header levels and hierarchy based on this
best effort attempt to:
- convert texts and list items to headers if they were parsed incorrectly and appear in the ToC
- convert header to text items if they were parsed incorrectly and do not appear in the ToC
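The ToC-based correction described above can be illustrated with a small sketch. It is not the package's actual implementation: the matching rule (whitespace- and case-insensitive title comparison), the (kind, text) item representation, and the function names read_toc, toc_levels, and reclassify are all assumptions made for illustration. Only get_toc() is a real PyMuPDF call, and it returns entries of the form [level, title, page].

```python
def read_toc(pdf_path):
    """Read the ToC from the PDF metadata; an empty list means the PDF has none."""
    import pymupdf  # recent PyMuPDF releases; older ones use `import fitz`

    with pymupdf.open(pdf_path) as doc:
        return doc.get_toc()  # [[level, title, page], ...]


def normalize(title):
    """Collapse whitespace and ignore case so small parsing differences still match."""
    return " ".join(title.split()).casefold()


def toc_levels(toc):
    """Map each normalized ToC title to its level (the first occurrence wins)."""
    levels = {}
    for level, title, _page in toc:
        levels.setdefault(normalize(title), level)
    return levels


def reclassify(items, toc):
    """Best-effort relabeling of parsed items given as (kind, text) pairs.

    Items whose text appears in the ToC become headers at the ToC level;
    headers that do not appear in the ToC are demoted to plain text.
    """
    levels = toc_levels(toc)
    result = []
    for kind, text in items:
        level = levels.get(normalize(text))
        if level is not None:
            result.append(("header", text, level))
        elif kind == "header":
            result.append(("text", text, None))
        else:
            result.append((kind, text, None))
    return result
```

For instance, with a ToC of [[1, "Introduction", 1]], a parsed ("text", "Introduction") item would be promoted to a level-1 header, while a ("header", "Page 3 of 10") item would be demoted to text.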

26 Sep 08:27

krrome

v0.0.2

88a73ed

v0.0.2

Fixes in documentation etc.

25 Sep 04:56

krrome

v0.0.1

d911a6d

Initial minimal release

Pre-release

Initial minimal release to give access to hierarchy inference.

Releases: krrome/docling-hierarchical-pdf

Fix bug and improve detection of incorrect headers

Fix minor bugs

Fix minor bugs

Fixes

Fix: Source DocumentStreams + Error handling

Use also PDF-metadata ToC

v0.0.2

Initial minimal release
