Description
When parsing PDF documents, if specific pages fail to parse (e.g., due to exceptions caught in the pipeline) and are excluded from the doc.pages list, the export functions (export_to_doctags, export_to_html, export_to_markdown) do not generate page break markers for these gaps. This causes page count mismatch issues in downstream processing.
Steps to Reproduce
Attachments
test.pdf: A minimal reproduction file derived from a larger document. The content has been intentionally corrupted for confidentiality, but the parsing error still reproduces.
* Structure: Consists of 4 pages (Pages 78, 79, 83, 84).
* Behavior: Parsing succeeds for pages 78 and 84, but fails for pages 79 and 83.
1. Convert the attached test.pdf using DocumentConverter.
2. Export using export_to_doctags(), export_to_markdown(page_break_placeholder=str), export_to_html(split_page_view=True).
3. Count the page breaks in each output.
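Step 3 can be sketched as a small helper that counts markers in each exported string. This is only an illustration: the helper names are hypothetical, and the marker strings are taken from the outputs shown in this issue rather than from the docling API.

```python
# Markers as they appear in the outputs shown in this issue (assumed, not
# read from docling itself); adjust them if your export differs.
DOCTAGS_MARKER = "<page_break>"
MARKDOWN_MARKER = "===PAGE_BREAK==="
HTML_PAGE_MARKER = '<div class="page">'


def pages_from_separators(text: str, marker: str) -> int:
    """Pages implied by a separator-style marker: N separators split N + 1 pages."""
    return text.count(marker) + 1


def pages_from_containers(text: str, marker: str) -> int:
    """Pages implied by a container-style marker: one container per page."""
    return text.count(marker)


# Markdown-style output for a 5-page source in which page 3 failed to parse.
sample = "p1\n===PAGE_BREAK===\np2\n===PAGE_BREAK===\np4\n===PAGE_BREAK===\np5"
print(pages_from_separators(sample, MARKDOWN_MARKER))  # 4, although the source has 5 pages
```

Comparing these counts with the page count of the source PDF makes the mismatch visible for each export format.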
Reproduction Script:
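    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Assumed for completeness: the original report does not show these two values.
    source = "test.pdf"
    pipeline_options = PdfPipelineOptions()

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )
    doc = converter.convert(source).document

    # Export the same document in all three formats.
    doctags_output = doc.export_to_doctags()
    markdown_output = doc.export_to_markdown(page_break_placeholder="===PAGE_BREAK===")
    html_output = doc.export_to_html(split_page_view=True)

    print(doctags_output)
    print(markdown_output)
    print(html_output)

Expected Behavior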
- Export: the export functions should generate page break tags for all pages, including failed ones, so that page numbering remains consistent.
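The expected behavior can be illustrated with a minimal sketch in plain Python. It assumes the caller already has per-page text keyed by 1-based page number and knows the total page count of the source; both function names are hypothetical and are not part of docling or docling-core.

```python
def missing_pages(page_texts: dict[int, str], total_pages: int) -> list[int]:
    """Return the 1-based page numbers that have no parsed content."""
    return [n for n in range(1, total_pages + 1) if n not in page_texts]


def join_pages_with_gaps(
    page_texts: dict[int, str],
    total_pages: int,
    placeholder: str = "===PAGE_BREAK===",
) -> str:
    """Join pages with a placeholder, emitting an empty page for each gap
    so that the number of page breaks always equals total_pages - 1."""
    parts = [page_texts.get(n, "") for n in range(1, total_pages + 1)]
    return f"\n{placeholder}\n".join(parts)


# Pages 1, 2, 4, 5 parsed successfully; page 3 failed.
parsed = {1: "page 1", 2: "page 2", 4: "page 4", 5: "page 5"}
print(missing_pages(parsed, total_pages=5))  # [3]
print(join_pages_with_gaps(parsed, total_pages=5))
```

With this approach the failed page still occupies a slot between two placeholders, so downstream code that splits on the placeholder sees five pages.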
Actual Behavior
Missing page break for failed pages:
If pages 1, 2, 4, 5 succeed but page 3 fails, the output currently looks like this:
export_to_doctags():

    ... content ... 1page
    <page_break>
    ... content ... 2page
    <page_break>
    ... content ... 4page
    <page_break>
    ... content ... 5page

export_to_markdown(page_break_placeholder="===PAGE_BREAK==="):

    ... content ... 1page
    ===PAGE_BREAK===
    ... content ... 2page
    ===PAGE_BREAK===
    ... content ... 4page
    ===PAGE_BREAK===
    ... content ... 5page

export_to_html(split_page_view=True):

    <td>
    <div class="page">
    ... content ... 1page
    </div>
    </td>
    <td>
    <div class="page">
    ... content ... 2page
    </div>
    </td>
    <td>
    <div class="page">
    ... content ... 4page
    </div>
    </td>
    <td>
    <div class="page">
    ... content ... 5page
    </div>
    </td>

Environment
| Component | Version / Details |
|---|---|
| docling version | 2.31.1 |
| docling-core version | 2.31.0 |
| Python | 3.11 |
| OS | macOS |