Export functions fail to generate <page_break> tags for non-consecutive (skipped) pages #472

@jhchoi1182

Description

When parsing PDF documents, if specific pages fail to parse (e.g., due to exceptions caught in the pipeline) and are excluded from the doc.pages list, the export functions (export_to_doctags, export_to_html, export_to_markdown) do not generate page break markers for these gaps. This causes page count mismatch issues in downstream processing.
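To make the gap concrete, here is a minimal sketch of how the skipped pages could be identified. It is plain Python and does not call docling; the helper name `missing_pages` is hypothetical, and it assumes the caller already knows the total page count of the source PDF and the page numbers that ended up in doc.pages.

```python
from typing import Iterable, List


def missing_pages(parsed_page_numbers: Iterable[int], total_pages: int) -> List[int]:
    """Return the 1-based page numbers that are absent from the parsed set.

    Hypothetical helper: `parsed_page_numbers` stands in for whatever page
    numbers survived parsing; `total_pages` is the page count of the input PDF.
    """
    parsed = set(parsed_page_numbers)
    return [n for n in range(1, total_pages + 1) if n not in parsed]


# Pages 1, 2, 4 and 5 parsed; page 3 failed.
print(missing_pages([1, 2, 4, 5], total_pages=5))  # [3]
```

Any page number returned here is one for which the exports currently emit no page break.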

Steps to Reproduce

Attachments

  • test.pdf - A minimal reproduction file derived from a larger document. The content has been intentionally corrupted for confidentiality, but the parsing error still reproduces.
      * Structure: Consists of 4 pages (Pages 78, 79, 83, 84).
      * Behavior: Parsing succeeds for pages 78 and 84, but fails for pages 79 and 83.

1.  Convert the attached test.pdf using DocumentConverter.
2.  Export using export_to_doctags(), export_to_markdown(page_break_placeholder=str), export_to_html(split_page_view=True).
3.  Count the page break markers in each output and compare the count with the number of pages in the source PDF.

Reproduction Script:

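from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumed setup: the original report does not show how these two names are defined.
source = "test.pdf"
pipeline_options = PdfPipelineOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        ),
    },
)

doc = converter.convert(source).document

# Export with page separation enabled in each of the three formats.
doctags_output = doc.export_to_doctags()
markdown_output = doc.export_to_markdown(page_break_placeholder="===PAGE_BREAK===")
html_output = doc.export_to_html(split_page_view=True)

print(doctags_output)
print(markdown_output)
print(html_output)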
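Step 3 (counting the page breaks) can be sketched as follows. This is an illustrative helper, not part of docling: `implied_page_count` is a hypothetical name, and the marker string is the placeholder passed to export_to_markdown in the script above. The sample string mirrors the Markdown output shown under Actual Behavior rather than real converter output.

```python
def implied_page_count(output: str, marker: str) -> int:
    """Number of pages implied by an export: one more than the number of breaks."""
    return output.count(marker) + 1


# Sample shaped like the Markdown export when page 3 of 5 is skipped.
sample = "\n".join([
    "... content ... 1page",
    "===PAGE_BREAK===",
    "... content ... 2page",
    "===PAGE_BREAK===",
    "... content ... 4page",
    "===PAGE_BREAK===",
    "... content ... 5page",
])

total_pages = 5
implied = implied_page_count(sample, "===PAGE_BREAK===")
print(implied, total_pages, implied == total_pages)  # 4 5 False
```

The same check applies to the DocTags output with `<page_break>` as the marker.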

Expected Behavior

  • Export: the export functions should generate page break tags for all pages, including failed ones, so that page numbering remains consistent.
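The expected behavior can be illustrated with a small stand-alone sketch. It is not docling's serializer and makes no claim about how a fix would be implemented there; `join_pages_with_breaks` is a hypothetical function that assumes page content is available as a mapping from 1-based page number to text, plus the total page count of the source.

```python
from typing import Dict


def join_pages_with_breaks(pages: Dict[int, str], total_pages: int, placeholder: str) -> str:
    """Join page contents, emitting one break per page boundary.

    A page missing from `pages` contributes no content, but the break
    before it is still written, so the break count is always total_pages - 1.
    """
    parts = []
    for page_no in range(1, total_pages + 1):
        if page_no > 1:
            parts.append(placeholder)  # boundary between page_no - 1 and page_no
        if page_no in pages:
            parts.append(pages[page_no])
    return "\n".join(parts)


# Page 3 failed to parse and is absent from the mapping.
pages = {1: "page 1", 2: "page 2", 4: "page 4", 5: "page 5"}
result = join_pages_with_breaks(pages, total_pages=5, placeholder="===PAGE_BREAK===")
print(result.count("===PAGE_BREAK==="))  # 4
```

With this approach the skipped page shows up as two consecutive placeholders, so downstream code that splits on the placeholder still recovers five page slots.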

Actual Behavior

Missing page break for failed pages:
If pages 1, 2, 4, 5 succeed but page 3 fails, the output currently looks like this:

export_to_doctags():

... content ...      1page
<page_break>
... content ...      2page
<page_break>
... content ...      4page
<page_break>
... content ...      5page

export_to_markdown(page_break_placeholder="===PAGE_BREAK==="):

... content ...      1page
===PAGE_BREAK===
... content ...      2page
===PAGE_BREAK===
... content ...      4page
===PAGE_BREAK===
... content ...      5page

export_to_html(split_page_view=True):

<td>
<div class="page">
... content ...      1page
</div>
</td>

<td>
<div class="page">
... content ...      2page
</div>
</td>

<td>
<div class="page">
... content ...      4page
</div>
</td>

<td>
<div class="page">
... content ...      5page
</div>
</td>

Environment

Component       Version / Details
docling         2.31.1
docling-core    2.31.0
Python          3.11
OS              macOS
