fix: add failed pages to DoclingDocument for page break consistency by jhchoi1182 · Pull Request #2939 · docling-project/docling

jhchoi1182 · 2026-01-31T17:41:19Z

Description

Previously, PDF pages that failed to parse were not added to DoclingDocument.pages, which made page break markers incorrect during export.

This PR adds failed/skipped pages with their size info (if available from the backend) to maintain correct page numbering and structure in the final document.

Changes

- Added the _add_failed_pages_to_document() method to StandardPdfPipeline.
- Failed pages are now added to DoclingDocument.pages with:
  - Page number (page_no)
  - Size information (retrieved from the backend if possible, otherwise defaulting to 0.0 x 0.0)
  - No content (empty PageItem)
- This ensures export functions generate correct <page_break> markers even when some pages fail.
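
The backfill described above can be illustrated with a minimal, self-contained sketch. It does not use docling itself: `Size`, `PageItem`, `backfill_missing_pages` and `size_lookup` are hypothetical stand-ins chosen for illustration, and the real `_add_failed_pages_to_document()` may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class Size:
    # Defaults mirror the "0.0 x 0.0" fallback mentioned in the description.
    width: float = 0.0
    height: float = 0.0


@dataclass
class PageItem:
    # Hypothetical stand-in for an empty page entry (no content, only metadata).
    page_no: int
    size: Size = field(default_factory=Size)


def backfill_missing_pages(pages, page_count, size_lookup=None):
    """Add an empty PageItem for every page number in 1..page_count that is absent."""
    for page_no in range(1, page_count + 1):
        if page_no in pages:
            continue
        size = None
        if size_lookup is not None:
            try:
                size = size_lookup(page_no)  # best effort, may fail for broken pages
            except Exception:
                size = None
        pages[page_no] = PageItem(page_no, size if size is not None else Size())
    # Return the entries ordered by page number.
    return dict(sorted(pages.items()))


def lookup(page_no):
    # Assumed behaviour: page 2 still reports a size, page 4 cannot be loaded at all.
    if page_no == 2:
        return Size(612.0, 792.0)
    raise RuntimeError("page could not be loaded")


parsed = {1: PageItem(1, Size(612.0, 792.0)), 3: PageItem(3, Size(612.0, 792.0))}
filled = backfill_missing_pages(parsed, page_count=4, size_lookup=lookup)
```

In this sketch, pages 2 and 4 end up present in `filled`; page 2 carries the size reported by the lookup, while page 4 falls back to the zero size.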

Related Issue

This change is required to properly support the page break marker fix in docling-core:

Without this fix, DoclingDocument.pages would be missing failed pages, causing the page break logic in docling-core to produce incorrect results.
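
To make the failure mode concrete, here is a simplified model of an exporter that emits one marker between consecutive known pages. It is an assumption-laden sketch, not the docling-core export logic: `export_with_page_breaks` and its arguments are hypothetical names.

```python
PAGE_BREAK = "<page_break>"


def export_with_page_breaks(texts_by_page, known_pages):
    """Join page texts, emitting one marker between consecutive known pages."""
    parts = []
    for index, page_no in enumerate(sorted(known_pages)):
        if index > 0:
            parts.append(PAGE_BREAK)
        text = texts_by_page.get(page_no, "")
        if text:  # failed pages have no content, only a position
            parts.append(text)
    return "\n".join(parts)


texts = {1: "page one", 3: "page three"}  # page 2 failed to parse

# Page 2 is missing from the page list: only one marker is emitted.
without_fix = export_with_page_breaks(texts, known_pages=[1, 3])

# Page 2 is present as an empty placeholder: two markers are emitted.
with_fix = export_with_page_breaks(texts, known_pages=[1, 2, 3])
```

With the placeholder present, the text of page 3 is separated from page 1 by two markers, so a consumer counting markers still arrives at the right page number.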

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

github-actions · 2026-01-31T17:41:28Z

✅ DCO Check Passed

Thanks @jhchoi1182, all your commits are properly signed off. 🎉

dosubot · 2026-01-31T17:41:33Z

Related Documentation

Checked 14 published document(s) in 1 knowledge base(s). No updates required.

mergify · 2026-01-31T17:41:54Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
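
The title rule above can be tried locally with Python's `re` module. This sketch assumes the `~=` operator behaves like a regular-expression search on the title; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


pr_title = "fix: add failed pages to DoclingDocument for page break consistency"
title_ok = is_conventional(pr_title)
```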

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

Copilot

Pull request overview

Ensures PDFs with skipped/failed pages still preserve correct page numbering in DoclingDocument.pages, so downstream exports can emit consistent page-break markers.

Changes:

Add _add_failed_pages_to_document() to StandardPdfPipeline to insert placeholder PageItems for missing pages (with best-effort size info).
Invoke the new helper during document assembly to backfill missing page entries.
Add pytest coverage validating page presence and size info for skipped-page PDFs.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `docling/pipeline/standard_pdf_pipeline.py` | Adds logic to backfill missing/failed pages into `DoclingDocument.pages` and updates some typing annotations. |
| `tests/test_failed_pages.py` | Adds tests asserting failed/skipped pages are still represented in `DoclingDocument.pages` with size info. |

codecov · 2026-02-01T10:59:49Z

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/pipeline/standard_pdf_pipeline.py | 84.37% | 5 Missing ⚠️ |

jhchoi1182 · 2026-02-12T05:40:33Z

Hi @cau-git, just checking in on this PR. Is there anything else needed from my end to move this forward? Let me know if you need any changes.

cau-git

@jhchoi1182 Thanks for this contribution! Your changes pass tests and look legitimate.

PeterStaar-IBM

🎖️

jhchoi1182 · 2026-02-13T01:37:03Z

Hi @PeterStaar-IBM @cau-git, apologies for the dismissal. I just pushed a small fix to the test case (checking for PARTIAL status) based on feedback. Could you please take a quick look again?

jhchoi1182 · 2026-02-13T02:10:52Z

The test-pip-install-no-lock failure on Python 3.11 appears to be a dependency resolution issue — openai-whisper v20250625 resolves to numba==0.63.1 (with llvmlite==0.46.0) on Python 3.10, but to numba==0.53.1 (with llvmlite==0.36.0, which only supports Python <3.10) on Python 3.11.
Let me know if there's anything I should do on my end.

cau-git · 2026-02-13T09:19:37Z

@jhchoi1182 thanks for the updates. Could you please rebase onto main once more? It brings in the fix for test-pip-install-no-lock.

When some PDF pages fail to parse, they were not added to DoclingDocument.pages, causing page break markers to be incorrect during export. This adds failed/skipped pages with their size info (if available) to maintain correct page numbering and structure.

- Add _add_failed_pages_to_document() method in StandardPdfPipeline
- Add test cases for failed page handling
- Add test cases for normal page handling (regression test)
- Add test PDF files

Signed-off-by: jhchoi1182 <[email protected]>

jhchoi1182 · 2026-02-13T11:42:55Z

@cau-git
Done, rebased onto the latest main!

dolfim-ibm

lgtm

Copilot AI review requested due to automatic review settings January 31, 2026 17:41

Copilot started reviewing on behalf of jhchoi1182 January 31, 2026 17:41

Copilot AI reviewed Jan 31, 2026

jhchoi1182 mentioned this pull request Jan 31, 2026

fix: generate page_break for skipped pages in export functions docling-project/docling-core#466

Open

3 tasks

dolfim-ibm requested a review from cau-git February 1, 2026 10:46

cau-git previously approved these changes Feb 12, 2026

PeterStaar-IBM previously approved these changes Feb 12, 2026

dolfim-ibm reviewed Feb 12, 2026

jhchoi1182 dismissed stale reviews from PeterStaar-IBM and cau-git via 59590a2 February 13, 2026 00:46

jhchoi1182 requested review from PeterStaar-IBM and cau-git February 13, 2026 00:50

jhchoi1182 added 4 commits February 13, 2026 20:38

fix: ensure resource cleanup and simplify type hints

e2eed4c

- Wrap page_backend usage in try-finally to guarantee unload (prevents resource leaks).
- Simplify redundant 'float | None | None' type hint.

Signed-off-by: jhchoi1182 <[email protected]>
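
The try-finally cleanup mentioned in this commit follows a common pattern, sketched below with a fake backend. `FakePageBackend`, `get_size`, `unload` and `read_size_safely` are illustrative names assumed for this example; the actual pipeline code may look different.

```python
class FakePageBackend:
    """Hypothetical stand-in for a page backend that must be unloaded after use."""

    def __init__(self, width, height, broken=False):
        self.width = width
        self.height = height
        self.broken = broken
        self.unloaded = False

    def get_size(self):
        if self.broken:
            raise RuntimeError("page could not be read")
        return (self.width, self.height)

    def unload(self):
        self.unloaded = True


def read_size_safely(backend):
    """Return the page size, or None on failure, and always unload the backend."""
    try:
        return backend.get_size()
    except Exception:
        return None
    finally:
        # Runs on both the success path and the failure path.
        backend.unload()


good = FakePageBackend(612.0, 792.0)
bad = FakePageBackend(0.0, 0.0, broken=True)
good_size = read_size_safely(good)
bad_size = read_size_safely(bad)
```

Because `unload()` sits in the `finally` clause, the backend is released even when reading the size raises.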

fix: add groundtruth for normal_4pages.pdf and exclude failing PDFs from e2e test

7772e97

Signed-off-by: jhchoi1182 <[email protected]>

fix: ensure correct status assertion for failed pages in tests

a6b18a2

Signed-off-by: jhchoi1182 <[email protected]>

jhchoi1182 force-pushed the fix/add-failed-pages-to-document branch from 64fbe4e to a6b18a2 Compare February 13, 2026 11:40

dolfim-ibm approved these changes Feb 13, 2026

cau-git approved these changes Feb 13, 2026

cau-git merged commit 1f91482 into docling-project:main Feb 13, 2026
25 checks passed

jhchoi1182 deleted the fix/add-failed-pages-to-document branch February 13, 2026 12:39

jhchoi1182 restored the fix/add-failed-pages-to-document branch February 19, 2026 02:09

jhchoi1182 deleted the fix/add-failed-pages-to-document branch February 19, 2026 04:39
