feat: Enable pipeline override and reuse with compatible options #2952
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
✅ DCO Check Passed
Thanks @cau-git, all your commits are properly signed off. 🎉

Codecov Report
❌ Patch coverage is …

Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
…ne-options-override-without-reinit
- remove `force_all_model_init`
- reject incompatible override options (no auto pipeline reinit)
- allow runtime `do_*` overrides only for `True -> False` toggles
- apply compatible `do_*` overrides per execution in base/threaded PDF pipelines
- add compatibility tests and update converter docstrings

Signed-off-by: Christoph Auer <[email protected]>
Related Documentation
1 document(s) may need updating based on files changed in this PR:
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -8,9 +8,8 @@
- `generate_page_images`, `generate_picture_images`: Extract page/picture images
- `force_backend_text`: Force backend text extraction
- Additional options for OCR engine, layout model, table extraction, etc.
-- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8).
-
----
+- **Pipeline Option Overrides**: The Python API allows you to override pipeline options at conversion time for a given format using the `format_options` argument. Only `do_*` flags (such as `do_ocr`, `do_table_structure`, `do_code_enrichment`, `do_formula_enrichment`, etc.) can be changed, and only from `True` to `False`. All other options must remain identical to those used at pipeline initialization. Attempting to enable a `do_*` flag or change other fields will result in an error. This enables per-call disabling of enrichment features without reinitializing the pipeline.
+- **Notes**: Only PDF supports image resolution adjustment. For more details, see [pipeline options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/pipeline_options.py#L891-L1336) and [example](https://app.dosu.dev/documents/9640186d-61e1-4ca1-9d8a-b82b3ee6bff8). Refer to the Python SDK documentation for usage of `format_options`.
### DOCX
- **Pipeline/Backend**: `SimplePipeline` + `MsWordDocumentBackend`
@@ -52,5 +51,4 @@
- Only PDF supports image resolution adjustment (`images_scale`).
- DOCX header/footer export is only available via Python API.
- PPTX/XLSX support enrichment options and pagination (slide/sheet level).
-
-For further details, refer to the provided code links and examples.
+- **Pipeline Option Overrides**: For all formats, the Python API supports disabling enrichment-related `do_*` flags at conversion time using the `format_options` argument. Only disabling (True → False) is allowed; all other options must remain unchanged. See the PDF section above for details.
    def _get_enrichment_pipe_for_execution(
        self,
    ) -> Iterable[GenericEnrichmentModel[Any]]:
        effective_options = self.get_effective_options()
        assert isinstance(effective_options, ConvertPipelineOptions)

        do_picture_classification = (
            effective_options.do_picture_classification
            or effective_options.do_chart_extraction
        )
        do_picture_description = effective_options.do_picture_description
        do_chart_extraction = effective_options.do_chart_extraction

        for model in self.enrichment_pipe:
            if isinstance(model, DocumentPictureClassifier):
                if do_picture_classification:
                    yield model
            elif isinstance(model, PictureDescriptionBaseModel):
                if do_picture_description:
                    yield model
            elif isinstance(model, ChartExtractionModelGraniteVision):
                if do_chart_extraction:
                    yield model
            else:
                yield model
Not the coolest thing to put here. Ideas for improvements are welcome.
Currently, the approach is to always list the models and to set the `enabled` argument in `__init__`. Do we have to change it?
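For illustration only, the per-execution filtering discussed in this thread could also be written as a lookup table instead of an `isinstance` chain. This is a minimal sketch with stand-in classes: `Options`, `Model`, `PictureClassifier`, `PictureDescriber`, `ChartExtractor`, `OtherEnricher`, `GATES` and `models_for_execution` are all hypothetical names, not part of the codebase, and the sketch makes no claim about how the real models are wired.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class Options:
    """Hypothetical stand-in for the effective pipeline options."""
    do_picture_classification: bool = True
    do_picture_description: bool = True
    do_chart_extraction: bool = True


# Hypothetical stand-ins for the enrichment model classes.
class Model: ...
class PictureClassifier(Model): ...
class PictureDescriber(Model): ...
class ChartExtractor(Model): ...
class OtherEnricher(Model): ...


# One gate per model type. The classifier also runs when chart
# extraction is on, mirroring the `or` in the reviewed method.
GATES: Dict[type, Callable[[Options], bool]] = {
    PictureClassifier: lambda o: o.do_picture_classification or o.do_chart_extraction,
    PictureDescriber: lambda o: o.do_picture_description,
    ChartExtractor: lambda o: o.do_chart_extraction,
}


def models_for_execution(pipe: Iterable[Model], options: Options) -> Iterator[Model]:
    """Yield only the models whose gate passes; ungated models always run."""
    for model in pipe:
        gate = next((g for cls, g in GATES.items() if isinstance(model, cls)), None)
        if gate is None or gate(options):
            yield model
```

The table keeps the flag-to-model mapping in one place, at the cost of one more indirection than the explicit chain.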
Summary
This PR adds per-call pipeline option overrides in `DocumentConverter` and enforces compatibility-based reuse of initialized pipelines. The main scenario this covers is:
- Initialize `DocumentConverter` once with pipeline options for each desired format.
- `DocumentConverter.convert` can provide modified pipeline options within very tight constraints, only to disable a model that was enabled at initialization (e.g. turn off OCR or table structure in the standard pipeline).
- … `convert` call is requesting to disable a model. Previously, this would have required initializing another `DocumentConverter` with different options.

What’s Included
- … `format_options` override support to:
  - `DocumentConverter.convert(...)`
  - `DocumentConverter.convert_all(...)`
- … `do_...` flags
- `do_...` flags can only be relaxed (`True -> False`). Hence, one can override to not use an "enabled" model, but cannot enable a model that was disabled in the pre-initialized pipeline.
- … `do_*` overrides are respected safely.
- … `tests/test_options.py`.

Behavior
- … `raises_on_error`).

Checklist:
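The compatibility rule described in this PR (only `do_*` flags may change, and only from `True` to `False`) can be sketched in plain Python. This is an illustrative assumption, not the PR's implementation: `SketchOptions` and `check_override` are hypothetical names, and the real options models have many more fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SketchOptions:
    """Hypothetical stand-in for a pipeline options model."""
    do_ocr: bool = True
    do_table_structure: bool = True
    images_scale: float = 1.0


def check_override(initial: SketchOptions, override: SketchOptions) -> SketchOptions:
    """Return `override` if it is compatible with `initial`, else raise ValueError."""
    for f in fields(initial):
        before = getattr(initial, f.name)
        after = getattr(override, f.name)
        if before == after:
            continue  # unchanged fields are always compatible
        if not f.name.startswith("do_"):
            # Anything other than a do_* flag must match the initialized pipeline.
            raise ValueError(f"{f.name} cannot be overridden per call")
        if after:
            # The flag was False at initialization: its model was never loaded.
            raise ValueError(f"{f.name} cannot be enabled per call")
    return override
```

Under these assumptions, `check_override(SketchOptions(), SketchOptions(do_ocr=False))` is accepted, while enabling a flag that was off at initialization or changing `images_scale` raises `ValueError`.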