fix: detect Office Open XML formats from ZIP contents when filename has no extension by DebadityaQU · Pull Request #3073 · docling-project/docling

DebadityaQU · 2026-03-05T23:23:50Z

Summary

Bug: Format detection fails for DOCX/XLSX/PPTX files fetched via pre-signed URLs (S3/GCS/Azure Blob) when the storage backend does not return a Content-Disposition header with the original filename.
Root cause: _guess_format() identifies these files as application/zip (correct, since Office Open XML formats are ZIP-based), then attempts to disambiguate using only the filename extension. For pre-signed URLs, the resolved filename often has no extension (e.g. abc123-def456), so disambiguation fails, _guess_format() returns None, and the document is rejected as "File format not allowed".
Fix: Add _detect_office_mime_from_zip() that inspects the ZIP archive's internal structure as a fallback when extension-based detection cannot disambiguate. It checks for canonical marker files:
- word/document.xml → DOCX
- xl/workbook.xml → XLSX
- ppt/presentation.xml → PPTX

This fallback is applied in both the Path and DocumentStream branches of _guess_format().

Test plan

Verify DOCX files with .docx extension still resolve correctly (no regression)
Verify DOCX/XLSX/PPTX files with no extension are correctly detected via ZIP introspection
Verify non-Office ZIP files (e.g. plain .zip archives) still return None as before
Verify corrupted/invalid ZIP files are handled gracefully (BadZipFile is caught)
Pre-commit checks pass (ruff formatter, ruff linter, mypy)

…as no extension

Pre-signed URLs (S3/GCS/Azure) often lack a Content-Disposition header, causing the resolved filename to have no .docx/.xlsx/.pptx extension. Since these formats are ZIP-based, filetype reports "application/zip", and the extension-only disambiguation fails, leaving format as None.

Add _detect_office_mime_from_zip() that inspects the ZIP archive's internal structure (word/document.xml, xl/workbook.xml, ppt/presentation.xml) as a fallback when extension-based detection cannot disambiguate.

Signed-off-by: Debaditya Shome <[email protected]>
Made-with: Cursor

github-actions · 2026-03-05T23:24:00Z

✅ DCO Check Passed

Thanks @DebadityaQU, all your commits are properly signed off. 🎉

mergify · 2026-03-05T23:24:25Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
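As a rough illustration of what this rule accepts, the pattern can be tried against a few titles with Python's `re` module. This is only a sketch: it assumes the `~=` operator behaves like an unanchored regular-expression search, and the names `TITLE_RULE` and `title_ok` are made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def title_ok(title: str) -> bool:
    """Hypothetical check: does the title start with a conventional-commit prefix?"""
    return TITLE_RULE.search(title) is not None
```

Under that assumption, this PR's title (`fix: detect Office Open XML formats ...`) passes, as does a scoped, breaking-change title such as `feat(api)!: drop legacy loader`, while a title like `Update readme` does not.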

dosubot · 2026-03-05T23:26:19Z

Related Documentation

Checked 21 published document(s) in 1 knowledge base(s). No updates required.


…tension

Cover the introspection fallback for DOCX and PPTX streams and paths with no file extension, and verify plain (non-Office) ZIP archives still return None.

Also fix an IndexError on Path inputs with no suffix, where obj.suffixes[-1] would crash on an empty list; use obj.suffix instead.

Signed-off-by: Debaditya Shome <[email protected]>
Made-with: Cursor
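The IndexError mentioned in this commit follows from how `pathlib` treats extension-less names. The snippet below is a small standalone illustration, not the code in `docling/datamodel/document.py`; the two function names are hypothetical.

```python
from pathlib import Path


def last_suffix_unsafe(path: Path) -> str:
    # `suffixes` is an empty list for names without a dot, so [-1] raises IndexError.
    return path.suffixes[-1]


def last_suffix_safe(path: Path) -> str:
    # `suffix` returns "" for names without a dot and never raises.
    return path.suffix
```

For a name such as `abc123-def456`, `last_suffix_safe` returns an empty string while `last_suffix_unsafe` raises `IndexError`. For names that do have an extension (`report.docx`, `archive.tar.gz`) both return the same final suffix, so the swap should not change behavior for the existing cases.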

dolfim-ibm

thanks, lgtm

codecov · 2026-03-06T07:49:31Z

Codecov Report

❌ Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/datamodel/document.py	74.07%	7 Missing ⚠️

📢 Thoughts on this report? Let us know!

dolfim-ibm approved these changes Mar 6, 2026

View reviewed changes

PeterStaar-IBM approved these changes Mar 6, 2026

View reviewed changes

dolfim-ibm merged commit 56f06fe into docling-project:main Mar 6, 2026
24 of 25 checks passed

dolfim-ibm merged 2 commits into docling-project:main from DebadityaQU:fix/zip-based-office-format-detection
