fix: detect Office Open XML formats from ZIP contents when filename has no extension #3073

Merged
dolfim-ibm merged 2 commits into docling-project:main from DebadityaQU:fix/zip-based-office-format-detection
Mar 6, 2026

@DebadityaQU
Contributor

Summary

  • Bug: Format detection fails for DOCX/XLSX/PPTX files fetched via pre-signed URLs (S3/GCS/Azure Blob) when the storage backend does not return a Content-Disposition header with the original filename.
  • Root cause: _guess_format() identifies these files as application/zip (correct, since Office Open XML formats are ZIP-based), then attempts to disambiguate using only the filename extension. For pre-signed URLs, the resolved filename often has no extension (e.g. abc123-def456), so disambiguation fails, _guess_format() returns None, and the document is rejected as "File format not allowed".
  • Fix: Add _detect_office_mime_from_zip() that inspects the ZIP archive's internal structure as a fallback when extension-based detection cannot disambiguate. It checks for canonical marker files:
    • word/document.xml → DOCX
    • xl/workbook.xml → XLSX
    • ppt/presentation.xml → PPTX

This fallback is applied in both the Path and DocumentStream branches of _guess_format().
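
For readers who want to see the idea in isolation, here is a minimal sketch of the ZIP-introspection fallback described above. It is not the code merged in this PR: the function name `detect_office_mime_from_zip`, the `make_zip` helper, and the choice to take raw bytes are assumptions made for illustration. Only the three marker paths come from the description; the MIME strings are the standard registered Office Open XML types.

```python
import io
import zipfile
from typing import Optional

# Marker member -> MIME type. Marker paths are taken from the PR description;
# the mapping structure itself is an illustrative assumption.
OFFICE_MARKERS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_office_mime_from_zip(data: bytes) -> Optional[str]:
    """Return the Office MIME type if a marker member is present, else None."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        # Corrupted or non-ZIP input: report "unknown" instead of raising.
        return None
    for marker, mime in OFFICE_MARKERS.items():
        if marker in names:
            return mime
    return None


def make_zip(*members: str) -> bytes:
    """Hypothetical helper: build an in-memory ZIP with the given member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in members:
            zf.writestr(name, "<xml/>")
    return buf.getvalue()
```

Only the archive's member list is read, so no member is decompressed, which keeps the check cheap even for large documents.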

Test plan

  • Verify DOCX files with .docx extension still resolve correctly (no regression)
  • Verify DOCX/XLSX/PPTX files with no extension are correctly detected via ZIP introspection
  • Verify non-Office ZIP files (e.g. plain .zip archives) still return None as before
  • Verify corrupted/invalid ZIP files are handled gracefully (BadZipFile is caught)
  • Pre-commit checks pass (ruff formatter, ruff linter, mypy)
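
The first four items of the test plan can be exercised against a small stand-in, sketched below. This is a simplification and not docling's `_guess_format()`: `guess_office_format`, `zip_with`, and the short format labels are hypothetical, and the sketch skips the initial MIME sniffing step. It only shows the ordering of the two checks, extension first and ZIP contents second.

```python
import io
import zipfile
from pathlib import PurePath
from typing import Optional

# Both tables are illustrative; the marker paths are those named in the summary.
EXTENSIONS = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx"}
MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def guess_office_format(filename: str, data: bytes) -> Optional[str]:
    """Try the filename extension first, then fall back to ZIP member names."""
    by_ext = EXTENSIONS.get(PurePath(filename).suffix.lower())
    if by_ext is not None:
        return by_ext
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None  # corrupted input is treated as "unknown"
    return next((fmt for marker, fmt in MARKERS.items() if marker in names), None)


def zip_with(member: str) -> bytes:
    """Hypothetical fixture: an in-memory ZIP containing a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(member, "<xml/>")
    return buf.getvalue()


# One case per test-plan item.
assert guess_office_format("report.docx", zip_with("word/document.xml")) == "docx"
assert guess_office_format("abc123-def456", zip_with("xl/workbook.xml")) == "xlsx"
assert guess_office_format("archive.zip", zip_with("readme.txt")) is None
assert guess_office_format("abc123-def456", b"PK\x03\x04" + b"\x00" * 64) is None
```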

…as no extension

Pre-signed URLs (S3/GCS/Azure) often lack a Content-Disposition header,
causing the resolved filename to have no .docx/.xlsx/.pptx extension.
Since these formats are ZIP-based, filetype reports "application/zip",
and the extension-only disambiguation fails, leaving format as None.

Add _detect_office_mime_from_zip() that inspects the ZIP archive's
internal structure (word/document.xml, xl/workbook.xml,
ppt/presentation.xml) as a fallback when extension-based detection
cannot disambiguate.

Signed-off-by: Debaditya Shome <[email protected]>
Made-with: Cursor
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

DCO Check Passed

Thanks @DebadityaQU, all your commits are properly signed off. 🎉

@mergify
mergify bot commented Mar 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
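
To check a title locally against the same pattern, a short sketch follows. It assumes the `~=` condition behaves like an unanchored regular-expression search, which makes no difference here because the pattern starts with `^`. The name `is_conventional` is hypothetical.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# The title of this PR passes; matching is case-sensitive, so "Fix:" does not.
assert is_conventional(
    "fix: detect Office Open XML formats from ZIP contents when filename has no extension"
)
assert not is_conventional("Fix: detect Office formats")
```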

@dosubot
dosubot bot commented Mar 5, 2026

Related Documentation

Checked 21 published document(s) in 1 knowledge base(s). No updates required.

…tension

Cover the introspection fallback for DOCX and PPTX streams and paths
with no file extension, and verify plain (non-Office) ZIP archives
still return None.

Also fix an IndexError on Path inputs with no suffix where
obj.suffixes[-1] would crash on an empty list — use obj.suffix instead.

Signed-off-by: Debaditya Shome <[email protected]>
Made-with: Cursor
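
The `IndexError` mentioned in the commit message above is easy to reproduce with `pathlib` alone. The sketch below is illustrative and not the project's code; the two function names are hypothetical, and the example filename is the one used in the PR summary.

```python
from pathlib import Path


def last_suffix_unsafe(path: Path) -> str:
    # Path.suffixes is an empty list for a name with no dot,
    # so indexing [-1] raises IndexError.
    return path.suffixes[-1]


def last_suffix_safe(path: Path) -> str:
    # Path.suffix returns "" when there is no extension and never raises.
    return path.suffix


no_ext = Path("abc123-def456")
assert no_ext.suffixes == []
assert last_suffix_safe(no_ext) == ""

try:
    last_suffix_unsafe(no_ext)
except IndexError:
    crashed = True
else:
    crashed = False
assert crashed

# For names that do have an extension, both forms agree.
assert last_suffix_safe(Path("report.docx")) == last_suffix_unsafe(Path("report.docx")) == ".docx"
```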
Member

@dolfim-ibm dolfim-ibm left a comment

thanks, lgtm

@codecov
codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/datamodel/document.py | 74.07% | 7 Missing ⚠️ |

@dolfim-ibm dolfim-ibm merged commit 56f06fe into docling-project:main Mar 6, 2026
24 of 25 checks passed

3 participants