Bug: Support XLSX Mimetype#88

Merged
cau-git merged 3 commits intodocling-project:mainfrom
ctandrewtran:main
Dec 9, 2024
@ctandrewtran
Contributor

ctandrewtran commented Dec 2, 2024

Context

With the latest docling-core (v2.6.1) and docling (v2.8.1), .xlsx files cannot be processed: conversion fails with a ValueError stating that the mimetype "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" is not valid for DocumentOrigin.

"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" is not among the mimetypes offered by default by the mimetypes package that docling uses, so it has to be added in the extras section.

Fix

Add the mimetype as an extra mimetype
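
As a rough illustration of the idea (not the actual docling-core code), the sketch below accepts a mimetype if it appears either in Python's standard `mimetypes` registry or in an explicit list of extras. The names `EXTRA_MIMETYPES` and `validate_mimetype` are hypothetical and chosen only for this example.

```python
import mimetypes

XLSX_MIMETYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Hypothetical allow-list of mimetypes that the standard registry may not know.
EXTRA_MIMETYPES = frozenset({XLSX_MIMETYPE})


def validate_mimetype(value: str, extras=EXTRA_MIMETYPES) -> str:
    """Return value if it is a known or explicitly allowed mimetype, else raise."""
    known = set(mimetypes.types_map.values())
    if value not in known and value not in extras:
        raise ValueError(f"'{value}' is not a valid MIME type")
    return value


# With the XLSX mimetype listed as an extra, validation passes
# regardless of what the local registry contains.
validate_mimetype(XLSX_MIMETYPE)
```

Whether the standard registry already contains the XLSX mimetype can vary with the Python version and the system's mime.types files, which is why an explicit extras list is the more predictable option.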

Signed-off-by: Andrew Tran <48397302+ctandrewtran@users.noreply.github.com>
Signed-off-by: Andrew Tran <48397302+ctandrewtran@users.noreply.github.com>
@dolfim-ibm
Member

@ctandrewtran can you please provide a few more details on the failure that you observed?

  • which operating system?
  • which python version?
  • was this on a specific file?
  • running which code/command?

@ctandrewtran
Contributor Author

ctandrewtran commented Dec 3, 2024

@ctandrewtran can you please provide a few more details on the failure that you observed?

  • which operating system?
  • which python version?
  • was this on a specific file?
  • running which code/command?

OS: Amazon Linux 2

Kernel Version: 5.10.228-219.884.amzn2.x86_64

Python Version: 3.12.6

Specific file: any .xlsx file with the mimetype application/vnd.openxmlformats-officedocument.spreadsheetml.sheet reproduces the issue.

Note: I manually applied the change from this PR to the docling-core package in my environment, and it did fix the problem.

Code:

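import io

import streamlit as st
from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Let the user upload an .xlsx file; stop until one is provided
file = st.file_uploader("Upload an XLSX file", type=["xlsx"])
if file is None:
    st.stop()

# PDF model artifacts and options (not used by the XLSX conversion below)
arti = StandardPdfPipeline.download_models_hf()
pipelinePDF = PdfPipelineOptions(artifacts_path=arti)

# Only allow XLSX input
converter = DocumentConverter(allowed_formats=[InputFormat.XLSX])

# Wrap the uploaded bytes in a DocumentStream
buf = io.BytesIO(file.read())
src = DocumentStream(name=file.name, stream=buf)

# The conversion that fails with the ValueError described above
doclingified = converter.convert(src)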

@ctandrewtran ctandrewtran changed the title Support XLSX Mimetype Bug: Support XLSX Mimetype Dec 4, 2024
@ctandrewtran
Contributor Author

ctandrewtran commented Dec 6, 2024

@ctandrewtran can you please provide a few more details on the failure that you observed?

  • which operating system?
  • which python version?
  • was this on a specific file?
  • running which code/command?

Closing the loop: someone else reported the same bug in the issue I raised in the docling repo:

docling-project/docling#493

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git merged commit d69dd55 into docling-project:main Dec 9, 2024
muhark added a commit to muhark/docling-core that referenced this pull request Mar 19, 2025
Signed-off-by: Andrew Tran <48397302+ctandrewtran@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>