Enable XLSX and PPTX support for OCR (when either Tesseract or Textract is chosen as the engine) #409

@reganwolfrom

Description

Currently, only DOC/DOCX are handled for text extraction when OCR is requested, but extending this functionality to both XLS/XLSX and PPT/PPTX should be reasonable.

NOTE: It's not clear whether the legacy DOC/XLS/PPT formats are as simple to handle, or whether DOC extraction actually works today, since they may not share the open format that makes the newer files extractable.
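One way to check the concern in the note is to look at the container type before attempting extraction. DOCX/XLSX/PPTX files are ZIP packages of XML parts, while legacy DOC/XLS/PPT files are normally OLE2 compound binary files, so the leading bytes tell them apart. The sketch below is illustrative only: `sniff_container` is a hypothetical helper, not something that exists in this project, and a ZIP signature alone does not prove the file is a valid Office document.

```python
# Well-known leading byte signatures for the two container types.
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (DOCX/XLSX/PPTX)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy DOC/XLS/PPT)


def sniff_container(data: bytes) -> str:
    """Classify a file by its first bytes (hypothetical helper).

    Returns "zip", "ole2", or "unknown". This only identifies the container;
    it does not confirm which Office application produced the file.
    """
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"
```

If a DOC upload comes back as `"ole2"`, a ZIP/XML-based extraction path cannot be reading it, which would answer whether DOC is actually working.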

Ideally, any OCR request, regardless of whether the engine is Tesseract or Textract, should extract text from these documents.
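As a rough illustration of why XLSX and PPTX should be reasonable to support, here is a minimal sketch that pulls text out of both using only the Python standard library. It assumes the usual OOXML package layout (`ppt/slides/slideN.xml` for slides, `xl/sharedStrings.xml` for spreadsheet strings). The function names are hypothetical and this is not the project's existing DOCX code path; because it runs before any OCR engine is involved, the same result could be returned for both Tesseract and Textract requests.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace-qualified text elements in DrawingML (slides) and SpreadsheetML.
_SLIDE_TEXT = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"
_SHEET_TEXT = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}t"
_SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def _texts(xml_bytes: bytes, tag: str) -> list[str]:
    """Return the non-empty text of every element with the given tag."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(tag) if el.text]


def extract_pptx_text(data: bytes) -> list[str]:
    """Collect text runs from each slide part, ordered by the number in the part name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        slides = []
        for name in zf.namelist():
            match = _SLIDE_PART.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        out: list[str] = []
        for _, name in sorted(slides):
            out.extend(_texts(zf.read(name), _SLIDE_TEXT))
        return out


def extract_xlsx_text(data: bytes) -> list[str]:
    """Collect strings from the shared-strings part.

    Numeric cells and inline strings live in the worksheet parts and are
    not covered by this sketch.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "xl/sharedStrings.xml" not in zf.namelist():
            return []
        return _texts(zf.read("xl/sharedStrings.xml"), _SHEET_TEXT)
```

Both functions take the raw file bytes, so they could be called as `extract_pptx_text(open(path, "rb").read())`. Text embedded in images inside these files would still need the OCR engine itself.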

Labels

enhancement: New feature or request
