A document conversion plugin for Dify, based on MarkItDown 0.0.2a1, that converts a wide range of file formats into Markdown.
This plugin is an alternative to the traditional document extraction node, offering file conversion within the Dify ecosystem. It uses MarkItDown's plugin-based architecture to convert multiple file formats to Markdown.
- Documents
  - PDF files (.pdf)
  - Microsoft Word documents (.doc, .docx)
  - Microsoft PowerPoint presentations (.ppt, .pptx)
  - Microsoft Excel spreadsheets (.xls, .xlsx)
  - HTML files (.html, .htm)
- Media Files
  - Images with EXIF metadata and OCR support
  - Audio files with metadata and speech transcription
- Data Formats
  - CSV files
  - JSON documents
  - XML files
  - Text files
- Archives
  - ZIP files (with automatic content iteration)
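As a rough illustration of the format list above, the following sketch groups file names by category based on their extension. The `SUPPORTED_EXTENSIONS` table and the `categorize` helper are hypothetical and not part of the plugin. Media files are left out because the list names no specific image or audio extensions, and the extensions used for the data formats (.csv, .json, .xml, .txt) are assumptions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table derived from the format list above.
# Media files are omitted: the list names no specific extensions for them.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".doc", ".docx", ".ppt", ".pptx",
                  ".xls", ".xlsx", ".html", ".htm"},
    "data": {".csv", ".json", ".xml", ".txt"},  # assumed extensions
    "archives": {".zip"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the category for a file name, or None if its extension is not listed."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["report.PDF", "table.csv", "bundle.zip", "notes.xyz"]:
        print(name, "->", categorize(name))
```

A check like this could run before files are handed to the plugin, so that unsupported files are reported early instead of failing during conversion.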
The plugin accepts the following parameters in the Dify interface:

    files:
      type: files
      required: false
      description: "Array of files to be converted to Markdown"

The plugin provides three types of response formats for maximum flexibility:
- JSON Response (New)

      {
        "status": "success|error",
        "total_files": 2,
        "successful_conversions": 2,
        "results": [
          {
            "filename": "example1.pdf",
            "original_format": "pdf",
            "markdown_content": "# Content...",
            "status": "success"
          },
          {
            "filename": "example2.docx",
            "original_format": "docx",
            "markdown_content": "# Content...",
            "status": "success"
          }
        ]
      }

  Error response example:

      {
        "filename": "failed.pdf",
        "original_format": "pdf",
        "error": "Error message",
        "status": "error"
      }

- Blob Response
  - Returns raw markdown content as a blob
  - Includes mime_type metadata: "text/markdown"
  - Useful for direct content processing
- Text Response (Legacy)
  - Single File:

        [Markdown content of the file]

  - Multiple Files:

        ==================================================
        File 1: example1.pdf
        ==================================================
        [Markdown content of file 1]
        ==================================================
        File 2: example2.docx
        ==================================================
        [Markdown content of file 2]
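To show how the JSON response above might be consumed downstream, here is a minimal sketch that separates successful conversions from failed ones. The `split_results` helper and the `SAMPLE` payload are hypothetical. The sketch assumes that per-file error entries appear in the same `results` array as successful entries, which the response example suggests but does not state explicitly.

```python
import json

# Hypothetical payload modeled on the response examples above.
SAMPLE = """
{
  "status": "success",
  "total_files": 2,
  "successful_conversions": 1,
  "results": [
    {"filename": "example1.pdf", "original_format": "pdf",
     "markdown_content": "# Content...", "status": "success"},
    {"filename": "failed.pdf", "original_format": "pdf",
     "error": "Error message", "status": "error"}
  ]
}
"""


def split_results(response_text: str):
    """Return (successes, failures) as dicts keyed by filename.

    Assumes error entries sit in "results" next to successful ones.
    """
    response = json.loads(response_text)
    successes, failures = {}, {}
    for item in response.get("results", []):
        if item.get("status") == "success":
            successes[item["filename"]] = item["markdown_content"]
        else:
            failures[item["filename"]] = item.get("error", "unknown error")
    return successes, failures


if __name__ == "__main__":
    ok, bad = split_results(SAMPLE)
    print("converted:", sorted(ok))
    print("failed:", bad)
```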
Please convert the attached files to Markdown format.
{@markitdown files=["document.pdf", "presentation.pptx"]}
- Batch Processing: Process multiple files in a single request
- Clear File Separation: When processing multiple files, content is clearly separated with headers and dividers
- Format Preservation: Maintains important formatting elements in the Markdown output
- Error Handling: Provides clear error messages for failed conversions
- Automatic Cleanup: Temporary files are automatically managed and cleaned up
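The "Clear File Separation" behavior can be sketched as follows. This is not the plugin's actual code: `join_markdown` is a hypothetical helper that mimics the legacy text format, and both the 50-character divider width and the use of a single newline between sections are assumptions read off the example output.

```python
# Divider width is an assumption based on the example output.
DIVIDER = "=" * 50


def join_markdown(files):
    """Combine (filename, markdown) pairs into one text response.

    A single file is returned as-is; multiple files get a numbered
    header framed by divider lines, mimicking the legacy text format.
    """
    if len(files) == 1:
        return files[0][1]
    sections = []
    for index, (filename, content) in enumerate(files, start=1):
        header = f"{DIVIDER}\nFile {index}: {filename}\n{DIVIDER}"
        sections.append(f"{header}\n{content}")
    return "\n".join(sections)


if __name__ == "__main__":
    print(join_markdown([("example1.pdf", "# First"), ("example2.docx", "# Second")]))
```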
- File Size: While there's no strict limit, it's recommended to keep individual files under 50MB for optimal performance
- Batch Processing: You can process multiple files simultaneously, but consider limiting batches to 5-10 files
- Format Support: When possible, use standard file formats from the supported list for best results
- Error Handling: Always check for error messages in the response when processing critical documents
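The size and batch recommendations above could be applied before submitting files, for example with a small planning step like the one below. `plan_batches` and its constants are hypothetical. The sketch treats 50MB as 50 * 1024 * 1024 bytes and uses the upper end of the suggested 5-10 range as the batch size; both are assumptions.

```python
# Recommended ceiling from the guidance above; not a hard limit.
MAX_FILE_BYTES = 50 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes
MAX_BATCH_SIZE = 10                # upper end of the suggested 5-10 range


def plan_batches(files, batch_size=MAX_BATCH_SIZE):
    """Split (name, size_in_bytes) pairs into batches.

    Returns (batches, oversized): batches is a list of name lists with at
    most batch_size entries each; oversized lists the names at or above
    the recommended size ceiling so the caller can handle them separately.
    """
    oversized = [name for name, size in files if size >= MAX_FILE_BYTES]
    accepted = [name for name, size in files if size < MAX_FILE_BYTES]
    batches = [accepted[i:i + batch_size]
               for i in range(0, len(accepted), batch_size)]
    return batches, oversized


if __name__ == "__main__":
    demo = [(f"doc{i}.pdf", 1_000_000) for i in range(23)]
    demo.append(("huge.pdf", 80 * 1024 * 1024))
    batches, oversized = plan_batches(demo)
    print([len(b) for b in batches], oversized)
```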
Here's how to integrate the plugin into your Dify workflow:
- Document Analysis Flow

      Input: {files} -> MarkItDown Plugin -> LLM Analysis

- Content Extraction Flow

      Input: {files} -> MarkItDown Plugin -> Text Extraction -> Database Storage
- Format Support: Broader range of supported file formats
- Metadata Preservation: Retains important metadata from source files
- Structured Output: Consistently formatted Markdown output
- Batch Processing: Efficient handling of multiple files
- Clear Separation: Better organization of multiple file contents
- Error Handling: Comprehensive error reporting and handling
- Based on MarkItDown 0.0.2a1
- Maintains backward compatibility with 0.0.1a3
- Implements plugin-based architecture for extensibility
- Automatic temporary file management
- Thread-safe processing for concurrent requests
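One common way to get automatic temporary file management that is also safe under concurrent requests is to give each conversion its own temporary directory. The sketch below illustrates that pattern with Python's standard `tempfile` module. It is an assumption about how such cleanup can be done, not the plugin's actual implementation; `convert_with_cleanup` and the `convert` callback are hypothetical names.

```python
import tempfile
from pathlib import Path


def convert_with_cleanup(filename, data, convert):
    """Write uploaded bytes to a private temp directory, convert, then clean up.

    `convert` is a hypothetical callback that takes a Path and returns
    Markdown text. Each call gets its own directory, so concurrent calls
    never share a path, and the directory is removed on exit even if
    the callback raises.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Keep only the base name so the file stays inside tmp_dir.
        tmp_path = Path(tmp_dir) / Path(filename).name
        tmp_path.write_bytes(data)
        return convert(tmp_path)


if __name__ == "__main__":
    # Stand-in converter that just reads the file back as text.
    print(convert_with_cleanup("note.txt", b"# Hello", lambda p: p.read_text()))
```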
- Network connectivity required for some conversion operations
- Processing time may vary based on file size and complexity
- Some advanced formatting may be simplified in the Markdown output
For issues and feature requests, please create an issue in the repository or contact the plugin maintainer.
Note: This plugin is based on MarkItDown 0.0.2a1 and may receive updates as the base library evolves to its first non-alpha release.
- JSON Response
  - Workflow automation and data processing
  - Status tracking and error handling
  - Structured data extraction
  - API integrations
- Blob Response
  - Direct content processing
  - Raw markdown handling
  - Stream processing
  - File system operations
- Text Response
  - Human readable output
  - Legacy system compatibility
  - Simple content viewing
  - Direct LLM input
contact:[email protected]