@FranciscoJBL

This pull request adds experimental PDF table extraction support to MarkItDown, allowing users to extract tables from PDFs as markdown tables using optional dependencies. The feature is exposed via a new --pdf-tables CLI flag and a corresponding Python API parameter, with support for multiple extraction modes. Documentation and tests are included to explain usage and verify behavior.

PDF Table Extraction Feature

  • Added experimental table extraction for PDFs, selectable via the new --pdf-tables CLI flag and pdf_tables Python API parameter. Supports four modes: none (default, plain text), plumber (uses pdfplumber), camelot (uses camelot), and auto (tries plumber then camelot, falls back to plain text). [1] [2] [3]
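
The mode selection and fallback order described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `extract_tables`, the `backends` mapping, and the `Extractor` signature are hypothetical names, and skipping a backend that raises is an assumption about how "falls back" behaves.

```python
from typing import Callable, Dict, List, Optional

Table = List[List[str]]
# Hypothetical extractor signature: takes a PDF path, returns the tables found.
Extractor = Callable[[str], List[Table]]

# Order in which "auto" tries backends, per the PR description.
AUTO_ORDER = ("plumber", "camelot")


def extract_tables(
    path: str, mode: str, backends: Dict[str, Extractor]
) -> Optional[List[Table]]:
    """Return extracted tables, or None to signal the plain-text fallback."""
    if mode == "none":
        return None
    if mode == "auto":
        candidates = AUTO_ORDER
    elif mode in AUTO_ORDER:
        candidates = (mode,)
    else:
        raise ValueError(f"unknown pdf_tables mode: {mode!r}")

    for name in candidates:
        extractor = backends.get(name)
        if extractor is None:
            # Backend not registered, e.g. its optional dependency is missing.
            continue
        try:
            tables = extractor(path)
        except Exception:
            # Assumption: a failing backend is skipped rather than fatal.
            continue
        if tables:
            return tables
    return None
```

Returning `None` instead of raising keeps the caller's plain-text path as the single fallback for every failure case.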

Documentation Updates

  • Updated README.md to document the new table extraction feature, usage examples, supported modes, installation of optional dependencies, and caveats. [1] [2]

Dependency Management

  • Added a new optional dependency group pdf-tables in pyproject.toml, including pdfminer.six, pdfplumber, and camelot-py.
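
One way to check whether the optional packages are importable without actually importing them is `importlib.util.find_spec`. The sketch below is an assumption about how such detection might look, not the converter's actual logic; `OPTIONAL_MODULES` and `available_backends` are hypothetical names. The import names (`pdfplumber`, `camelot`) are the modules that the pdfplumber and camelot-py distributions install.

```python
from importlib.util import find_spec
from typing import Dict, Optional

# Mode name -> top-level import name of the package backing it.
# camelot-py is imported as "camelot".
OPTIONAL_MODULES: Dict[str, str] = {
    "plumber": "pdfplumber",
    "camelot": "camelot",
}


def available_backends(modules: Optional[Dict[str, str]] = None) -> Dict[str, bool]:
    """Map each mode to True if its backing module can be found."""
    modules = OPTIONAL_MODULES if modules is None else modules
    # find_spec returns None for a top-level module that is not installed.
    return {mode: find_spec(name) is not None for mode, name in modules.items()}
```

Calling `available_backends()` at conversion time would let the converter skip a mode whose dependency is absent instead of failing on import.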

Codebase Changes

  • Modified the PDF converter to detect and use optional table extraction libraries, format extracted tables as markdown, and gracefully fall back if dependencies are missing. [1] [2] [3]
  • Updated CLI and API argument handling to pass through the pdf_tables option to converters. [1] [2] [3]
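
Formatting an extracted table as markdown might look like the sketch below. It assumes a table arrives as a list of rows whose cells are strings or `None` (the shape pdfplumber's table extraction returns) and that the first row is the header. `table_to_markdown` is a hypothetical helper, not a function from this PR.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    if width == 0:
        return ""

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Newlines and unescaped pipes would break the table row.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def line(cells: Sequence[Optional[str]]) -> str:
        # Pad short rows so every row has the same number of columns.
        padded = [clean(c) for c in cells] + [""] * (width - len(cells))
        return "| " + " | ".join(padded) + " |"

    header, *body = rows
    out = [line(header), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(row) for row in body)
    return "\n".join(out)
```

Padding ragged rows and escaping pipes keeps the output valid even when the extractor returns merged or empty cells.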

Testing

  • Added new tests for PDF table extraction covering different modes and fallback behavior, using in-memory PDF generation for robustness.

This PR addresses issue #1419.

@FranciscoJBL
Author

@microsoft-github-policy-service agree
