feat: hybrid table extraction with optional plumber/camelot backends #1422

FranciscoJBL · 2025-09-17T23:27:02Z

This pull request adds experimental PDF table extraction support to MarkItDown, allowing users to extract tables from PDFs as markdown tables using optional dependencies. The feature is exposed via a new --pdf-tables CLI flag and a corresponding Python API parameter, with support for multiple extraction modes. Documentation and tests are included to explain usage and verify behavior.

PDF Table Extraction Feature

Added experimental table extraction for PDFs, selectable via the new --pdf-tables CLI flag and pdf_tables Python API parameter. Supports four modes: none (default, plain text), plumber (uses pdfplumber), camelot (uses camelot), and auto (tries plumber then camelot, falls back to plain text). [1] [2] [3]
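For illustration, the four modes could be exposed on the command line roughly as follows. This is a minimal sketch using `argparse`, not the PR's actual code: the program name `markitdown-sketch`, the `build_parser` helper, and the `PDF_TABLE_MODES` constant are hypothetical names chosen for this example.

```python
import argparse

# The four modes named in the description; "none" is the default.
PDF_TABLE_MODES = ("none", "plumber", "camelot", "auto")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a --pdf-tables flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", help="path to the PDF to convert")
    parser.add_argument(
        "--pdf-tables",
        choices=PDF_TABLE_MODES,
        default="none",
        help="table extraction mode (default: none, plain text only)",
    )
    return parser


# argparse stores the flag as args.pdf_tables, matching the API parameter name.
args = build_parser().parse_args(["report.pdf", "--pdf-tables", "auto"])
print(args.pdf_tables)
```

Restricting the flag with `choices` means an unsupported mode is rejected at parse time instead of reaching the converter.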

Documentation Updates

Updated README.md to document the new table extraction feature, usage examples, supported modes, installation of optional dependencies, and caveats. [1] [2]

Dependency Management

Added a new optional dependency group pdf-tables in pyproject.toml, including pdfminer.six, pdfplumber, and camelot-py.
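Because these packages are optional, the converter has to check at runtime which backends are actually installed. One way to do that without importing the libraries is sketched below. The `available_backends` helper is hypothetical, and the sketch assumes the import names are `pdfplumber` and `camelot`.

```python
from importlib.util import find_spec


def available_backends() -> list[str]:
    """Return the table backends whose packages can be imported.

    Hypothetical helper: maps each mode name to the module it needs and
    checks for the module without actually importing it.
    """
    candidates = {"plumber": "pdfplumber", "camelot": "camelot"}
    return [
        mode for mode, module in candidates.items()
        if find_spec(module) is not None
    ]


# Prints an empty list when neither optional package is installed.
print(available_backends())
```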

Codebase Changes

Modified the PDF converter to detect and use optional table extraction libraries, format extracted tables as markdown, and gracefully fall back if dependencies are missing. [1] [2] [3]
Updated CLI and API argument handling to pass through the pdf_tables option to converters. [1] [2] [3]
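The formatting and fallback behavior described above might look something like the sketch below. It is an assumption-laden illustration, not the converter's real implementation: `table_to_markdown` and `extract_with_fallback` are hypothetical names, and tables are assumed to arrive as lists of rows whose cells may be `None` (the shape pdfplumber's `extract_tables()` returns).

```python
from typing import Callable, Optional, Sequence

Table = Sequence[Sequence[Optional[str]]]


def table_to_markdown(rows: Table) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells become "", newlines collapse, pipes are escaped.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_with_fallback(
    extractors: Sequence[Callable[[], Sequence[Table]]],
    plain_text: Callable[[], str],
) -> str:
    """Try each extractor in order; fall back to plain text if all fail."""
    for extract in extractors:
        try:
            tables = extract()
        except Exception:
            # Covers a missing optional dependency (ImportError) or a parse failure.
            continue
        if tables:
            return "\n\n".join(table_to_markdown(t) for t in tables)
    return plain_text()
```

In `auto` mode the extractor list would hold the plumber and camelot callables in that order; in `none` mode it would be empty, so only the plain-text path runs.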

Testing

Added new tests for PDF table extraction covering different modes and fallback behavior, using in-memory PDF generation for robustness.

This PR addresses issue #1419.

FranciscoJBL · 2025-09-18T01:46:07Z

@microsoft-github-policy-service agree

feat: hybrid table extraction with optional plumber/camelot backends

1a7d6a3
