Add support for bounding box extraction in images and `LLM_PARSE` via PaddleOCR #128

dilithjay · 2025-09-15T20:53:26Z

This PR adds support for:

- PaddleOCR through `STATIC_PARSE`. This enables parsing images with `STATIC_PARSE`. Note, however, that the output from PaddleOCR is returned as plain text and is not formatted as Markdown.
- Extracting bounding boxes for the output of `LLM_PARSE`.
- Auto-selection of the framework to use for bounding box detection:
  - If an image is present in the PDF, or if the input itself is an image, use PaddleOCR.
  - Otherwise, if the input is a PDF with only a text layer, use PDFPlumber.
  - The auto-selection is necessary because PDFPlumber's coordinates are exact but are not available when no text layer exists.
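
The selection rule described above could be sketched roughly as follows. This is only an illustration of the decision order, not the code in this PR: `DocumentInfo`, `select_bbox_framework`, and the returned strings are hypothetical names, and the final fallback to OCR when no text layer exists is an assumption based on the rationale given in the description.

```python
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    """Hypothetical summary of an input document (not a Lexoid type)."""

    is_image: bool             # the input file itself is an image
    has_embedded_images: bool  # at least one PDF page contains an image
    has_text_layer: bool       # the PDF exposes extractable text


def select_bbox_framework(doc: DocumentInfo) -> str:
    """Pick a bounding box framework following the rule in the description."""
    # Images, or PDFs that contain images, need OCR to locate text.
    if doc.is_image or doc.has_embedded_images:
        return "paddleocr"
    # A PDF with only a text layer: PDFPlumber coordinates are exact.
    if doc.has_text_layer:
        return "pdfplumber"
    # Assumption: without a text layer there is nothing for PDFPlumber
    # to read, so fall back to OCR.
    return "paddleocr"
```

For example, `select_bbox_framework(DocumentInfo(False, False, True))` selects PDFPlumber, while any input that is an image or contains one is routed to PaddleOCR.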

TODO: Update documentation


Copilot

Pull Request Overview

This PR adds support for bounding box extraction and EasyOCR integration to enable parsing of images and PDFs with images. The primary goal is to extend the framework's capabilities to handle image-based documents while providing accurate bounding box coordinates for text elements.

- Adds EasyOCR dependency and parsing capability through STATIC_PARSE for image-based documents
- Implements automatic framework selection for bounding box detection based on document type
- Refactors bounding box functionality with improved fuzzy matching and bbox splitting for multi-word text
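
To make the last point concrete, here is a minimal sketch of what edit-distance-based fuzzy matching and proportional bbox splitting can look like. It is not the implementation in `lexoid/core/utils.py`: the function names are hypothetical, the distance is a plain-Python Levenshtein, and the splitting assumes a single horizontal line of text with roughly uniform character width.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_match(query: str, candidates: list[str]) -> int:
    """Index of the candidate with the smallest normalized edit distance."""

    def score(candidate: str) -> float:
        longest = max(len(query), len(candidate), 1)
        return edit_distance(query.lower(), candidate.lower()) / longest

    return min(range(len(candidates)), key=lambda i: score(candidates[i]))


def split_bbox(text: str, bbox: tuple[float, float, float, float]):
    """Split a line-level box (x0, y0, x1, y1) into one box per word.

    Widths are proportional to character counts; each space between
    words is assumed to be one character wide.
    """
    x0, y0, x1, y1 = bbox
    words = text.split()
    if not words:
        return []
    total_chars = sum(len(w) for w in words) + len(words) - 1
    char_width = (x1 - x0) / total_chars
    boxes = []
    cursor = x0
    for word in words:
        word_end = cursor + len(word) * char_width
        boxes.append((word, (cursor, y0, word_end, y1)))
        cursor = word_end + char_width  # skip the space after the word
    return boxes
```

Under these assumptions, a line box can first be matched to a target phrase with `best_match` and then divided with `split_bbox` so that each word gets its own approximate rectangle.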

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| pyproject.toml | Adds easyocr dependency |
| lexoid/core/utils.py | Adds bbox routing logic, refactors edit distance function, and improves bounding box matching |
| lexoid/core/parse_type/static_parser.py | Implements EasyOCR parser and adds image parsing support |
| lexoid/core/conversion_utils.py | Refactors base64 conversion functions for better modularity |
| lexoid/api.py | Integrates bbox extraction with automatic framework selection |



Copilot

Pull Request Overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.


dilithjay added 4 commits September 15, 2025 09:45

Add support for parsing images via EasyOCR

8b22a0c

Add support for bounding box extraction in LLM_PARSE

3c320af

Add easyocr dependency

2b39c65

Add auto selection of bounding box detection framework based on presence of images

02a0a5d

dilithjay requested review from Copilot and pramitchoudhary September 15, 2025 20:53

dilithjay self-assigned this Sep 15, 2025

Copilot AI reviewed Sep 15, 2025

lexoid/core/conversion_utils.py (outdated, resolved)

lexoid/core/utils.py (outdated, resolved)

lexoid/core/utils.py (outdated, resolved)

dilithjay and others added 4 commits September 15, 2025 19:34

Update lexoid/core/conversion_utils.py

1322982

Co-authored-by: Copilot <[email protected]>

Update lexoid/core/utils.py

e3efd94

Co-authored-by: Copilot <[email protected]>

Fix comments on edit distance

4d0b50b

Switch from EasyOCR to PaddleOCR

ed5ab1c

dilithjay changed the title ~~Add support for bounding box extraction in images and LLM_PARSE via EasyOCR~~ Add support for bounding box extraction in images and LLM_PARSE via PaddleOCR Sep 19, 2025

dilithjay requested a review from Copilot September 21, 2025 12:12

Copilot AI reviewed Sep 21, 2025

lexoid/core/utils.py (outdated, resolved)

lexoid/core/utils.py (outdated, resolved)

lexoid/core/parse_type/static_parser.py (outdated, resolved)

lexoid/core/parse_type/static_parser.py (outdated, resolved)

dilithjay added 3 commits September 21, 2025 09:16

Add separate code cells for pdfplumber and paddleocr in notebook

74aae6c

Address copilot reviews

1650448

Use paddleocr as fallback instead of pdfminer

ddc0100

pramitchoudhary added the enhancement label Sep 22, 2025

pramitchoudhary reviewed Sep 24, 2025

pyproject.toml (resolved)

pramitchoudhary reviewed Sep 24, 2025

lexoid/core/utils.py (outdated, resolved)

pramitchoudhary reviewed Sep 24, 2025

lexoid/api.py (resolved)

pramitchoudhary reviewed Sep 24, 2025

lexoid/core/parse_type/static_parser.py (outdated, resolved)

pramitchoudhary reviewed Sep 25, 2025

lexoid/core/conversion_utils.py (outdated, resolved)

dilithjay added 5 commits September 26, 2025 18:28

Use levenshtein library and remove easyocr dependency

dc83dcf

Rename base64_to_cv2_image to base64_to_np_array

e35ced5

Fix incorrect bbox_framework logic

8961ded

Update docs

7a802a9

Add comments to reference highlight notebook

49176be

dilithjay merged commit 3931455 into main Sep 29, 2025

dilithjay deleted the dj/ref-llm branch September 29, 2025 13:40

This was referenced Nov 16, 2025

Possible improvements #121

Closed

Add OCR support for static parsers #20

Closed


3 participants
