@dilithjay dilithjay commented Sep 15, 2025

This PR adds support for:

  • PaddleOCR through STATIC_PARSE, which lets STATIC_PARSE parse images. Note that PaddleOCR output is returned as plain text and is not formatted as Markdown.
  • Extracting bounding boxes for the output of LLM_PARSE.
  • Auto-selection of the framework used for bounding box detection:
    • If the PDF contains an image, or if the input itself is an image, use PaddleOCR.
    • Otherwise, if the input is a PDF with only a text layer, use PDFPlumber.
    • The auto-selection is necessary because PDFPlumber's coordinates are exact but are unavailable when no text layer exists.
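
The selection rule above can be sketched roughly as follows. This is only an illustration of the decision order, not the code in this PR: `select_bbox_framework`, the `pdf_has_images` flag, and the `IMAGE_SUFFIXES` set are hypothetical names, and the sketch assumes the caller has already determined whether the PDF embeds any images.

```python
from pathlib import Path

# Assumed set of image extensions; the PR does not list the supported formats.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".webp"}


def select_bbox_framework(path: str, pdf_has_images: bool = False) -> str:
    """Hypothetical helper: choose the bounding box backend for an image or PDF.

    `pdf_has_images` is assumed to be computed by the caller, for example by
    inspecting the pages of the PDF beforehand.
    """
    suffix = Path(path).suffix.lower()
    # Image inputs have no text layer, so OCR is the only option.
    if suffix in IMAGE_SUFFIXES:
        return "paddleocr"
    # A PDF with embedded images may hold text that only OCR can locate.
    if pdf_has_images:
        return "paddleocr"
    # Text-layer-only PDF: PDFPlumber's coordinates are exact.
    return "pdfplumber"
```

With this ordering, OCR is used whenever an image could hide text, and the exact text-layer coordinates are used only when the text layer is the sole source of content.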

TODO: Update documentation

@dilithjay dilithjay self-assigned this Sep 15, 2025
Copilot AI left a comment

Pull Request Overview

This PR adds support for bounding box extraction and EasyOCR integration to enable parsing of images and PDFs with images. The primary goal is to extend the framework's capabilities to handle image-based documents while providing accurate bounding box coordinates for text elements.

  • Adds EasyOCR dependency and parsing capability through STATIC_PARSE for image-based documents
  • Implements automatic framework selection for bounding box detection based on document type
  • Refactors bounding box functionality with improved fuzzy matching and bbox splitting for multi-word text
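
As a rough illustration of what fuzzy matching between parsed text and detected words can look like, the sketch below pairs a target string with the closest candidate by edit distance. It is not the implementation in `lexoid/core/utils.py`: the function names, the `(x0, top, x1, bottom)` bbox layout, and the `max_ratio` threshold are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, top, x1, bottom)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def best_match(target: str, candidates: List[Tuple[str, BBox]],
               max_ratio: float = 0.3) -> Optional[Tuple[str, BBox]]:
    """Return the (text, bbox) closest to `target`, or None if none is close.

    A candidate is accepted only if its case-insensitive edit distance is at
    most `max_ratio` times the target length (a hypothetical threshold).
    """
    best, best_dist = None, None
    for text, bbox in candidates:
        dist = edit_distance(target.lower(), text.lower())
        if best_dist is None or dist < best_dist:
            best, best_dist = (text, bbox), dist
    if best is None or best_dist > max_ratio * max(len(target), 1):
        return None
    return best
```

A tolerance of this kind lets a slightly misread OCR token such as "lnvoice" still be matched to "Invoice", while unrelated words are rejected.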

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds easyocr dependency |
| lexoid/core/utils.py | Adds bbox routing logic, refactors edit distance function, and improves bounding box matching |
| lexoid/core/parse_type/static_parser.py | Implements EasyOCR parser and adds image parsing support |
| lexoid/core/conversion_utils.py | Refactors base64 conversion functions for better modularity |
| lexoid/api.py | Integrates bbox extraction with automatic framework selection |

@dilithjay dilithjay changed the title from "Add support for bounding box extraction in images and LLM_PARSE via EasyOCR" to "Add support for bounding box extraction in images and LLM_PARSE via PaddleOCR" Sep 19, 2025
@dilithjay dilithjay requested a review from Copilot September 21, 2025 12:12
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.


@pramitchoudhary pramitchoudhary added the enhancement (New feature or request) label Sep 22, 2025
@dilithjay dilithjay merged commit 3931455 into main Sep 29, 2025
@dilithjay dilithjay deleted the dj/ref-llm branch September 29, 2025 13:40
This was referenced Nov 16, 2025