@dilithjay
Contributor

This PR adds support for reference highlighting when using STATIC_PARSE with pdfplumber. A function (find_bboxes_for_substring) is also provided to find the correct bounding boxes corresponding to a given substring.

A new example notebook (example_notebook_reference_highlight.ipynb) has been added to demonstrate how it can be used.

There are 2 optional params for find_bboxes_for_substring:

  • fuzzy (default: False): If True, finds the best approximate match (minimum word-level edit distance) instead of requiring an exact match
  • all_matches (default: False): If True, returns bounding boxes for all occurrences of the substring instead of a single occurrence
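
To make the two options concrete, here is a minimal, self-contained sketch of how exact and fuzzy word-level matching could work. It is not the code from this PR: `sketch_find_bboxes` and `word_edit_distance` are hypothetical names, the word dictionaries only mimic the `text`, `x0`, `x1`, `top`, `bottom` keys that pdfplumber's `extract_words()` produces, and the choices to compare fixed-size windows and to return only the single best window in fuzzy mode are simplifying assumptions.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance where one edit inserts, deletes, or swaps a whole word."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sketch_find_bboxes(
    words: List[Dict],
    substring: str,
    fuzzy: bool = False,
    all_matches: bool = False,
) -> List[List[BBox]]:
    """Hypothetical sketch: return one list of word boxes per matched occurrence."""
    query = substring.lower().split()
    tokens = [w["text"].lower() for w in words]
    n = len(query)
    if n == 0 or n > len(tokens):
        return []
    windows = range(len(tokens) - n + 1)
    if fuzzy:
        # Assumption: fuzzy mode keeps only the closest window of n words.
        best = min(windows, key=lambda s: word_edit_distance(tokens[s:s + n], query))
        starts = [best]
    else:
        starts = [s for s in windows if tokens[s:s + n] == query]
        if not all_matches:
            starts = starts[:1]
    return [
        [(w["x0"], w["top"], w["x1"], w["bottom"]) for w in words[s:s + n]]
        for s in starts
    ]


# Synthetic words laid out on one line, 10 points apart.
words = [
    {"text": t, "x0": 10.0 * i, "top": 0.0, "x1": 10.0 * i + 8.0, "bottom": 12.0}
    for i, t in enumerate("the cat sat on the mat".split())
]
exact = sketch_find_bboxes(words, "the", all_matches=True)   # two occurrences
fuzzy = sketch_find_bboxes(words, "sat in the", fuzzy=True)  # closest to "sat on the"
```

In this sketch `exact` holds one single-word box list per occurrence of "the", and `fuzzy` resolves the misquoted "sat in the" to the boxes of "sat on the", because that window is one word substitution away from the query.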

@dilithjay dilithjay self-assigned this Sep 8, 2025
@dilithjay dilithjay added the enhancement New feature or request label Sep 8, 2025

Copilot AI left a comment

Pull Request Overview

This PR adds support for reference highlighting when using STATIC_PARSE with pdfplumber. The implementation enables locating bounding boxes for text substrings within PDF documents, supporting both exact and fuzzy matching.

  • Adds a new utility function find_bboxes_for_substring with fuzzy matching capabilities
  • Modifies PDF parsing to track word-level bounding boxes alongside content extraction
  • Updates return types to include bounding box metadata for downstream highlighting
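
As an illustration of the second and third points, the sketch below shows one way word-level boxes could be collected in step with the extracted text. It is an assumption-laden sketch, not the parser from this PR: `collect_content_and_bboxes` and the record layout (`text`, `page`, `start`, `bbox`) are hypothetical, and the input dictionaries only imitate the keys that pdfplumber's `extract_words()` returns.

```python
from typing import Dict, List, Tuple


def collect_content_and_bboxes(
    words: List[Dict], page_num: int
) -> Tuple[str, List[Dict]]:
    """Hypothetical sketch: build page text and a parallel list of word records."""
    parts: List[str] = []
    records: List[Dict] = []
    offset = 0  # character offset of the current word inside the joined text
    for w in words:
        text = w["text"]
        records.append(
            {
                "text": text,
                "page": page_num,
                "start": offset,
                "bbox": (w["x0"], w["top"], w["x1"], w["bottom"]),
            }
        )
        parts.append(text)
        offset += len(text) + 1  # +1 for the single space used when joining
    return " ".join(parts), records


# Synthetic stand-ins for pdfplumber word dictionaries.
sample_words = [
    {"text": "Hello", "x0": 72.0, "top": 100.0, "x1": 100.0, "bottom": 112.0},
    {"text": "world", "x0": 104.0, "top": 100.0, "x1": 132.0, "bottom": 112.0},
]
content, records = collect_content_and_bboxes(sample_words, page_num=1)
```

Storing a character offset with each box means a substring located in `content` can be mapped back to the words, and therefore the rectangles, that cover it.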

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File | Description
lexoid/core/utils.py | Adds text processing utilities and the main find_bboxes_for_substring function with fuzzy matching support
lexoid/core/parse_type/static_parser.py | Updates PDF parsing to collect and return word bounding boxes alongside extracted content

Contributor

@pramitchoudhary pramitchoudhary left a comment

Minor suggestions; otherwise LGTM.

@dilithjay dilithjay merged commit d7d357f into main Sep 9, 2025
@dilithjay dilithjay deleted the dj/ref branch September 9, 2025 15:48
Labels

enhancement New feature or request

3 participants