Add support for reference highlighting #127
Pull Request Overview
This PR adds support for reference highlighting when using STATIC_PARSE with pdfplumber. The implementation enables locating bounding boxes for text substrings within PDF documents, supporting both exact and fuzzy matching.
- Adds a new utility function find_bboxes_for_substring with fuzzy matching capabilities
- Modifies PDF parsing to track word-level bounding boxes alongside content extraction
- Updates return types to include bounding box metadata for downstream highlighting
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| lexoid/core/utils.py | Adds text processing utilities and the main find_bboxes_for_substring function with fuzzy matching support |
| lexoid/core/parse_type/static_parser.py | Updates PDF parsing to collect and return word bounding boxes alongside extracted content |
pramitchoudhary
left a comment
Minor suggestions otherwise LGTM
This PR adds support for reference highlighting when using STATIC_PARSE with pdfplumber. A function (find_bboxes_for_substring) is also provided to find the correct bounding boxes corresponding to a given substring.
A new example notebook (example_notebook_reference_highlight.ipynb) has been added to demonstrate how it can be used.
There are 2 optional params for find_bboxes_for_substring:
- fuzzy (default: False): If True, finds the best approximate match (min word-level edit distance)
- all_matches (default: False): If True, returns bounding boxes for all occurrences of the substring
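To make the two optional parameters concrete, here is a minimal sketch of how word-level matching over bounding boxes could work. It is not the code from this PR: the name find_bboxes_sketch, the (word, bbox) input format, the (x0, top, x1, bottom) box layout, and the return shape are all assumptions made for illustration. It also assumes that fuzzy mode compares fixed-size word windows and returns only the single best window, which may differ from how the real function combines fuzzy and all_matches.

```python
from typing import List, Tuple

# Assumed layout: (x0, top, x1, bottom). Hypothetical, not taken from the PR.
BBox = Tuple[float, float, float, float]
Word = Tuple[str, BBox]


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance where the unit of edit is a whole word."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete a word from a
                            curr[j - 1] + 1,      # insert a word into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def find_bboxes_sketch(
    words: List[Word],
    substring: str,
    fuzzy: bool = False,
    all_matches: bool = False,
) -> List[List[BBox]]:
    """Return one list of word bounding boxes per match (hypothetical helper)."""
    target = substring.lower().split()
    n = len(target)
    if n == 0 or n > len(words):
        return []
    tokens = [text.lower() for text, _ in words]
    starts = range(len(tokens) - n + 1)
    if fuzzy:
        # Assumption: fuzzy mode keeps only the window with the smallest
        # word-level edit distance to the query.
        best = min(starts, key=lambda s: word_edit_distance(tokens[s:s + n], target))
        chosen = [best]
    else:
        chosen = [s for s in starts if tokens[s:s + n] == target]
        if not all_matches:
            chosen = chosen[:1]
    return [[bbox for _, bbox in words[s:s + n]] for s in chosen]


sample_words: List[Word] = [
    ("The", (0, 0, 10, 5)), ("quick", (11, 0, 30, 5)),
    ("brown", (31, 0, 50, 5)), ("fox", (51, 0, 60, 5)),
    ("the", (0, 6, 10, 11)), ("quick", (11, 6, 30, 11)),
    ("red", (31, 6, 45, 11)), ("fox", (46, 6, 55, 11)),
]

first_only = find_bboxes_sketch(sample_words, "the quick")
every_match = find_bboxes_sketch(sample_words, "the quick", all_matches=True)
approximate = find_bboxes_sketch(sample_words, "quick red fax", fuzzy=True)
```

In this sketch, first_only holds one match, every_match holds both occurrences of "the quick", and approximate picks the "quick red fox" window even though the query misspells the last word. The per-word boxes in each match could then be merged or drawn individually for highlighting.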