PR #1805: Fix GLiNER truncating text (issue #1569)
Base branch: main. Changes from all commits:
6c82ee7, b04d9c7, e0eb745, 71fb611, c986737, ea49b70, 83e2bd4, 5553245, 0d53ce1, c1ae52f, 560021c, 1556d73, 9324450
File: presidio_analyzer/chunkers/__init__.py (new file, +19 lines)

```python
"""Text chunking strategies for handling long texts."""
from presidio_analyzer.chunkers.base_chunker import BaseTextChunker
from presidio_analyzer.chunkers.character_based_text_chunker import (
    CharacterBasedTextChunker,
)
from presidio_analyzer.chunkers.chunking_utils import (
    deduplicate_overlapping_entities,
    predict_with_chunking,
    process_text_in_chunks,
)

__all__ = [
    "BaseTextChunker",
    "CharacterBasedTextChunker",
    "predict_with_chunking",
    "process_text_in_chunks",
    "deduplicate_overlapping_entities",
]
```
File: presidio_analyzer/chunkers/base_chunker.py (new file, +16 lines)

```python
"""Abstract base class for text chunking strategies."""
from abc import ABC, abstractmethod
from typing import List


class BaseTextChunker(ABC):
    """Abstract base class for text chunking strategies."""

    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        """Split text into chunks.

        :param text: The input text to split
        :return: List of text chunks
        """
        pass
```
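To illustrate how this abstract base class is meant to be extended, here is a minimal standalone sketch of a custom chunker. The `LineBasedTextChunker` name and its behavior are illustrative assumptions (not part of this PR), and the ABC is restated inline so the sketch runs on its own:

```python
# Hypothetical example of implementing the BaseTextChunker contract.
# LineBasedTextChunker is illustrative only; the ABC is restated inline
# so the sketch is self-contained.
from abc import ABC, abstractmethod
from typing import List


class BaseTextChunker(ABC):
    """Abstract base class for text chunking strategies."""

    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        """Split text into chunks."""


class LineBasedTextChunker(BaseTextChunker):
    """Toy chunker that treats each non-empty line as one chunk."""

    def chunk(self, text: str) -> List[str]:
        return [line for line in text.splitlines() if line.strip()]


chunker = LineBasedTextChunker()
print(chunker.chunk("John lives in Oslo.\n\nCall 555-0100."))
# ['John lives in Oslo.', 'Call 555-0100.']
```

Any concrete strategy only needs to implement `chunk`; the utilities below accept whatever subclass is passed in.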
File: presidio_analyzer/chunkers/character_based_text_chunker.py (new file, +86 lines)

```python
"""Character-based text chunker with word boundary preservation.
```
Collaborator: Could you add some logging in debug mode?

Author: Done!
```python
Based on gliner-spacy implementation:
https://github.com/theirstory/gliner-spacy/blob/main/gliner_spacy/pipeline.py#L60-L96
"""
import logging
from typing import List

from presidio_analyzer.chunkers.base_chunker import BaseTextChunker

logger = logging.getLogger("presidio-analyzer")


class CharacterBasedTextChunker(BaseTextChunker):
    """Character-based text chunker with word boundary preservation."""

    def __init__(self, chunk_size: int, chunk_overlap: int = 0):
        """Initialize the character-based text chunker.

        Note: Chunks may slightly exceed chunk_size to preserve complete words.
        When this occurs, the actual overlap may vary from the specified value.

        :param chunk_size: Target maximum characters per chunk (must be > 0)
        :param chunk_overlap: Target characters to overlap between chunks
            (must be >= 0 and < chunk_size)
        """
        if chunk_size <= 0:
```
Collaborator: Should we extract the validation logic into a separate validation function?

Author: Good point, but following YAGNI I don't think we need to. The validation is only two lines and very straightforward. Extracting it would add complexity without real benefit, since there is only one chunker type currently and no other code needs this validation. We can refactor later if we add more chunker implementations that share validation logic.
```python
            logger.error("Invalid chunk_size: %d. Must be greater than 0.", chunk_size)
            raise ValueError("chunk_size must be greater than 0")
        if chunk_overlap < 0 or chunk_overlap >= chunk_size:
            logger.error(
                "Invalid chunk_overlap. Must be non-negative and less than chunk_size"
            )
            raise ValueError(
                "chunk_overlap must be non-negative and less than chunk_size"
            )

        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, text: str) -> List[str]:
        """Split text into overlapping chunks at word boundaries.

        Chunks are extended to the nearest word boundary (space or newline)
        to avoid splitting words. This means chunks may slightly exceed
        chunk_size. For texts without spaces (e.g., CJK languages), chunks
        may extend to end of text.

        :param text: The input text to chunk
        :return: List of text chunks with overlap
        """
        if not text:
            logger.debug("Empty text provided, returning empty chunk list")
            return []

        logger.debug(
            "Chunking text: length=%d, chunk_size=%d, overlap=%d",
            len(text),
            self.chunk_size,
            self.chunk_overlap,
        )

        chunks = []
```
Collaborator: What would happen with languages that have no spaces or "\n"? Should we log a warning?

Author: Good catch, but no warning is needed. The docstring already documents this ("For texts without spaces (e.g., CJK languages), chunks may extend to end of text"), and a warning would just add unnecessary noise. Most real-world CJK text has punctuation or newlines to serve as boundaries. For purely spaceless text, not splitting mid-character is the right choice to avoid corrupting Unicode. If CJK truncation becomes a real issue, we can add character-based fallback chunking or other chunking approaches as a future enhancement.
```python
        start = 0

        while start < len(text):
            # Calculate end position
            end = (
                start + self.chunk_size
                if start + self.chunk_size < len(text)
                else len(text)
            )

            # Extend to complete word boundary (space or newline)
            while end < len(text) and text[end] not in [" ", "\n"]:
```
Collaborator: Should we extract this into a constant-level parameter and consider additional word boundaries?

Author: This is a good suggestion; I can certainly consider doing this in another PR, would that be okay?
```python
                end += 1

            chunks.append(text[start:end])

            # Move start position with overlap (stop if we've covered all text)
            if end >= len(text):
                break
            start = end - self.chunk_overlap

        logger.debug("Created %d chunks from text", len(chunks))
        return chunks
```
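To make the boundary-extension and overlap behavior concrete, here is a minimal standalone sketch that mirrors the chunking loop above (re-implemented inline rather than imported, so the numbers can be checked without installing the package):

```python
# Standalone re-implementation of the chunking loop above, for illustration.
# Mirrors CharacterBasedTextChunker.chunk; not imported from presidio.
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Extend forward to the next space/newline so words are never split
        while end < len(text) and text[end] not in (" ", "\n"):
            end += 1
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - chunk_overlap
    return chunks


print(chunk_text("alpha beta gamma delta", chunk_size=10, chunk_overlap=3))
# ['alpha beta', 'eta gamma delta']
```

Note that only chunk *ends* are aligned to word boundaries: the overlap window steps back a fixed number of characters, so the second chunk can begin mid-word ("eta" from "beta"). Deduplication downstream is what reconciles entities seen twice in the overlap region.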
File: presidio_analyzer/chunkers/chunking_utils.py (new file, +102 lines)

```python
"""Utility functions for processing text with chunking strategies."""
```
Collaborator: Did we consider either (1) reusing an existing splitter (LangChain / NLTK / spaCy / HF tokenizers) or (2) at least aligning our implementation with their separator hierarchy pattern (paragraph → line → word → char)?

Author: Good point! We considered this but chose a simple custom implementation for several reasons.
```python
from typing import Any, Callable, Dict, List

from presidio_analyzer.chunkers.base_chunker import BaseTextChunker


def predict_with_chunking(
    text: str,
    predict_func: Callable[[str], List[Dict[str, Any]]],
    chunker: BaseTextChunker,
) -> List[Dict[str, Any]]:
    """Process text with automatic chunking for long texts.

    For short text (<= chunker.chunk_size), calls predict_func directly.
    For long text, chunks it and merges predictions with deduplication.

    :param text: Input text to process
    :param predict_func: Function that takes text and returns predictions
    :param chunker: Text chunking strategy (contains chunk_size and chunk_overlap)
    :return: List of predictions with correct offsets
    """
    if len(text) <= chunker.chunk_size:
        return predict_func(text)

    predictions = process_text_in_chunks(
        text=text,
        chunker=chunker,
        process_func=predict_func,
    )
    return deduplicate_overlapping_entities(predictions)


def process_text_in_chunks(
    text: str,
    chunker: BaseTextChunker,
    process_func: Callable[[str], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Process text in chunks and adjust entity offsets.

    :param text: Input text to process
    :param chunker: Text chunking strategy
    :param process_func: Function that takes chunk text and returns predictions
    :return: List of predictions with adjusted offsets
    """
    chunks = chunker.chunk(text)
    all_predictions = []
    offset = 0

    for chunk in chunks:
        chunk_predictions = process_func(chunk)

        # Adjust offsets to match original text position
        for pred in chunk_predictions:
            pred["start"] += offset
```
Collaborator: Should we validate that predictions have the required keys, or catch the exception if one chunk fails and log a warning?

Author: I'd say not adding validation or error handling here is the right approach. The type hints define the contract: callers must provide the correct format.
```python
            pred["end"] += offset

        all_predictions.extend(chunk_predictions)
        offset += len(chunk) - chunker.chunk_overlap

    return all_predictions


def deduplicate_overlapping_entities(
    predictions: List[Dict[str, Any]], overlap_threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Remove duplicate entities from overlapping chunks.

    :param predictions: List of predictions with 'start', 'end', 'label',
        'score'
    :param overlap_threshold: Overlap ratio threshold to consider duplicates
        (default: 0.5)
    :return: Deduplicated list of predictions sorted by position
    """
    if not predictions:
        return predictions

    # Sort by score descending to keep highest scoring entities
    sorted_preds = sorted(predictions, key=lambda p: p["score"], reverse=True)
    unique = []

    for pred in sorted_preds:
```
Collaborator: For n predictions this is O(n²); could we optimize it with a library or a more sophisticated approach?

Author: This is a very good observation! I have discussed this with Sharon H; TL;DR:
```python
        is_duplicate = False
        for kept in unique:
            # Check if same entity type and overlapping positions
            if pred["label"] == kept["label"]:
                overlap_start = max(pred["start"], kept["start"])
                overlap_end = min(pred["end"], kept["end"])

                if overlap_start < overlap_end:
                    # Calculate overlap ratio
                    overlap_len = overlap_end - overlap_start
                    pred_len = pred["end"] - pred["start"]
                    kept_len = kept["end"] - kept["start"]

                    # Check if overlap exceeds threshold
                    if overlap_len / min(pred_len, kept_len) > overlap_threshold:
                        is_duplicate = True
                        break

        if not is_duplicate:
            unique.append(pred)

    # Sort by position for consistent output
    return sorted(unique, key=lambda p: p["start"])
```
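A small standalone sketch of the deduplication rule above, showing how the same entity reported by two overlapping chunks collapses to the higher-scoring copy. The helper mirrors `deduplicate_overlapping_entities` inline (not imported from presidio), and the sample predictions are made up for illustration:

```python
# Standalone mirror of the deduplication logic above, for illustration only.
from typing import Any, Dict, List


def dedupe(preds: List[Dict[str, Any]], threshold: float = 0.5) -> List[Dict[str, Any]]:
    # Highest-scoring predictions win ties in the overlap region
    sorted_preds = sorted(preds, key=lambda p: p["score"], reverse=True)
    unique: List[Dict[str, Any]] = []
    for pred in sorted_preds:
        is_dup = False
        for kept in unique:
            if pred["label"] != kept["label"]:
                continue
            overlap = min(pred["end"], kept["end"]) - max(pred["start"], kept["start"])
            if overlap > 0:
                shorter = min(pred["end"] - pred["start"], kept["end"] - kept["start"])
                if overlap / shorter > threshold:
                    is_dup = True
                    break
        if not is_dup:
            unique.append(pred)
    return sorted(unique, key=lambda p: p["start"])


# The same PERSON span seen from two overlapping chunks, plus one LOCATION.
preds = [
    {"start": 12, "end": 20, "label": "PERSON", "score": 0.80},
    {"start": 12, "end": 20, "label": "PERSON", "score": 0.95},
    {"start": 40, "end": 52, "label": "LOCATION", "score": 0.70},
]
print(dedupe(preds))  # PERSON kept once (score 0.95), LOCATION untouched
```

The overlap ratio is computed against the shorter of the two spans, so a short entity fully contained in a longer one of the same label is also treated as a duplicate.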
Collaborator: Looking at these classes, I'm thinking about applying the Factory pattern here. The idea is that users would define the chunker type via a string in their YAML config (e.g., chunker_type: "character"), and the factory would instantiate the appropriate chunker for them. This would align with how Presidio handles other configs. WDYT?

Author: Good idea for future extensibility! However, I'd suggest keeping the current approach for now:
- Only one chunker exists, so a factory would be premature abstraction.
- The current API is flexible: users can already pass any chunker via the parameter in GLiNERRecognizer.
- YAGNI principle: we can add the factory pattern when we actually have multiple chunker types that need runtime selection.

I have made sure the current design allows future chunker implementations (e.g., sentence-based, semantic-based); we can introduce a factory then. The current design doesn't prevent that future addition. Does that make sense?
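For reference, a hypothetical sketch of the factory the collaborator proposes. All names here (`create_chunker`, the registry, the `"character"` key) are illustrative assumptions, not part of the PR, and the chunker classes are simplified inline stand-ins:

```python
# Hypothetical factory sketch for the discussion above; names are illustrative.
from typing import Dict, List, Type


class BaseTextChunker:  # simplified stand-in for the PR's ABC
    def chunk(self, text: str) -> List[str]:
        raise NotImplementedError


class CharacterBasedTextChunker(BaseTextChunker):  # simplified stand-in
    def __init__(self, chunk_size: int, chunk_overlap: int = 0):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


# New chunker types would register themselves here
_CHUNKER_REGISTRY: Dict[str, Type[BaseTextChunker]] = {
    "character": CharacterBasedTextChunker,
}


def create_chunker(chunker_type: str, **kwargs) -> BaseTextChunker:
    """Instantiate a chunker from a type name, e.g. from YAML config."""
    try:
        cls = _CHUNKER_REGISTRY[chunker_type]
    except KeyError:
        raise ValueError(f"Unknown chunker_type: {chunker_type!r}")
    return cls(**kwargs)


chunker = create_chunker("character", chunk_size=250, chunk_overlap=50)
```

A registry dict like this keeps the factory open for extension (new entries) without modifying the creation function, which is what would make a later "sentence" or "semantic" chunker a one-line addition.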