1569 Fix gliner truncates text #1805
Conversation
ShakutaiGit left a comment:
This PR would be a great addition to Presidio's capabilities!! It could probably be used in other use cases as well. Left a few comments.
```python
:param chunk_overlap: Target characters to overlap between chunks
    (must be >= 0 and < chunk_size)
"""
if chunk_size <= 0:
```
ShakutaiGit: Should we extract the validation logic into a separate validation function?
Reply: This is a good point, but following YAGNI I don't think we need to right now. The validation is only two lines and very straightforward; extracting it would add complexity without real benefit, since there's only one chunker type currently and no other code needs this validation. We can refactor later if we add more chunker implementations that share validation logic.
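For reference, a minimal sketch of what that inline validation amounts to (a sketch assuming the constructor signature implied by the docstring quoted above, with the PR's default values; not the exact PR code):

```python
class CharacterBasedTextChunker:
    """Sketch of the constructor validation discussed in this thread."""

    def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):
        # Fail fast on invalid configuration, per the quoted docstring
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if chunk_overlap < 0 or chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
```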
```
@@ -0,0 +1,19 @@
"""Text chunking strategies for handling long texts."""
```
ShakutaiGit: Looking at these classes, I'm thinking about applying the Factory pattern here. The idea is that users would define the chunker type via a string in their YAML config (e.g., `chunker_type: "character"`), and the factory would instantiate the appropriate chunker for them. This would align with how Presidio handles other configs. WDYT?
Reply: Good idea for future extensibility! However, I'd suggest keeping the current approach for now:
- Only one chunker exists, so a factory would be premature abstraction.
- The current API is flexible: users can already pass any chunker via the parameter in GLiNERRecognizer.
- YAGNI principle: we can add the Factory pattern once we actually have multiple chunker types that need runtime selection.

I have made sure the current design allows for future implementation of new chunkers (e.g., sentence-based, semantic-based), so we can introduce a factory then; the current design doesn't prevent that future addition. A hypothetical sketch of what it could look like is below. Does that make sense?
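For illustration only, a minimal sketch of such a factory (the function name `create_chunker` and the registry are assumptions, not part of this PR):

```python
def create_chunker(chunker_type: str, **kwargs):
    """Hypothetical factory: map a YAML config string to a chunker instance."""
    chunkers = {
        "character": CharacterBasedTextChunker,  # as sketched above
        # future: "sentence": ..., "semantic": ...
    }
    if chunker_type not in chunkers:
        raise ValueError(f"Unknown chunker_type: {chunker_type!r}")
    return chunkers[chunker_type](**kwargs)
```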
```python
)

# Extend to complete word boundary (space or newline)
while end < len(text) and text[end] not in [" ", "\n"]:
```
ShakutaiGit: Should we extract this into a constant-level parameter and think about more word boundary characters? Should we give the user the option to extend this list via config or something else?
Reply: This is a good suggestion; I can certainly consider doing it in another PR, would that be okay? The current simple implementation solves the immediate problem (GLiNER truncation), and we can enhance boundary detection as a separate feature if there's actual demand. A possible shape is sketched below.
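For illustration, one way the boundary characters could be lifted into a module-level constant, per the suggestion (the constant and helper names are hypothetical):

```python
WORD_BOUNDARY_CHARS = frozenset({" ", "\n", "\t"})  # hypothetical constant


def extend_to_word_boundary(text: str, end: int) -> int:
    """Advance `end` past the current word so chunks never split mid-word."""
    while end < len(text) and text[end] not in WORD_BOUNDARY_CHARS:
        end += 1
    return end
```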
```
@@ -0,0 +1,70 @@
"""Character-based text chunker with word boundary preservation.
```
ShakutaiGit: Could you add some logs in debug mode?
Reply: Done! (Along the lines of the sketch below.)
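A minimal sketch of the kind of debug logging added (the helper name and exact message are illustrative; the logger name follows Presidio's convention but is an assumption here):

```python
import logging

logger = logging.getLogger("presidio-analyzer")  # logger name assumed


def log_chunking_debug(text: str, chunks: list, chunk_size: int, chunk_overlap: int) -> None:
    """Illustrative helper: debug-level visibility into chunking decisions."""
    logger.debug(
        "Chunked text of length %d into %d chunks (chunk_size=%d, overlap=%d)",
        len(text), len(chunks), chunk_size, chunk_overlap,
    )
```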
```python
if not text:
    return []

chunks = []
```
ShakutaiGit: What would happen when we're using languages with no spaces or `\n`? Should we log a warning?
Reply: Good catch, but no warning is needed. The docstring already documents this: "For texts without spaces (e.g., CJK languages), chunks may extend to end of text." A warning would just add unnecessary noise, and we also have a unit test documenting the behaviour for devs using this.

Most real-world CJK text has punctuation or newlines that serve as boundaries. For pure spaceless text, not splitting mid-character is the right choice to avoid corrupting Unicode. If CJK truncation becomes a real issue, we can add character-based fallback chunking or other chunking approaches as future enhancements. WDYT?
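To make the documented behaviour concrete, a small illustration (the `chunk_text` method name and the constructor follow the sketch earlier in this thread; both are assumptions):

```python
chunker = CharacterBasedTextChunker(chunk_size=10, chunk_overlap=2)
chunks = chunker.chunk_text("これは句読点のない長い日本語のテキストです")
# No " " or "\n" occurs in this text, so the first chunk extends past
# chunk_size all the way to the end of the text instead of splitting
# mid-character, per the docstring quoted in the reply above.
```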
```
@@ -0,0 +1,105 @@
"""Utility functions for processing text with chunking strategies."""
```
ShakutaiGit: Did we consider either (1) reusing an existing splitter (LangChain / NLTK / spaCy / HF tokenizers) or (2) at least aligning our implementation with their separator hierarchy pattern (paragraph → line → word → char)?
Reply: Good point! We considered this but chose a simple custom implementation for several reasons; please check the commit message for the justification and the approaches considered.
```python
pred["end"] += offset

all_predictions.extend(chunk_predictions)
offset += len(chunk) - chunk_overlap
```
ShakutaiGit: The offset calculation assumes fixed overlap, but CharacterBasedTextChunker extends to word boundaries. Could this cause entity position errors?
Reply: Verified that this works correctly. The offset tracks where the next chunk starts, not where the previous one ends: when a chunk extends to a word boundary, the chunker starts the next chunk at `start = end - chunk_overlap`, so advancing the offset by `len(chunk) - chunk_overlap` always lands on the next chunk's true start. This is verified by `test_long_text_with_offset_adjustment` in `test_chunking_utils.py`, which passes. A sketch of the bookkeeping:
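(A minimal sketch built from the quoted code; the function name, the `chunk_text` method, and the `chunk_overlap` attribute on the chunker are illustrative assumptions, not the exact PR implementation.)

```python
def run_with_chunking(text: str, chunker, process_func):
    """Run process_func per chunk and map spans back to the original text."""
    all_predictions = []
    offset = 0  # position of the current chunk's first character in `text`
    for chunk in chunker.chunk_text(text):
        chunk_predictions = process_func(chunk)
        for pred in chunk_predictions:
            # Shift chunk-local spans to original-text coordinates
            pred["start"] += offset
            pred["end"] += offset
        all_predictions.extend(chunk_predictions)
        # The next chunk starts at (this chunk's end - chunk_overlap), i.e.
        # at offset + len(chunk) - chunk_overlap, whether or not the chunk
        # was extended to a word boundary.
        offset += len(chunk) - chunker.chunk_overlap
    return all_predictions
```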
```python
# Adjust offsets to match original text position
for pred in chunk_predictions:
    pred["start"] += offset
```
ShakutaiGit: Should we validate that predictions have the required keys? Or catch the exception if one chunk fails and log a warning?
Reply: I'd say not adding validation or error handling here is the right way:
- Type hints define the contract; callers must provide the correct format.
- Fail fast is better: if predictions are malformed, a KeyError immediately shows where the bug is.
- It's consistent with the existing code.
- Performance: this runs for every prediction, so validation would add unnecessary overhead.

If a chunk fails, it's a recognizer bug that needs fixing, not something to silently skip.
```python
sorted_preds = sorted(predictions, key=lambda p: p["score"], reverse=True)
unique = []

for pred in sorted_preds:
```
ShakutaiGit: For n predictions this is O(n²); could we optimize it using a library or a more sophisticated approach?
Reply: This is a very good observation! I have discussed it with Sharon H, and the justification is in the commit message as well. TL;DR: I'd suggest keeping the current simple implementation for now since:
- It's readable and maintainable.
- Performance is acceptable for typical entity counts.
- Adding a dependency just for this would increase complexity.

A sketch of the current approach is below. WDYT?
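For reference, the score-based O(n²) deduplication under discussion, sketched from the quoted code (the `_overlap_ratio` helper and its exact definition are assumptions; the 0.5 default reflects the 50% threshold in the PR description):

```python
def deduplicate_predictions(predictions, overlap_threshold: float = 0.5):
    """Keep the highest-scoring prediction among mutually overlapping spans."""
    sorted_preds = sorted(predictions, key=lambda p: p["score"], reverse=True)
    unique = []
    for pred in sorted_preds:
        # Linear scan against already-kept predictions -> O(n^2) overall
        if not any(_overlap_ratio(pred, kept) > overlap_threshold for kept in unique):
            unique.append(pred)
    return unique


def _overlap_ratio(a, b) -> float:
    """Fraction of the shorter span covered by the intersection."""
    intersection = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    return intersection / shorter if shorter else 0.0
```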
```python
:param text: Input text to process
:param chunker: Text chunking strategy
:param process_func: Function that takes chunk text and returns predictions
:param chunk_overlap: Number of characters overlapping between chunks
```
ShakutaiGit: Why pass `chunk_overlap` as a separate parameter when it's already available on the chunker? Could this lead to inconsistencies?
Reply: You are right! Removed.
Change Description
Enable calling the recognizer with a chunker so that character-based text chunking is applied when calling analyze in the GLiNER recognizer.
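A hedged usage sketch (the chunker's import path and the exact GLiNERRecognizer signature are assumptions, not confirmed against the final code):

```python
from presidio_analyzer.predefined_recognizers import GLiNERRecognizer
# Assumed: CharacterBasedTextChunker import path as introduced by this PR

long_text = "John Smith lives in Berlin. " * 1000  # far beyond 384 tokens

chunker = CharacterBasedTextChunker(chunk_size=250, chunk_overlap=50)
recognizer = GLiNERRecognizer(
    supported_entities=["PERSON", "LOCATION"],
    chunker=chunker,  # new: long texts are chunked instead of truncated
)
results = recognizer.analyze(long_text, entities=["PERSON", "LOCATION"])
```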
Issue reference
Fixes #1569
Architectural/Technical Decisions:
🎯 Problem Statement
Issue #1569: the GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are truncated, surfacing only a warning: "UserWarning: Sentence of length 20415 has been truncated to 384".
Impact: PII entities appearing beyond the truncation point are silently missed.
Root Cause: GLiNER truncates input to its fixed context window of 384 tokens before detecting PII.
🔍 Approach Evaluation
Option 1: Increase Model Context Window
Description: Retrain or use larger GLiNER model with higher token limit
Decision: ❌ Rejected - Not feasible, GLiNER architecture is fixed
Option 2: Token-Based Chunking
Description: Use GLiNER's tokenizer to chunk at exact token boundaries
Decision: ❌ Rejected - Complexity outweighs benefits
Option 3: Character-Based Chunking ✅
Description: Split text by character count with configurable overlap
Decision: ✅ SELECTED - Best balance of simplicity and effectiveness
Rationale:
Option 4: Sentence-Based Chunking
Description: Split on sentence boundaries using NLP
Decision: ❌ Rejected - Unnecessary complexity for current needs
Future Consideration: Could implement as alternative strategy later
🎛️ Key Design Decisions
Decision 1: Chunk Size = 250 characters
Options Considered:
Analysis:
Why 250?
Decision 2: Overlap = 50 characters (20%)
Problem: Entities at chunk boundaries might be split
Options Considered:
Why 20%?
Decision 3: Word Boundary Preservation
Problem: Mid-word breaks confuse NER models
Options Considered:
Trade-off Accepted:
Decision 4: Deduplication Strategy
Problem: Overlapping chunks produce duplicate entities
Approach: Score-based deduplication with overlap threshold
Why 50% threshold?
Alternative Considered:
Decision 5: Architecture Pattern = Strategy Pattern
Why Strategy Pattern?
Benefits:
Trade-off:
Decision 6: Utility Functions - Simple Over Generic
Approach: Utilities designed for character-based chunking with clear assumptions
Key Assumptions:
- `chunk_size` (characters)
- `chunk_overlap` (characters)
- `len(text) <= chunker.chunk_size`

Why Simple Design?
Alternative Considered: More Generic Interface
Why Rejected:
Trade-off Accepted:
🏗️ Implementation Decisions
Offset Calculation
Critical Decision: How to map chunk entity positions to original text?
Approach Selected:
Why `len(chunk)` and not `chunk_size`? Chunks extend to word boundaries, so a chunk's actual length can exceed `chunk_size`; advancing the offset by `len(chunk) - chunk_overlap` keeps it aligned with where the next chunk really starts.

Validation:
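As an illustrative check, with the PR defaults (`chunk_size=250`, `chunk_overlap=50`) and assumed numbers: if the first chunk extends past 250 characters to a word boundary at character 257, the next chunk starts at 257 - 50 = 207. Advancing the offset by `len(chunk) - chunk_overlap` = 257 - 50 = 207 lands exactly on that start, whereas advancing by `chunk_size - chunk_overlap` = 200 would shift every subsequent entity by 7 characters.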
Error Handling Philosophy
Decision: Trust upstream components (GLiNER), minimal defensive coding
Rationale from Code Review:
Approach:
Example - Rejected Defensive Code:
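A representative sketch of the rejected defensive style (illustrative only; the function and names are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def adjust_offsets_defensively(chunk_predictions, text, offset):
    """The rejected style, sketched: silently skip 'bad' predictions."""
    for pred in chunk_predictions:
        # Rejected: GLiNER never returns invalid positions, so this check
        # adds overhead and hides real bugs instead of surfacing them.
        if pred["start"] < 0 or pred["end"] > len(text):
            logger.warning("Skipping prediction with invalid span: %s", pred)
            continue
        pred["start"] += offset
        pred["end"] += offset
```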
Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.
📊 Trade-offs Summary
Performance vs Accuracy
Decision: Prioritize accuracy over raw speed
Performance Result:
Simplicity vs Flexibility
Decision: Strategy pattern for future extensibility
Trade-off:
Character-based vs Token-based
Decision: Character-based for simplicity
Trade-off:
🧪 Validation Approach
Testing Strategy
Test Coverage:
Key Test Insights:
🚀 Future Considerations
Immediate Monitoring Needs
Potential Enhancements
Enhancement 1: Parallel Chunk Processing
When: If latency becomes an issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents
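A hypothetical sketch of what this enhancement could look like with a thread pool (names are illustrative; per-chunk offsets are assumed to be precomputed so results can be merged in any order):

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunks_in_parallel(chunks, offsets, process_func, max_workers=4):
    """Run model inference per chunk concurrently, then merge adjusted spans."""
    def run_one(chunk_and_offset):
        chunk, offset = chunk_and_offset
        predictions = process_func(chunk)
        for pred in predictions:
            pred["start"] += offset
            pred["end"] += offset
        return predictions

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(run_one, zip(chunks, offsets)))
    return [pred for preds in per_chunk for pred in preds]
```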
Enhancement 2: Adaptive Chunk Size
When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain
Enhancement 3: Alternative Chunking Strategies
When: User needs semantic chunking
Approach: Implement via Strategy pattern
Enhancement 4: Deduplication Optimization
When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough