1569 Fix gliner truncates text #1805

jedheaj314 · 2025-12-02T16:56:55Z

Change Description

Change to be able to to call recogniser with chunker and call text chunking (character based) when calling analyse in gliner recogniser.

Updated gliner recognizer to call predict function after chunking long text
New base and character based chunkers
New chunking util for deduplication and offset calculations
Unit tests

Issue reference

Fixes #1569

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

Architectural/Technical Decisions:

🎯 Problem Statement

Issue #1569: GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are silently truncated with a warning "UserWarning: Sentence of length 20415 has been truncated to 384".

Impact:

Security: PII entities beyond 384 tokens are not detected
Compliance: Incomplete data protection scanning (GDPR, HIPAA violations)
Reliability: Silent failure mode with no user notification

Root Cause: GLiNER truncating text to fixed context window of 384 tokens to detect PIIs.

🔍 Approach Evaluation

Option 1: Increase Model Context Window

Description: Retrain or use larger GLiNER model with higher token limit
Decision: ❌ Rejected - Not feasible, GLiNER architecture is fixed

Option 2: Token-Based Chunking

Description: Use GLiNER's tokenizer to chunk at exact token boundaries

Pros	Cons
Precise token count control	Requires tokenizer loading
Guarantees < 384 tokens	Model-specific implementation
Optimal model utilization	Complex offset calculation
	Tokenizer adds latency
	Ties implementation to model architecture

Decision: ❌ Rejected - Complexity outweighs benefits

Option 3: Character-Based Chunking ✅

Description: Split text by character count with configurable overlap

Pros	Cons
Simple implementation	Approximate token count
Model-agnostic	Needs safety margin
Fast (no tokenization)	May underutilize context window
Easy to configure	Requires deduplication
Works with any NER model

Decision: ✅ SELECTED - Best balance of simplicity and effectiveness

Rationale:

GLiNER averages ~1.5 chars/token
250 chars = ~166 tokens (safe margin under 384)
Simplicity enables maintainability
Works with future models without changes

Option 4: Sentence-Based Chunking

Description: Split on sentence boundaries using NLP

Pros	Cons
Natural language boundaries	Requires sentence tokenizer
Preserves context	Variable chunk sizes
Better semantic coherence	Long sentences problematic
	Adds dependency

Decision: ❌ Rejected - Unnecessary complexity for current needs

Future Consideration: Could implement as alternative strategy later

🎛️ Key Design Decisions

Decision 1: Chunk Size = 250 characters

Options Considered:

100 chars: Too conservative, many chunks, slow
250 chars: ✅ Selected - ~166 tokens, safe margin
500 chars: Too risky, may exceed 384 token limit

Analysis:

Token estimation: chars / 1.5 = tokens
250 / 1.5 = ~166 tokens (< 384 ✓)
Safety margin: 218 tokens = 57% buffer

Why 250?

Proven ratio from gliner-spacy reference implementation
Sufficient context for entity recognition
Safe margin for token variation (special chars, unicode)

Decision 2: Overlap = 50 characters (20%)

Problem: Entities at chunk boundaries might be split

Options Considered:

0% overlap: Fast but misses boundary entities ❌
10% (25 chars): ❌ Too small, may still split entities
20% (50 chars): ✅ Selected - catches boundary entities
50% (125 chars): ❌ Excessive duplication, slow

Why 20%?

Average entity length: 10-30 characters
50 char overlap covers typical entity spans
Balances coverage vs redundant processing
Standard practice in NLP chunking

Decision 3: Word Boundary Preservation

Problem: Mid-word breaks confuse NER models

# Extends chunk to nearest word boundary
while end < len(text) and text[end] not in [" ", "\n"]:
    end += 1

Options Considered:

Approach	Pros	Cons	Decision
Hard cutoff	Simple	Breaks words	❌
Word boundary	Preserves context	Variable chunk size	✅
Sentence boundary	Best context	Too complex	❌

Trade-off Accepted:

Chunks may exceed 250 chars slightly
Still well under token limit (tested: max ~280 chars)
Better entity detection outweighs strict size limit

Decision 4: Deduplication Strategy

Problem: Overlapping chunks produce duplicate entities

Approach: Score-based deduplication with overlap threshold

overlap_ratio = overlap_length / min(entity1_length, entity2_length)
if overlap_ratio > 0.5:  # 50% threshold
    keep_highest_score_entity()

Why 50% threshold?

Catches true duplicates: [10:20] and [10:20] → 100% overlap
Allows partial overlaps: [10:20] and [15:25] → 50% overlap
Avoids false positives: [10:20] and [30:40] → 0% overlap

Alternative Considered:

Exact match only (start==start, end==end)
- ❌ Misses duplicates with slight offset differences
- Models may return [10:20] in one chunk, [11:20] in another

Decision 5: Architecture Pattern = Strategy Pattern

Why Strategy Pattern?

BaseTextChunker (interface)
    ↓
LocalTextChunker (current implementation)
    ↓
Future: RemoteChunker, LangChainChunker, SemanticChunker

Benefits:

Open/Closed Principle: New strategies without modifying existing code
Testability: Easy to mock chunker in tests
Flexibility: Users can plug in custom chunkers
Future-proof: Supports evolution of chunking approaches

Trade-off:

More code upfront (abstract base class)
✅ Worth it: Enables extensibility without breaking changes

Decision 6: Utility Functions - Simple Over Generic

Approach: Utilities designed for character-based chunking with clear assumptions

predict_with_chunking(
    text=text,
    predict_func=any_prediction_function,  # Model-agnostic
    chunker=any_chunker_implementation      # Character-based expected
)

Key Assumptions:

Chunker has chunk_size (characters)
Chunker has chunk_overlap (characters)
Short-circuit check: len(text) <= chunker.chunk_size

Why Simple Design?

YAGNI Principle: Token-based chunking was rejected, no need to optimize for it
Clarity: Code is immediately understandable
Maintainability: Fewer abstractions = easier debugging
Strategy Pattern Already Provides Extensibility: New chunker types can be added

Alternative Considered: More Generic Interface

# Rejected approach
chunker.should_chunk(text)  # Instead of len() check
chunker.get_overlap_size()  # Dynamic overlap

Why Rejected:

❌ Premature abstraction for unused feature
❌ More complex interface
❌ Harder to understand
✅ Can refactor later if token-based is actually needed (unlikely)

Trade-off Accepted:

Future chunker types must work with character-based assumptions OR
Refactor utils if genuinely needed (straightforward change)

🏗️ Implementation Decisions

Offset Calculation

Critical Decision: How to map chunk entity positions to original text?

Approach Selected:

offset += len(chunk) - chunk_overlap

Why len(chunk) not chunk_size?

Word boundary extension creates variable-length chunks
Using actual length ensures accurate position mapping
Example: chunk_size=250, actual=273 (word boundary)

Validation:

Tested with word boundaries: ✅ Positions correct
Tested with CJK text (no spaces): ✅ Works
Tested with special characters: ✅ Accurate

Error Handling Philosophy

Decision: Trust upstream components (GLiNER), minimal defensive coding

Rationale from Code Review:

23 potential issues identified
22 were false alarms (defensive programming paranoia)
1 real bug (parameter redundancy - fixed)

Approach:

✅ Trust GLiNER to return valid predictions
✅ Trust Python to handle edge cases (empty strings, etc.)
❌ Avoid unnecessary validation code
❌ Don't add error handling "just in case"

Example - Rejected Defensive Code:

# Considered but rejected
if pred["end"] > len(text):
    logger.warning("Entity beyond text")
    continue

Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.

📊 Trade-offs Summary

Performance vs Accuracy

Decision: Prioritize accuracy over raw speed

Aspect	Choice	Rationale
Overlap	20% duplication	Catches boundary entities
Word boundaries	Variable chunk size	Better entity detection
Deduplication	O(n²) algorithm	Simple and correct

Performance Result:

1,000 entities: 12,824/sec (fast enough)
Typical doc: 10-200 entities, < 0.1s overhead
✅ No optimization needed yet

Simplicity vs Flexibility

Decision: Strategy pattern for future extensibility

Trade-off:

More code upfront (base class + concrete)
Benefit: Easy to add new chunking strategies
Verdict: ✅ Worth it - prevents future breaking changes

Character-based vs Token-based

Decision: Character-based for simplicity

Trade-off:

Less precise token control
Benefit: Model-agnostic, no tokenizer overhead
Mitigation: Large safety margin (57% buffer)
Verdict: ✅ Simplicity wins

🧪 Validation Approach

Testing Strategy

Test Coverage:

27 unit tests
Edge cases: CJK text, newlines, empty strings, zero-length entities
Real-world scenarios: Long documents, overlapping entities

Key Test Insights:

Word boundaries don't cause infinite loops (verified with CJK)
Offset calculation handles variable chunks (verified with entity positions)
Deduplication handles zero-length entities (edge case discovered)

🚀 Future Considerations

Immediate Monitoring Needs

Chunk count distribution: How many chunks per document?
Deduplication rate: How many duplicates removed?
Latency impact: Overhead from chunking?

Potential Enhancements

Enhancement 1: Parallel Chunk Processing

When: If latency becomes issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents

Enhancement 2: Adaptive Chunk Size

When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain

Enhancement 3: Alternative Chunking Strategies

When: User needs semantic chunking
Approach: Implement via Strategy pattern

SemanticChunker()  # Preserves paragraphs
SentenceChunker()  # Natural sentence boundaries  
LangChainChunker() # Integration with LangChain

Enhancement 4: Deduplication Optimization

When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough

…l to chunking from gliner recognizer

ShakutaiGit

This PR would be a Great addition to presidio capabilities !! and probably used in other use cases.
Left a few comments

ShakutaiGit · 2025-12-03T10:45:12Z

presidio-analyzer/presidio_analyzer/chunkers/character_based_text_chunker.py

+        :param chunk_overlap: Target characters to overlap between chunks
+            (must be >= 0 and < chunk_size)
+        """
+        if chunk_size <= 0:


Should we export the validation logic into validation function ?

This is a good point, but following YAGNI now I think No, I don't think we need to. The validation is only 2 lines and very straightforward. Extracting it would add complexity without real benefit since there's only one chunker type currently and no other code needs this validation. We can refactor later if we add more chunker implementations that share validation logic.

ShakutaiGit · 2025-12-03T10:58:38Z

presidio-analyzer/presidio_analyzer/chunkers/__init__.py

@@ -0,0 +1,19 @@
+"""Text chunking strategies for handling long texts."""


Looking at these classes, I'm thinking about applying the Factory pattern here.

The idea is that users would define the chunker type via a string in their YAML config (e.g., chunker_type: "character"), and the factory would instantiate the appropriate chunker for them.

This would align with how Presidio handles other configs

WDYT?

Good idea for future extensibility! However, I'd suggest keeping the current approach for now:

Only one chunker exists so a factory would be premature abstraction
Current API is flexible - Users can already pass any chunker via the parameter in GLiNERRecognizer
YAGNI principle - We can add the factory pattern when we actually have multiple chunker types that need runtime selection
I have ensured that the current design allows for furture implementation of new chunkers (e.g., sentence-based, semantic-based), we can introduce a factory then. The current design doesn't prevent that future addition.
Does that make sense?

ShakutaiGit · 2025-12-03T11:02:32Z

presidio-analyzer/presidio_analyzer/chunkers/character_based_text_chunker.py

+            )
+
+            # Extend to complete word boundary (space or newline)
+            while end < len(text) and text[end] not in [" ", "\n"]:


Should we export it to constant level parameter and think about more word boundary ?
Should we let the user option to enhance this list via config or something else ?

This is good suggestion, I can certainly consider doing this in another PR, would that be okay?
As the current simple implementation solves the immediate problem (GLiNER truncation). We can enhance boundary detection as a separate feature if there's actual demand.

ShakutaiGit · 2025-12-03T11:03:35Z

presidio-analyzer/presidio_analyzer/chunkers/character_based_text_chunker.py

@@ -0,0 +1,70 @@
+"""Character-based text chunker with word boundary preservation.


Could you add a bit logs in debug mode ?

ShakutaiGit · 2025-12-03T11:05:23Z

presidio-analyzer/presidio_analyzer/chunkers/character_based_text_chunker.py

+        if not text:
+            return []
+
+        chunks = []


what would happen when we using languages with no spaces , /n ? should we log a warning ?

Good catch, but No warning needed. The docstring already documents this: "For texts without spaces (e.g., CJK languages), chunks may extend to end of text." and warning will just add unnecessary noise.
and we also have a unit test suggesting the behaviour to devs using this.

Most real-world CJK text has punctuation or newlines for boundaries. For pure spaceless text, not splitting mid-character is the right choice to avoid corrupting Unicode. If CJK truncation becomes a real issue, we can add character-based fallback chunking or other ways of chunking approach as a future enhancements.
wdyt?

ShakutaiGit · 2025-12-03T11:08:19Z