Skip to content

Conversation

@jedheaj314
Copy link

Change Description

Change to be able to to call recogniser with chunker and call text chunking (character based) when calling analyse in gliner recogniser.

  • Updated gliner recognizer to call predict function after chunking long text
  • New base and character based chunkers
  • New chunking util for deduplication and offset calculations
  • Unit tests

Issue reference

Fixes #1569

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Architectural/Technical Decisions:

🎯 Problem Statement

Issue #1569: GLiNER checkpoint has a maximum sequence length of 384 tokens. Documents exceeding this limit are silently truncated with a warning "UserWarning: Sentence of length 20415 has been truncated to 384".

Impact:

  • Security: PII entities beyond 384 tokens are not detected
  • Compliance: Incomplete data protection scanning (GDPR, HIPAA violations)
  • Reliability: Silent failure mode with no user notification

Root Cause: GLiNER truncating text to fixed context window of 384 tokens to detect PIIs.


🔍 Approach Evaluation

Option 1: Increase Model Context Window

Description: Retrain or use larger GLiNER model with higher token limit
Decision:Rejected - Not feasible, GLiNER architecture is fixed


Option 2: Token-Based Chunking

Description: Use GLiNER's tokenizer to chunk at exact token boundaries

Pros Cons
Precise token count control Requires tokenizer loading
Guarantees < 384 tokens Model-specific implementation
Optimal model utilization Complex offset calculation
Tokenizer adds latency
Ties implementation to model architecture

Decision:Rejected - Complexity outweighs benefits


Option 3: Character-Based Chunking ✅

Description: Split text by character count with configurable overlap

Pros Cons
Simple implementation Approximate token count
Model-agnostic Needs safety margin
Fast (no tokenization) May underutilize context window
Easy to configure Requires deduplication
Works with any NER model

Decision:SELECTED - Best balance of simplicity and effectiveness

Rationale:

  • GLiNER averages ~1.5 chars/token
  • 250 chars = ~166 tokens (safe margin under 384)
  • Simplicity enables maintainability
  • Works with future models without changes

Option 4: Sentence-Based Chunking

Description: Split on sentence boundaries using NLP

Pros Cons
Natural language boundaries Requires sentence tokenizer
Preserves context Variable chunk sizes
Better semantic coherence Long sentences problematic
Adds dependency

Decision:Rejected - Unnecessary complexity for current needs

Future Consideration: Could implement as alternative strategy later


🎛️ Key Design Decisions

Decision 1: Chunk Size = 250 characters

Options Considered:

  • 100 chars: Too conservative, many chunks, slow
  • 250 chars: ✅ Selected - ~166 tokens, safe margin
  • 500 chars: Too risky, may exceed 384 token limit

Analysis:

Token estimation: chars / 1.5 = tokens
250 / 1.5 = ~166 tokens (< 384 ✓)
Safety margin: 218 tokens = 57% buffer

Why 250?

  • Proven ratio from gliner-spacy reference implementation
  • Sufficient context for entity recognition
  • Safe margin for token variation (special chars, unicode)

Decision 2: Overlap = 50 characters (20%)

Problem: Entities at chunk boundaries might be split

Options Considered:

  • 0% overlap: Fast but misses boundary entities ❌
  • 10% (25 chars): ❌ Too small, may still split entities
  • 20% (50 chars): ✅ Selected - catches boundary entities
  • 50% (125 chars): ❌ Excessive duplication, slow

Why 20%?

  • Average entity length: 10-30 characters
  • 50 char overlap covers typical entity spans
  • Balances coverage vs redundant processing
  • Standard practice in NLP chunking

Decision 3: Word Boundary Preservation

Problem: Mid-word breaks confuse NER models

# Extends chunk to nearest word boundary
while end < len(text) and text[end] not in [" ", "\n"]:
    end += 1

Options Considered:

Approach Pros Cons Decision
Hard cutoff Simple Breaks words
Word boundary Preserves context Variable chunk size
Sentence boundary Best context Too complex

Trade-off Accepted:

  • Chunks may exceed 250 chars slightly
  • Still well under token limit (tested: max ~280 chars)
  • Better entity detection outweighs strict size limit

Decision 4: Deduplication Strategy

Problem: Overlapping chunks produce duplicate entities

Approach: Score-based deduplication with overlap threshold

overlap_ratio = overlap_length / min(entity1_length, entity2_length)
if overlap_ratio > 0.5:  # 50% threshold
    keep_highest_score_entity()

Why 50% threshold?

  • Catches true duplicates: [10:20] and [10:20] → 100% overlap
  • Allows partial overlaps: [10:20] and [15:25] → 50% overlap
  • Avoids false positives: [10:20] and [30:40] → 0% overlap

Alternative Considered:

  • Exact match only (start==start, end==end)
    • ❌ Misses duplicates with slight offset differences
    • Models may return [10:20] in one chunk, [11:20] in another

Decision 5: Architecture Pattern = Strategy Pattern

Why Strategy Pattern?

BaseTextChunker (interface)
    ↓
LocalTextChunker (current implementation)
    ↓
Future: RemoteChunker, LangChainChunker, SemanticChunker

Benefits:

  • Open/Closed Principle: New strategies without modifying existing code
  • Testability: Easy to mock chunker in tests
  • Flexibility: Users can plug in custom chunkers
  • Future-proof: Supports evolution of chunking approaches

Trade-off:

  • More code upfront (abstract base class)
  • ✅ Worth it: Enables extensibility without breaking changes

Decision 6: Utility Functions - Simple Over Generic

Approach: Utilities designed for character-based chunking with clear assumptions

predict_with_chunking(
    text=text,
    predict_func=any_prediction_function,  # Model-agnostic
    chunker=any_chunker_implementation      # Character-based expected
)

Key Assumptions:

  • Chunker has chunk_size (characters)
  • Chunker has chunk_overlap (characters)
  • Short-circuit check: len(text) <= chunker.chunk_size

Why Simple Design?

  • YAGNI Principle: Token-based chunking was rejected, no need to optimize for it
  • Clarity: Code is immediately understandable
  • Maintainability: Fewer abstractions = easier debugging
  • Strategy Pattern Already Provides Extensibility: New chunker types can be added

Alternative Considered: More Generic Interface

# Rejected approach
chunker.should_chunk(text)  # Instead of len() check
chunker.get_overlap_size()  # Dynamic overlap

Why Rejected:

  • ❌ Premature abstraction for unused feature
  • ❌ More complex interface
  • ❌ Harder to understand
  • ✅ Can refactor later if token-based is actually needed (unlikely)

Trade-off Accepted:

  • Future chunker types must work with character-based assumptions OR
  • Refactor utils if genuinely needed (straightforward change)

🏗️ Implementation Decisions

Offset Calculation

Critical Decision: How to map chunk entity positions to original text?

Approach Selected:

offset += len(chunk) - chunk_overlap

Why len(chunk) not chunk_size?

  • Word boundary extension creates variable-length chunks
  • Using actual length ensures accurate position mapping
  • Example: chunk_size=250, actual=273 (word boundary)

Validation:

  • Tested with word boundaries: ✅ Positions correct
  • Tested with CJK text (no spaces): ✅ Works
  • Tested with special characters: ✅ Accurate

Error Handling Philosophy

Decision: Trust upstream components (GLiNER), minimal defensive coding

Rationale from Code Review:

  • 23 potential issues identified
  • 22 were false alarms (defensive programming paranoia)
  • 1 real bug (parameter redundancy - fixed)

Approach:

  • ✅ Trust GLiNER to return valid predictions
  • ✅ Trust Python to handle edge cases (empty strings, etc.)
  • ❌ Avoid unnecessary validation code
  • ❌ Don't add error handling "just in case"

Example - Rejected Defensive Code:

# Considered but rejected
if pred["end"] > len(text):
    logger.warning("Entity beyond text")
    continue

Why rejected? GLiNER never returns invalid positions. Adding checks adds complexity for zero benefit.


📊 Trade-offs Summary

Performance vs Accuracy

Decision: Prioritize accuracy over raw speed

Aspect Choice Rationale
Overlap 20% duplication Catches boundary entities
Word boundaries Variable chunk size Better entity detection
Deduplication O(n²) algorithm Simple and correct

Performance Result:

  • 1,000 entities: 12,824/sec (fast enough)
  • Typical doc: 10-200 entities, < 0.1s overhead
  • ✅ No optimization needed yet

Simplicity vs Flexibility

Decision: Strategy pattern for future extensibility

Trade-off:

  • More code upfront (base class + concrete)
  • Benefit: Easy to add new chunking strategies
  • Verdict: ✅ Worth it - prevents future breaking changes

Character-based vs Token-based

Decision: Character-based for simplicity

Trade-off:

  • Less precise token control
  • Benefit: Model-agnostic, no tokenizer overhead
  • Mitigation: Large safety margin (57% buffer)
  • Verdict: ✅ Simplicity wins

🧪 Validation Approach

Testing Strategy

Test Coverage:

  • 27 unit tests
  • Edge cases: CJK text, newlines, empty strings, zero-length entities
  • Real-world scenarios: Long documents, overlapping entities

Key Test Insights:

  1. Word boundaries don't cause infinite loops (verified with CJK)
  2. Offset calculation handles variable chunks (verified with entity positions)
  3. Deduplication handles zero-length entities (edge case discovered)

🚀 Future Considerations

Immediate Monitoring Needs

  1. Chunk count distribution: How many chunks per document?
  2. Deduplication rate: How many duplicates removed?
  3. Latency impact: Overhead from chunking?

Potential Enhancements

Enhancement 1: Parallel Chunk Processing

When: If latency becomes issue (>1s per document)
Approach: Process chunks concurrently
Expected gain: 2-3x speedup for large documents

Enhancement 2: Adaptive Chunk Size

When: If we see frequent boundary misses
Approach: Adjust chunk size based on entity density
Trade-off: Added complexity vs marginal gain

Enhancement 3: Alternative Chunking Strategies

When: User needs semantic chunking
Approach: Implement via Strategy pattern

SemanticChunker()  # Preserves paragraphs
SentenceChunker()  # Natural sentence boundaries  
LangChainChunker() # Integration with LangChain

Enhancement 4: Deduplication Optimization

When: Typical document has >1000 entities
Approach: Spatial indexing (O(n log n))
Current: Not needed - O(n²) is fast enough


@ShakutaiGit ShakutaiGit self-requested a review December 3, 2025 10:29
Copy link
Collaborator

@ShakutaiGit ShakutaiGit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR would be a Great addition to presidio capabilities !! and probably used in other use cases.
Left a few comments

:param chunk_overlap: Target characters to overlap between chunks
(must be >= 0 and < chunk_size)
"""
if chunk_size <= 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we export the validation logic into validation function ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, but following YAGNI now I think No, I don't think we need to. The validation is only 2 lines and very straightforward. Extracting it would add complexity without real benefit since there's only one chunker type currently and no other code needs this validation. We can refactor later if we add more chunker implementations that share validation logic.

@@ -0,0 +1,19 @@
"""Text chunking strategies for handling long texts."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at these classes, I'm thinking about applying the Factory pattern here.

The idea is that users would define the chunker type via a string in their YAML config (e.g., chunker_type: "character"), and the factory would instantiate the appropriate chunker for them.

This would align with how Presidio handles other configs

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea for future extensibility! However, I'd suggest keeping the current approach for now:

Only one chunker exists so a factory would be premature abstraction
Current API is flexible - Users can already pass any chunker via the parameter in GLiNERRecognizer
YAGNI principle - We can add the factory pattern when we actually have multiple chunker types that need runtime selection
I have ensured that the current design allows for furture implementation of new chunkers (e.g., sentence-based, semantic-based), we can introduce a factory then. The current design doesn't prevent that future addition.
Does that make sense?

)

# Extend to complete word boundary (space or newline)
while end < len(text) and text[end] not in [" ", "\n"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we export it to constant level parameter and think about more word boundary ?
Should we let the user option to enhance this list via config or something else ?

Copy link
Author

@jedheaj314 jedheaj314 Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good suggestion, I can certainly consider doing this in another PR, would that be okay?
As the current simple implementation solves the immediate problem (GLiNER truncation). We can enhance boundary detection as a separate feature if there's actual demand.

@@ -0,0 +1,70 @@
"""Character-based text chunker with word boundary preservation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a bit logs in debug mode ?

Copy link
Author

@jedheaj314 jedheaj314 Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

if not text:
return []

chunks = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen when we using languages with no spaces , /n ? should we log a warning ?

Copy link
Author

@jedheaj314 jedheaj314 Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, but No warning needed. The docstring already documents this: "For texts without spaces (e.g., CJK languages), chunks may extend to end of text." and warning will just add unnecessary noise.
and we also have a unit test suggesting the behaviour to devs using this.

Most real-world CJK text has punctuation or newlines for boundaries. For pure spaceless text, not splitting mid-character is the right choice to avoid corrupting Unicode. If CJK truncation becomes a real issue, we can add character-based fallback chunking or other ways of chunking approach as a future enhancements.
wdyt?

@@ -0,0 +1,105 @@
"""Utility functions for processing text with chunking strategies."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider either (1) reusing an existing splitter (LangChain / NLTK / spaCy / HF tokenizers) or (2) at least aligning our implementation with their separator hierarchy pattern (paragraph → line → word → char)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! We considered this but chose a simple custom implementation for several reasons
Please check the commit message for the justification/info around approaches considered

pred["end"] += offset

all_predictions.extend(chunk_predictions)
offset += len(chunk) - chunk_overlap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset calculation assumes fixed overlap, but CharacterBasedTextChunker extends to word boundaries. Could this cause entity position errors?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified this works correctly. The offset tracks where the next chunk starts, not where it ends. When chunks extend to word boundaries, the chunker's [start = end - chunk_overlap]
This is verified by test_long_text_with_offset_adjustment in test_chunking_utils.py which passes.


# Adjust offsets to match original text position
for pred in chunk_predictions:
pred["start"] += offset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate that predictions have required keys? or catch exception if one chunk fail? and log warning ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say not adding validation or error handling here would be the right way

Type hints define the contract - callers must provide the correct format
Fail fast is better - if predictions are malformed, a KeyError immediately shows where the bug is
Consistent with existing code
Performance - this runs for every prediction, so validation adds unnecessary overhead
If a chunk fails, it's a recognizer bug that needs fixing, not something to silently skip.

sorted_preds = sorted(predictions, key=lambda p: p["score"], reverse=True)
unique = []

for pred in sorted_preds:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For n predictions, this is O(n²), could we optimize it using any library or more sophisticated approach ?

Copy link
Author

@jedheaj314 jedheaj314 Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good observation! have discussed this with Sharon H
I have mentioned this and justification in the commit message as well

TLDR is
I'd suggest keeping the current simple implementation for now since:
It's readable and maintainable
Performance is acceptable for typical entity counts
Adding a dependency just for this would increase complexity
wdyt?

:param text: Input text to process
:param chunker: Text chunking strategy
:param process_func: Function that takes chunk text and returns predictions
:param chunk_overlap: Number of characters overlapping between chunks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass chunk_overlapas a separate parameter when it's already available on the chunker? Could this lead to inconsistencies?

Copy link
Author

@jedheaj314 jedheaj314 Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right! Removed.

@ShakutaiGit ShakutaiGit requested a review from omri374 December 3, 2025 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GLiNER Recognizer Truncates Long Text, Leading to Poor Redaction Results

2 participants