Description
User Story: Enhanced Context-Aware PII Detection
Improve PII detection accuracy using contextual information around matched patterns
As a developer or data privacy engineer using Microsoft Presidio, I want a mechanism that enhances PII detection by analyzing the surrounding textual context of a detected entity (e.g., regex match), so that false positives are reduced and weak matches can be validated or promoted using linguistic cues like nearby keywords, semantic meaning, or lightweight machine learning.
Description / Acceptance Criteria
Background
Presidio currently supports context words for recognizers and uses regex or NER-based models to detect PII. However, many PII types (e.g., ages, credit cards, dates, ZIP codes) often appear in ambiguous formats or without strong patterns. Simple regexes may generate false positives or miss entities if context isn't considered. For example:
- 555-212-1234 is likely a phone number only if a word such as "phone" or "number" appears nearby.
- "He just turned 6" should indicate that 6 is likely an age and not an arbitrary single-digit number.
 
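To make the phone number example concrete, here is a minimal standalone sketch of keyword-window scoring. It does not use Presidio's API: the function name `score_matches`, the keyword set, and the numeric values for `base_score`, `boost`, and `window` are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern and context words, for illustration only.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
PHONE_CONTEXT = {"phone", "number", "call", "tel"}


def score_matches(text, pattern=PHONE_PATTERN, context_words=PHONE_CONTEXT,
                  base_score=0.4, boost=0.35, window=5):
    """Return (matched_text, score) pairs.

    A match keeps `base_score` unless one of `context_words` occurs within
    `window` word tokens before or after it, in which case `boost` is added.
    """
    results = []
    for m in pattern.finditer(text):
        # Word tokens immediately before and after the match.
        before = re.findall(r"\w+", text[:m.start()].lower())[-window:]
        after = re.findall(r"\w+", text[m.end():].lower())[:window]
        has_context = any(tok in context_words for tok in before + after)
        score = min(1.0, base_score + boost) if has_context else base_score
        results.append((m.group(), score))
    return results


print(score_matches("My phone number is 555-212-1234"))
print(score_matches("Order 555-212-1234 shipped today"))
```

The first call finds "phone" and "number" inside the window and raises the score; the second keeps the weak base score, so a downstream threshold could discard it.
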
Acceptance Criteria:
- Research cost-effective approaches (i.e. ones that run on CPU) for improving overall accuracy, and specifically context awareness for identified patterns.
- The new mechanism, possibly implemented as an extended ContextAwareEnhancer, can evaluate the tokens surrounding a potential PII match and adjust its confidence based on any of the following:
  - Presence of context keywords (configurable per entity type).
  - Embedding-based semantic similarity of surrounding words.
  - Optional use of a lightweight ML classifier for disambiguation.
 - The solution is evaluated on a dataset and shows significant improvement over the rule-based baseline.
 - The mechanism can be applied to enhance results from regex-based recognizers and be customized for specific needs, languages, or entities. Moreover, it should allow users to define high-recall / low-precision patterns (e.g. any number) and filter out false positives based on context.
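
As a rough illustration of the last criterion, the sketch below pairs a deliberately broad pattern (any 1 to 3 digit number) with a context filter. Everything here is hypothetical and independent of Presidio: `ENTITY_CONFIG`, `detect`, the keyword list, and the threshold are assumptions, and the bag-of-words cosine is only a stand-in for real embedding similarity, which would need an actual embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical per-entity configuration: a broad, high-recall pattern
# plus the context cues used to filter out its false positives.
ENTITY_CONFIG = {
    "AGE": {
        "pattern": re.compile(r"\b\d{1,3}\b"),
        "keywords": {"age", "aged", "old", "turned", "birthday"},
        "reference": "he turned years old on his birthday",
    },
}


def tokenize(text):
    """Lowercase alphabetic tokens only (digits are dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens, keywords):
    """1.0 if any context keyword is present, else 0.0."""
    return 1.0 if any(t in keywords for t in tokens) else 0.0


def cosine_bow(tokens, reference_tokens):
    """Bag-of-words cosine similarity; a toy stand-in for embeddings."""
    a, b = Counter(tokens), Counter(reference_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def detect(text, entity, window=4, threshold=0.5):
    """Keep only matches whose surrounding context scores >= threshold."""
    cfg = ENTITY_CONFIG[entity]
    reference = tokenize(cfg["reference"])
    kept = []
    for m in cfg["pattern"].finditer(text):
        context = (tokenize(text[:m.start()])[-window:]
                   + tokenize(text[m.end():])[:window])
        # Take the stronger of the two context signals.
        score = max(keyword_score(context, cfg["keywords"]),
                    cosine_bow(context, reference))
        if score >= threshold:
            kept.append((m.group(), round(score, 2)))
    return kept


print(detect("He just turned 6", "AGE"))
print(detect("Please buy 6 apples", "AGE"))
```

Taking the maximum of the two signals is one possible design choice; a weighted combination or a small trained classifier over the same context window would fit the same interface.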