Improved context awareness using ML - Embeddings, ML classifiers or other non-rule-based approaches

# User Story: Enhanced Context-Aware PII Detection

## Improve PII detection accuracy using contextual information around matched patterns

As a developer or data privacy engineer using Microsoft Presidio, I want a mechanism that enhances PII detection by analyzing the textual context surrounding a detected entity (e.g., a regex match), so that false positives are reduced and weak matches can be validated or promoted using linguistic cues such as nearby keywords, semantic meaning, or lightweight machine learning.


### Description / Acceptance Criteria

#### Background
Presidio currently supports context words for recognizers and uses regex or NER-based models to detect PII. However, many PII types (e.g., ages, credit cards, dates, ZIP codes) often appear in ambiguous formats or without strong patterns. Simple regexes may generate false positives or miss entities if context isn't considered. For example:

- "555-212-1234" is likely a phone number only if a word such as "phone" or "number" appears nearby.

- "He just turned 6" should indicate that 6 is likely an age and not an arbitrary single-digit number.
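
The two examples above can be sketched with a simple keyword window. This is a minimal, standard-library-only illustration and not Presidio's API: the `CONTEXT_KEYWORDS` table, the `boost_with_context` helper, the window size, and the score values are all assumptions made for the example.

```python
import re

# Hypothetical per-entity context keywords (not Presidio's built-in lists).
CONTEXT_KEYWORDS = {
    "PHONE_NUMBER": {"phone", "number", "call", "mobile"},
    "AGE": {"turned", "aged", "age", "years", "old"},
}


def boost_with_context(text, start, end, entity_type, base_score,
                       window=5, boost=0.35):
    """Raise base_score if a context keyword occurs within `window` words of the match."""
    before = re.findall(r"[a-z]+", text[:start].lower())[-window:]
    after = re.findall(r"[a-z]+", text[end:].lower())[:window]
    hits = CONTEXT_KEYWORDS.get(entity_type, set()) & set(before + after)
    return min(1.0, base_score + boost) if hits else base_score


def score_first_match(text, pattern, entity_type, base_score=0.2):
    """Score the first regex match in `text`, or return None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return boost_with_context(text, match.start(), match.end(),
                              entity_type, base_score)


AGE_PATTERN = r"\b\d{1,3}\b"               # deliberately weak: any short number
PHONE_PATTERN = r"\b\d{3}-\d{3}-\d{4}\b"

age_with_context = score_first_match("He just turned 6", AGE_PATTERN, "AGE")
age_without_context = score_first_match("Aisle 6 is closed", AGE_PATTERN, "AGE")
phone_with_context = score_first_match(
    "My phone is 555-212-1234", PHONE_PATTERN, "PHONE_NUMBER")
phone_without_context = score_first_match(
    "Order 555-212-1234 shipped", PHONE_PATTERN, "PHONE_NUMBER")
```

Here the same weak pattern receives a higher score only when a supporting word is nearby. Exact keyword matching is brittle (for example, "birthday" would be missed), which is the gap that semantic similarity or a classifier is meant to close.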

#### Acceptance Criteria:
- Research cost-effective approaches (i.e., ones that run on CPU) for improving overall accuracy and, specifically, context awareness for identified patterns.
- The new mechanism, possibly implemented as an extended `ContextAwareEnhancer`, can evaluate the surrounding tokens of a potential PII match and adjust its confidence based on any of the following:
  - Presence of context keywords (configurable per entity type).
  - Embedding-based semantic similarity of surrounding words.
  - Optional use of a lightweight ML classifier for disambiguation.
- The solution is evaluated on a dataset and shows significant improvement over the rule-based baseline.
- The mechanism can be applied to enhance results from regex-based recognizers and can be customized for specific needs, languages, and entities. Moreover, it should allow users to create high-recall / low-precision patterns (e.g., any number) and filter out false positives based on context.
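
As a rough illustration of the last criterion combined with the embedding-based option, the sketch below pairs a high-recall pattern (any one- to three-digit number) with a context-similarity filter. Everything here is hypothetical: the three-dimensional `TOY_VECTORS` stand in for real word embeddings, and `AGE_PROTOTYPE`, `context_similarity`, `find_ages`, and the threshold are invented for the example rather than taken from Presidio.

```python
import math
import re

# Toy 3-dimensional "embeddings", invented for illustration only.
# A real implementation would load vectors from an embedding model.
TOY_VECTORS = {
    "turned": (0.9, 0.1, 0.0),
    "birthday": (0.8, 0.2, 0.1),
    "old": (0.9, 0.0, 0.1),
    "aisle": (0.0, 0.9, 0.2),
    "page": (0.1, 0.8, 0.3),
}
AGE_PROTOTYPE = (1.0, 0.0, 0.0)  # hypothetical direction for "age-like" context


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_similarity(text, start, end, window=4):
    """Mean-pool the toy vectors of nearby known words and compare to the prototype."""
    words = (re.findall(r"[a-z]+", text[:start].lower())[-window:]
             + re.findall(r"[a-z]+", text[end:].lower())[:window])
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return 0.0
    mean = tuple(sum(component) / len(vectors) for component in zip(*vectors))
    return cosine(mean, AGE_PROTOTYPE)


def find_ages(text, threshold=0.7):
    """Match any short number, then keep only matches whose context looks age-like."""
    results = []
    for match in re.finditer(r"\b\d{1,3}\b", text):
        score = context_similarity(text, match.start(), match.end())
        if score >= threshold:
            results.append((match.group(), score))
    return results


kept = find_ages("He just turned 6")
dropped = find_ages("See page 6 in aisle 12")
```

With these toy vectors, the number in "He just turned 6" is kept, while both numbers in "See page 6 in aisle 12" are filtered out. The scoring function is the piece a lightweight classifier could replace without changing the surrounding pattern-then-filter flow.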

Improved context awareness using ML - Embeddings, ML classifiers or other non-rule-based approaches #1686
