Improved context awareness using ML - Embeddings, ML classifiers or other non-rule-based approaches #1686

@omri374

User Story: Enhanced Context-Aware PII Detection

Improve PII detection accuracy using contextual information around matched patterns

As a developer or data privacy engineer using Microsoft Presidio, I want a mechanism that enhances PII detection by analyzing the surrounding textual context of a detected entity (e.g., regex match), so that false positives are reduced and weak matches can be validated or promoted using linguistic cues like nearby keywords, semantic meaning, or lightweight machine learning.

Description / Acceptance Criteria

Background

Presidio currently supports context words for recognizers and uses regex or NER-based models to detect PII. However, many PII types (e.g., ages, credit cards, dates, ZIP codes) often appear in ambiguous formats or without strong patterns. Simple regexes may generate false positives or miss entities if context isn't considered. For example:

  • 555-212-1234 is likely a phone number only if a word such as "phone" or "number" appears nearby.

  • "He just turned 6" should indicate that 6 is likely an age and not an arbitrary single-digit number.
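The phone number example can be sketched in a few lines of plain Python. This is only an illustration of the keyword-in-window idea, not Presidio's implementation: the helper names (`context_window`, `adjust_score`), the keyword set, the window size and the boost value are all assumptions made for the example.

```python
import re


def context_window(text, start, end, n_words=5):
    """Return the lowercase words found up to n_words before and after the span [start, end)."""
    before = re.findall(r"[a-z']+", text[:start].lower())[-n_words:]
    after = re.findall(r"[a-z']+", text[end:].lower())[:n_words]
    return before + after


def adjust_score(text, start, end, score, keywords, boost=0.35, n_words=5):
    """Raise the score (capped at 1.0) when any context keyword appears near the match."""
    words = context_window(text, start, end, n_words)
    if any(word in keywords for word in words):
        return min(1.0, score + boost)
    return score


# Hypothetical pattern and keyword list, chosen only for this example.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
PHONE_KEYWORDS = {"phone", "number", "call"}

for sample in ("My phone number is 555-212-1234.", "Order 555-212-1234 has shipped."):
    match = PHONE.search(sample)
    print(sample, adjust_score(sample, match.start(), match.end(), 0.4, PHONE_KEYWORDS))
```

The first sentence gets a higher score because "phone" and "number" fall inside the window; the second keeps its original weak score because no keyword is nearby.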

Acceptance Criteria:

  • Research cost-effective approaches (i.e. ones that run on CPU) for improving overall accuracy, and specifically context awareness, for identified patterns.
  • The new mechanism, possibly implemented as an extended ContextAwareEnhancer, can evaluate the tokens surrounding a potential PII match and adjust its confidence based on any of:
    • Presence of context keywords (configurable per entity type).
    • Embedding-based semantic similarity of surrounding words.
    • Optional use of a lightweight ML classifier for disambiguation.
  • The solution is evaluated on a dataset and shows significant improvement over the rule-based baseline.
  • The mechanism can be applied to enhance results from regex-based recognizers and can be customized for specific needs / languages / entities. Moreover, it should allow users to create high-recall / low-precision patterns (e.g. any number) and filter out false positives based on context.
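One possible shape for such a mechanism is a high-recall pattern whose matches are filtered by a pluggable context scorer. The sketch below is self-contained and does not use Presidio's API. Every name in it (`keyword_scorer`, `embedding_scorer`, `filter_candidates`, `TOY_VECTORS`) is hypothetical, and the three-dimensional vectors are hand-written stand-ins for real embeddings, which would come from a pretrained model.

```python
import math
import re
from typing import Callable, Dict, List, Tuple

# Toy "embeddings", written by hand for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "age": [1.0, 0.0, 0.0],
    "turned": [0.9, 0.1, 0.0],
    "old": [0.8, 0.2, 0.0],
    "bought": [0.0, 1.0, 0.0],
    "apples": [0.0, 0.9, 0.1],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_scorer(words: List[str], keywords=frozenset({"age", "turned", "old"})) -> float:
    """Score 1.0 if any context word is a configured keyword, else 0.0."""
    return 1.0 if any(w in keywords for w in words) else 0.0


def embedding_scorer(words: List[str], anchor: str = "age") -> float:
    """Highest cosine similarity between a known context word and the anchor word."""
    sims = [cosine(TOY_VECTORS[w], TOY_VECTORS[anchor]) for w in words if w in TOY_VECTORS]
    return max(sims, default=0.0)


def filter_candidates(
    text: str,
    pattern: "re.Pattern[str]",
    scorer: Callable[[List[str]], float],
    threshold: float = 0.5,
    n_words: int = 4,
) -> List[Tuple[str, int, int]]:
    """Keep only the pattern matches whose surrounding words score at or above the threshold."""
    kept = []
    for m in pattern.finditer(text):
        before = re.findall(r"[a-z]+", text[: m.start()].lower())[-n_words:]
        after = re.findall(r"[a-z]+", text[m.end():].lower())[:n_words]
        if scorer(before + after) >= threshold:
            kept.append((m.group(), m.start(), m.end()))
    return kept


# High-recall / low-precision pattern: any number of one to three digits.
ANY_NUMBER = re.compile(r"\b\d{1,3}\b")

print(filter_candidates("He just turned 6", ANY_NUMBER, embedding_scorer))
print(filter_candidates("She bought 6 apples", ANY_NUMBER, embedding_scorer))
```

Because the scorer is just a callable, a lightweight ML classifier could be plugged in the same way, as a function that maps the context words to a probability.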
