Catch false positives / false negatives during CI #1639

@omri374

Description

Is your feature request related to a problem? Please describe.

As new predefined recognizers are proposed, the CI should verify that their logic does not introduce side effects on other parts of the codebase, by running an evaluation on a synthetic dataset and measuring precision and recall.

Describe the solution you'd like
During CI, a synthetic dataset is created (based on the presidio-evaluator data generator logic), and precision/recall values are measured.
For every new contributed recognizer, the contributor would have to add the logic for creating synthetic samples for that entity, as sketched below.
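
A minimal sketch of what such a contribution could look like: a couple of templates plus a custom Faker provider for the new entity. The `MEDICAL_LICENSE` entity, the `MedicalLicenseProvider` class, and the simple `str.replace` template filling are illustrative assumptions, not the presidio-evaluator API (which also records the gold spans while filling templates).

```python
from faker import Faker
from faker.providers import BaseProvider


class MedicalLicenseProvider(BaseProvider):
    """Hypothetical provider for a new MEDICAL_LICENSE entity (assumption, not an existing provider)."""

    def medical_license(self) -> str:
        # Two letters followed by six digits, e.g. "AB123456"
        return self.bothify(text="??######").upper()


fake = Faker()
fake.add_provider(MedicalLicenseProvider)

# Templates reference the entity by placeholder; the data generator would fill
# them in and keep track of the gold spans for the evaluation step.
templates = [
    "My medical license number is {{medical_license}}",
    "License {{medical_license}} was issued last year",
]

samples = [t.replace("{{medical_license}}", fake.medical_license()) for t in templates]
```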

High-level flow:

  • A contributor adds a new recognizer via PR
  • The contributor adds a few templates containing the entity (e.g. "My name is {{name}}") and a Faker provider that generates this value, if one doesn't exist
  • The synthetic data is generated during CI and precision / recall values are collected
  • CI fails if the metrics drop significantly (see the sketch after this list).
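
A rough sketch of what the CI gate could look like, assuming each synthetic sample carries its gold spans: run the `AnalyzerEngine` over the samples, score predictions with exact span matching, and fail the build below a fixed floor. The thresholds, the `gold` structure, and the exact-match scoring are simplifying assumptions; presidio-evaluator's evaluation logic is more nuanced (token/span level scoring).

```python
import sys
from presidio_analyzer import AnalyzerEngine

PRECISION_FLOOR = 0.85  # assumed thresholds; real values would be tuned per entity
RECALL_FLOOR = 0.85

analyzer = AnalyzerEngine()

# Each synthetic sample: (text, [(entity_type, start, end), ...]) gold spans.
gold = [
    ("My name is David Johnson", [("PERSON", 11, 24)]),
]

tp = fp = fn = 0
for text, spans in gold:
    predicted = {
        (r.entity_type, r.start, r.end)
        for r in analyzer.analyze(text=text, language="en")
    }
    expected = set(spans)
    tp += len(predicted & expected)
    fp += len(predicted - expected)
    fn += len(expected - predicted)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.3f} recall={recall:.3f}")

if precision < PRECISION_FLOOR or recall < RECALL_FLOOR:
    sys.exit("Metrics dropped below the configured floor; failing CI.")
```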

Describe alternatives you've considered
Manually estimating the side effects of predefined recognizers (the current practice).

Additional context
https://github.com/microsoft/presidio-research/blob/master/notebooks/1_Generate_data.ipynb
