LemmaContextAwareEnhancer: per-entity-type context words


Hi all,

First of all, thank you for the great work on Presidio! 🙏
I have a question regarding the use of different context words per entity type when using the TransformersRecognizer.

**The Problem**
Currently, `LemmaContextAwareEnhancer` expects `recognizer.context` to be a flat list of words.
This means that the same context words are applied to all entities supported by that recognizer:

```
supportive_context_word = self._find_supportive_word_in_context(
    surrounding_words,
    recognizer.context
)
```
For example, if I build one recognizer with multiple entity types:
```
fin_rec = TransformersRecognizer(
    model=model,
    supported_entities=["ACCOUNTNUMBER", "CREDIT_CARD", "IBAN_CODE", "US_BANK_NUMBER", "US_ITIN"],
    supported_language=supported_language,
    pipeline=shared_pipeline,
)

fin_rec.context = [
    "account", "account number", "bank account", "checking account", "savings account",
    "credit card", "card number", "visa", "mastercard", "amex", "payment", "billing",
    "iban", "bank code", "swift", "routing",
    "tax id", "itin", "individual taxpayer", "taxpayer identification",
]
```

With this setup:

- Context words like "account" could accidentally boost `US_ITIN` detections.
- Context words like "tax id" could boost `ACCOUNTNUMBER` detections.
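
To make the goal concrete, here is a minimal sketch of the behavior I'm after, in plain Python with no Presidio calls. `CONTEXT_BY_ENTITY` and `find_supportive_word` are hypothetical names for illustration only, not part of Presidio, and the matching is a simple case-insensitive token comparison on single words rather than the lemma-based matching the enhancer performs:

```python
from typing import Dict, List, Optional

# Hypothetical mapping (not a Presidio API): each entity type has its own context words.
CONTEXT_BY_ENTITY: Dict[str, List[str]] = {
    "ACCOUNTNUMBER": ["account", "checking", "savings"],
    "CREDIT_CARD": ["card", "visa", "mastercard", "amex"],
    "IBAN_CODE": ["iban", "swift"],
    "US_BANK_NUMBER": ["routing", "bank"],
    "US_ITIN": ["itin", "taxpayer", "tax"],
}


def find_supportive_word(entity_type: str, surrounding_words: List[str]) -> Optional[str]:
    """Return the first surrounding word that is a context word for this entity type."""
    # Look up only the context words registered for the detected entity type.
    context = {w.lower() for w in CONTEXT_BY_ENTITY.get(entity_type, [])}
    for word in surrounding_words:
        if word.lower() in context:
            return word
    return None


surrounding = ["my", "account", "is", "listed", "below"]
print(find_supportive_word("ACCOUNTNUMBER", surrounding))  # account
print(find_supportive_word("US_ITIN", surrounding))        # None
```

In other words, "account" would support an `ACCOUNTNUMBER` result but leave a `US_ITIN` result in the same sentence unboosted.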

Is there already a mechanism to achieve per-entity context words?

Thanks!!!

LemmaContextAwareEnhancer: per-entity-type context words #1711