LemmaContextAwareEnhancer: per-entity-type context words #1711

@rotemvo

Hi all,

First of all, thank you for the great work on Presidio! 🙏
I have a question regarding the use of different context words per entity type when using the TransformersRecognizer.

The Problem
Currently, LemmaContextAwareEnhancer expects recognizer.context to be a flat list of words.
This means that the same context words are applied to all entities supported by that recognizer:

supportive_context_word = self._find_supportive_word_in_context(
    surrounding_words,
    recognizer.context
)
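
To make the behavior concrete, here is a simplified stand-in for that lookup. It is my own sketch, not Presidio's actual code: `find_supportive_word` is a hypothetical name, and I am assuming a substring-style match against already lemmatized, lowercased words. The point is that the lookup never receives the entity type, so every entity shares one list:

```python
from typing import List


def find_supportive_word(surrounding_words: List[str], context_words: List[str]) -> str:
    """Return the first context word found among the surrounding words, else ""."""
    for context_word in context_words:
        # Assumed matching rule: the context word appears inside a surrounding word.
        if any(context_word in word for word in surrounding_words):
            return context_word
    return ""


flat_context = ["account", "iban", "itin"]
surrounding = ["my", "account", "be", "123456789"]

# The call is identical whichever entity type was detected, so an
# ACCOUNTNUMBER result and a US_ITIN result would both be boosted here.
print(find_supportive_word(surrounding, flat_context))  # -> account
```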

For example, if I build one recognizer with multiple entity types:

fin_rec = TransformersRecognizer(
    model=model,
    supported_entities=["ACCOUNTNUMBER", "CREDIT_CARD", "IBAN_CODE", "US_BANK_NUMBER", "US_ITIN"],
    supported_language=supported_language,
    pipeline=shared_pipeline, 
)

fin_rec.context = [
    "account", "account number", "bank account", "checking account", "savings account",
    "credit card", "card number", "visa", "mastercard", "amex", "payment", "billing",
    "iban", "bank code", "swift", "routing",
    "tax id", "itin", "individual taxpayer", "taxpayer identification",
]

With this setup, context words leak across entity types:
Words like "account" could accidentally boost "US_ITIN" detections.
Words like "tax id" could accidentally boost "ACCOUNTNUMBER" detections.

Is there already a mechanism to achieve per-entity context words?
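
Something like the following is what I have in mind. It is a minimal sketch in plain Python: `CONTEXT_BY_ENTITY` and `supportive_word_for` are hypothetical names of mine, not existing Presidio APIs, and the matching rule is an assumption. The lookup is keyed by the detected entity type, so each entity is boosted only by its own words:

```python
from typing import Dict, List

# Hypothetical per-entity mapping (not an existing Presidio attribute).
CONTEXT_BY_ENTITY: Dict[str, List[str]] = {
    "ACCOUNTNUMBER": ["account", "account number", "bank account"],
    "CREDIT_CARD": ["credit card", "card number", "visa", "mastercard", "amex"],
    "IBAN_CODE": ["iban", "bank code", "swift"],
    "US_ITIN": ["tax id", "itin", "individual taxpayer"],
}


def supportive_word_for(entity_type: str, surrounding_words: List[str]) -> str:
    """Return the first context word for this entity type found nearby, else ""."""
    for context_word in CONTEXT_BY_ENTITY.get(entity_type, []):
        # Assumed matching rule: the context word appears inside a surrounding word.
        if any(context_word in word for word in surrounding_words):
            return context_word
    return ""


surrounding = ["my", "account", "be", "123456789"]

print(supportive_word_for("ACCOUNTNUMBER", surrounding))  # -> account
print(repr(supportive_word_for("US_ITIN", surrounding)))  # -> ''
```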

Thanks!!!
