Hi all,
First of all, thank you for the great work on Presidio! 🙏
I have a question regarding the use of different context words per entity type when using the TransformersRecognizer.
The Problem
Currently, LemmaContextAwareEnhancer expects recognizer.context to be a flat list of words.
This means that the same context words are applied to all entities supported by that recognizer:
    supportive_context_word = self._find_supportive_word_in_context(
        surrounding_words,
        recognizer.context
    )
For example, if I build one recognizer with multiple entity types:
    fin_rec = TransformersRecognizer(
        model=model,
        supported_entities=["ACCOUNTNUMBER", "CREDIT_CARD", "IBAN_CODE", "US_BANK_NUMBER", "US_ITIN"],
        supported_language=supported_language,
        pipeline=shared_pipeline,
    )
    fin_rec.context = [
        "account", "account number", "bank account", "checking account", "savings account",
        "credit card", "card number", "visa", "mastercard", "amex", "payment", "billing",
        "iban", "bank code", "swift", "routing",
        "tax id", "itin", "individual taxpayer", "taxpayer identification",
    ]
With this setup:
Context words like "account" could accidentally boost "US_ITIN" detections.
Context words like "tax id" could boost "ACCOUNTNUMBER".
Is there already a mechanism to achieve per-entity context words?
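To make the question concrete, here is a minimal sketch of the behaviour I have in mind. It does not use Presidio at all: `CONTEXT_BY_ENTITY`, `context_for`, and `find_supportive_word` are hypothetical names, the word-to-entity split is my own guess based on the list above, and the matching is a simplified stand-in for the enhancer's lemma-based lookup, not its actual implementation.

```python
from typing import Dict, List

# Hypothetical per-entity mapping (not an existing Presidio attribute).
# The grouping is an assumption; US_BANK_NUMBER is deliberately left out
# to show the fallback to the flat list.
CONTEXT_BY_ENTITY: Dict[str, List[str]] = {
    "ACCOUNTNUMBER": ["account", "account number", "bank account"],
    "CREDIT_CARD": ["credit card", "card number", "visa", "mastercard"],
    "IBAN_CODE": ["iban", "bank code", "swift", "routing"],
    "US_ITIN": ["tax id", "itin", "individual taxpayer"],
}

# The flat list that a recognizer carries today: every word for every entity.
FLAT_CONTEXT: List[str] = [
    word for words in CONTEXT_BY_ENTITY.values() for word in words
]


def context_for(
    entity_type: str, per_entity: Dict[str, List[str]], fallback: List[str]
) -> List[str]:
    """Return the context words for one entity type, or the flat list if unmapped."""
    return per_entity.get(entity_type, fallback)


def find_supportive_word(
    surrounding_words: List[str], context_words: List[str]
) -> str:
    """Simplified stand-in: exact, case-insensitive token match; '' if none."""
    surrounding = {word.lower() for word in surrounding_words}
    for word in context_words:
        if word.lower() in surrounding:
            return word
    return ""


if __name__ == "__main__":
    words = ["my", "account", "is"]
    # Flat list: "account" supports any entity, including US_ITIN.
    print(repr(find_supportive_word(words, FLAT_CONTEXT)))
    # Per-entity lookup: "account" no longer supports US_ITIN.
    itin_context = context_for("US_ITIN", CONTEXT_BY_ENTITY, FLAT_CONTEXT)
    print(repr(find_supportive_word(words, itin_context)))
```

With the flat list, "account" is found no matter which entity was detected. With the per-entity lookup, it is only found for "ACCOUNTNUMBER" (and for unmapped entities that fall back to the flat list).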
Thanks!!!