SPARKNLP-1293 Enhancements EntityRuler and DocumentNormalizer #14674
Description
Extend EntityRulerModel and DocumentNormalizer with Auto Modes and Preset Patterns
🚀 Summary of Changes
1️⃣ EntityRulerModel Enhancements
🆕 New Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| autoMode | String | None | Name of a built-in auto mode, e.g. "communication_entities" or "network_entities". |
| extractEntities | Array[String] | [] | … autoMode or JSON are used. |

🧱 Available Built-in Auto Modes
| Auto Mode | Patterns |
|---|---|
| network_entities | IPV4_PATTERN, IPV6_PATTERN, IP_ADDRESS_PATTERN, IP_ADDRESS_NAME_PATTERN |
| email_entities | EMAIL_ADDRESS_PATTERN, EMAIL_DATETIMETZ_PATTERN, MAPI_ID_PATTERN |
| communication_entities | EMAIL_ADDRESS_PATTERN, US_PHONE_NUMBERS_PATTERN |
| media_entities | IMAGE_URL_PATTERN |
| all_entities | … |

⚙️ Behavior Updates
- … autoMode is provided.
- Auto mode names are case-insensitive ("COMMUNICATION_ENTITIES" works).

2️⃣ DocumentNormalizer Enhancements
🆕 New Parameters
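| Parameter | Type | Default | Description |
|---|---|---|---|
| presetPattern | String | None | Name of a preset pattern to apply (e.g. "CLEAN_BULLETS", "CLEAN_DASHES"). |
| autoMode | String | None | Name of a built-in auto mode: "html_clean", "light_clean", or "full_auto". |

🧱 Available Auto Modes
| Auto Mode | Cleaners |
|---|---|
| light_clean | cleanExtraWhitespace, cleanTrailingPunctuation |
| document_clean | cleanBullets, cleanOrderedBullets, cleanDashes, cleanExtraWhitespace |
| social_clean | removePunctuation, cleanDashes, cleanExtraWhitespace |
| html_clean | replaceUnicodeCharacters, cleanNonAsciiChars |
| full_auto | … |

⚙️ Behavior Updates
- … autoMode (case-insensitive).
- … autoMode when both presetPattern and autoMode are set.

Motivation and Context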
Both EntityRulerModel and DocumentNormalizer were powerful but required manual configuration for common text extraction or cleaning patterns. Users had to define or load pattern JSONs or manually build regex rules, even for well-known cases like emails, IP addresses, or HTML tag cleaning.
To improve usability, reproducibility, and out-of-the-box convenience, this PR introduces automatic preset modes and functional pattern selectors.
These allow users to easily apply common entity extraction or normalization behaviors without additional configuration.
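
As a rough illustration of the idea (not the Spark NLP implementation), the auto modes can be thought of as a case-insensitive lookup from a mode name to a list of preset names. The sketch below is plain Python with no Spark dependency. The mode and preset names are copied from the tables in this description; the helper names `resolve_auto_mode` and `extract`, and the two regexes in `TOY_REGEXES`, are hypothetical stand-ins and are almost certainly simpler than the patterns shipped in the library. `all_entities` and `full_auto` are left out because their contents are not listed here.

```python
import re

# Auto mode -> pattern names, as listed for EntityRulerModel in this PR.
ENTITY_AUTO_MODES = {
    "network_entities": [
        "IPV4_PATTERN", "IPV6_PATTERN",
        "IP_ADDRESS_PATTERN", "IP_ADDRESS_NAME_PATTERN",
    ],
    "email_entities": [
        "EMAIL_ADDRESS_PATTERN", "EMAIL_DATETIMETZ_PATTERN", "MAPI_ID_PATTERN",
    ],
    "communication_entities": [
        "EMAIL_ADDRESS_PATTERN", "US_PHONE_NUMBERS_PATTERN",
    ],
    "media_entities": ["IMAGE_URL_PATTERN"],
}

# Auto mode -> cleaner names, as listed for DocumentNormalizer in this PR.
NORMALIZER_AUTO_MODES = {
    "light_clean": ["cleanExtraWhitespace", "cleanTrailingPunctuation"],
    "document_clean": [
        "cleanBullets", "cleanOrderedBullets",
        "cleanDashes", "cleanExtraWhitespace",
    ],
    "social_clean": ["removePunctuation", "cleanDashes", "cleanExtraWhitespace"],
    "html_clean": ["replaceUnicodeCharacters", "cleanNonAsciiChars"],
}

# ASSUMPTION: simplified stand-in regexes for illustration only.
# They are not the patterns used by Spark NLP.
TOY_REGEXES = {
    "EMAIL_ADDRESS_PATTERN": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "IPV4_PATTERN": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def resolve_auto_mode(registry, auto_mode):
    """Return the preset names for auto_mode, matching case-insensitively."""
    key = auto_mode.strip().lower()
    if key not in registry:
        raise ValueError(f"Unknown auto mode: {auto_mode!r}")
    return list(registry[key])


def extract(text, auto_mode):
    """Run every toy regex selected by auto_mode over text."""
    found = []
    for name in resolve_auto_mode(ENTITY_AUTO_MODES, auto_mode):
        regex = TOY_REGEXES.get(name)
        if regex is None:
            continue  # this sketch defines no stand-in regex for the pattern
        for match in re.finditer(regex, text):
            found.append((name, match.group()))
    return found


if __name__ == "__main__":
    sample = "Mail bob@example.com from 10.0.0.1"
    print(extract(sample, "COMMUNICATION_ENTITIES"))
    print(extract(sample, "network_entities"))
    print(resolve_auto_mode(NORMALIZER_AUTO_MODES, "Light_Clean"))
```

Keeping the mode tables as plain data makes the case-insensitive behavior a single normalization step in the lookup, and an unknown mode fails loudly instead of silently extracting nothing.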
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: