SPARKNLP-1293 Enhancements EntityRuler and DocumentNormalizer #14674

danilojsl · 2025-10-17T15:51:34Z

Description

Extend EntityRulerModel and DocumentNormalizer with Auto Modes and Preset Patterns

🚀 Summary of Changes

1️⃣ EntityRulerModel Enhancements

🆕 New Parameters

Parameter	Type	Default	Description
`autoMode`	`String`	`None`	Enables predefined regex groups such as `"communication_entities"` or `"network_entities"`.
`extractEntities`	`Array[String]`	`[]`	Restricts extraction to the listed entities. If empty, all entities defined by the current `autoMode` or the loaded JSON patterns are extracted.

🧱 Available Built-in Auto Modes

Auto Mode	Entities Included
`network_entities`	`IPV4_PATTERN`, `IPV6_PATTERN`, `IP_ADDRESS_PATTERN`, `IP_ADDRESS_NAME_PATTERN`
`email_entities`	`EMAIL_ADDRESS_PATTERN`, `EMAIL_DATETIMETZ_PATTERN`, `MAPI_ID_PATTERN`
`communication_entities`	`EMAIL_ADDRESS_PATTERN`, `US_PHONE_NUMBERS_PATTERN`
`media_entities`	`IMAGE_URL_PATTERN`
`all_entities`	All of the above

⚙️ Behavior Updates

Automatically selects regex presets when autoMode is provided.
Supports case-insensitive mode names (e.g. "COMMUNICATION_ENTITIES" works).
Safely falls back to regex-only mode if no automaton is defined.
Maintains backward compatibility with regex JSONs and RocksDB rules.
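
To illustrate the selection logic described above, here is a minimal plain-Python sketch. It is not the Scala implementation in this PR: the `resolve_entities` helper and its error handling for unknown modes are hypothetical, and only the mode and entity names come from the table above. It shows case-insensitive mode lookup combined with an optional `extractEntities`-style filter.

```python
from typing import Dict, List, Sequence

# Entity groups copied from the "Available Built-in Auto Modes" table.
AUTO_MODES: Dict[str, List[str]] = {
    "network_entities": [
        "IPV4_PATTERN",
        "IPV6_PATTERN",
        "IP_ADDRESS_PATTERN",
        "IP_ADDRESS_NAME_PATTERN",
    ],
    "email_entities": [
        "EMAIL_ADDRESS_PATTERN",
        "EMAIL_DATETIMETZ_PATTERN",
        "MAPI_ID_PATTERN",
    ],
    "communication_entities": [
        "EMAIL_ADDRESS_PATTERN",
        "US_PHONE_NUMBERS_PATTERN",
    ],
    "media_entities": ["IMAGE_URL_PATTERN"],
}


def _all_entities() -> List[str]:
    """Union of every group, de-duplicated, in first-seen order."""
    seen: List[str] = []
    for names in AUTO_MODES.values():
        for name in names:
            if name not in seen:
                seen.append(name)
    return seen


def resolve_entities(auto_mode: str, extract_entities: Sequence[str] = ()) -> List[str]:
    """Hypothetical helper: return the entity names active for a mode.

    Mode names are matched case-insensitively. An empty filter keeps every
    entity of the mode; otherwise only the listed entities are kept.
    Raising on an unknown mode is an assumption of this sketch.
    """
    key = auto_mode.strip().lower()
    if key == "all_entities":
        candidates = _all_entities()
    elif key in AUTO_MODES:
        candidates = AUTO_MODES[key]
    else:
        raise ValueError(f"Unknown auto mode: {auto_mode!r}")

    if not extract_entities:
        return list(candidates)
    wanted = set(extract_entities)
    return [name for name in candidates if name in wanted]
```

For example, `resolve_entities("COMMUNICATION_ENTITIES")` resolves the same group as the lowercase name, and `resolve_entities("network_entities", ["IPV4_PATTERN"])` keeps only the IPv4 entity.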

2️⃣ DocumentNormalizer Enhancements

🆕 New Parameters

Parameter	Type	Default	Description
`presetPattern`	`String`	`None`	Selects a single cleaning function preset (e.g. `"CLEAN_BULLETS"`, `"CLEAN_DASHES"`).
`autoMode`	`String`	`None`	Enables predefined cleaning modes like `"html_clean"`, `"light_clean"`, or `"full_auto"`.

🧱 Available Auto Modes

Auto Mode	Functions Applied
`light_clean`	`cleanExtraWhitespace`, `cleanTrailingPunctuation`
`document_clean`	`cleanBullets`, `cleanOrderedBullets`, `cleanDashes`, `cleanExtraWhitespace`
`social_clean`	`removePunctuation`, `cleanDashes`, `cleanExtraWhitespace`
`html_clean`	`replaceUnicodeCharacters`, `cleanNonAsciiChars`
`full_auto`	All functional presets combined

⚙️ Behavior Updates

Normalizes the `autoMode` value so that mode names are matched case-insensitively.
Prioritizes autoMode when both presetPattern and autoMode are set.
Ensures safe fallback to manual normalization if no mode is provided.
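
The precedence rules above can be sketched in plain Python as follows. This is an illustration only, not the code added to `DocumentNormalizer`: the `select_cleaners` helper is hypothetical, it returns function names rather than applying them, and `full_auto` is left out because the full list of presets is not spelled out in this description.

```python
from typing import Dict, List, Optional

# Function groups copied from the "Available Auto Modes" table.
# "full_auto" is intentionally omitted (its exact contents are not listed here).
AUTO_MODES: Dict[str, List[str]] = {
    "light_clean": ["cleanExtraWhitespace", "cleanTrailingPunctuation"],
    "document_clean": [
        "cleanBullets",
        "cleanOrderedBullets",
        "cleanDashes",
        "cleanExtraWhitespace",
    ],
    "social_clean": ["removePunctuation", "cleanDashes", "cleanExtraWhitespace"],
    "html_clean": ["replaceUnicodeCharacters", "cleanNonAsciiChars"],
}


def select_cleaners(
    auto_mode: Optional[str] = None,
    preset_pattern: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: decide which cleaning steps would run.

    1. If auto_mode is set, it wins, even when preset_pattern is also set.
    2. Otherwise a single preset_pattern is used as given.
    3. With neither set, an empty list signals manual normalization.
    Raising on an unknown mode is an assumption of this sketch.
    """
    if auto_mode is not None:
        key = auto_mode.strip().lower()  # case-insensitive mode names
        if key not in AUTO_MODES:
            raise ValueError(f"Unknown auto mode: {auto_mode!r}")
        return list(AUTO_MODES[key])
    if preset_pattern is not None:
        return [preset_pattern]
    return []
```

With both parameters set, `select_cleaners("HTML_CLEAN", "CLEAN_BULLETS")` returns only the `html_clean` functions, which mirrors the stated priority of `autoMode` over `presetPattern`.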

Motivation and Context

Both EntityRulerModel and DocumentNormalizer were powerful but required manual configuration for common text extraction or cleaning patterns.
Users had to define or load pattern JSONs or manually build regex rules, even for well-known cases like emails, IP addresses, or HTML tag cleaning.

To improve usability, reproducibility, and out-of-the-box convenience, this PR introduces automatic preset modes and functional pattern selectors.
These allow users to easily apply common entity extraction or normalization behaviors without additional configuration.

How Has This Been Tested?

Screenshots (if appropriate):

Local Tests
Google Colab

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

coveralls · 2025-10-20T14:55:31Z

Pull Request Test Coverage Report for Build 18652478786

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

114 of 149 (76.51%) changed or added relevant lines in 3 files are covered.
40 unchanged lines in 36 files lost coverage.
Overall coverage increased (+0.2%) to 55.25%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/main/scala/com/johnsnowlabs/nlp/annotators/er/EntityRulerModel.scala	88	102	86.27%
src/main/scala/com/johnsnowlabs/nlp/annotators/DocumentNormalizer.scala	25	46	54.35%

Files with Coverage Reduction	New Missed Lines	%
src/main/scala/com/johnsnowlabs/nlp/annotators/common/Tagged.scala	1	67.82%
src/main/scala/com/johnsnowlabs/nlp/annotators/er/EntityRulerApproach.scala	1	95.65%
src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfModel.scala	1	69.23%
src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach.scala	1	81.03%
src/main/scala/com/johnsnowlabs/nlp/annotators/parser/dep/GreedyTransition/DependencyMaker.scala	1	81.01%
src/main/scala/com/johnsnowlabs/nlp/annotators/pos/perceptron/PerceptronApproachDistributed.scala	1	82.24%
src/main/scala/com/johnsnowlabs/nlp/annotators/spell/symmetric/SymmetricDeleteModel.scala	1	89.55%
src/main/scala/com/johnsnowlabs/nlp/annotators/TextSplitter.scala	1	92.59%
src/main/scala/com/johnsnowlabs/nlp/annotators/tokenizer/bpe/BpeSpecialTokens.scala	1	59.22%
src/main/scala/com/johnsnowlabs/nlp/annotators/tokenizer/moses/MosesTokenizer.scala	1	95.98%

Totals
Change from base Build 18651502542:	0.2%
Covered Lines:	11919
Relevant Lines:	21573

💛 - Coveralls

danilojsl added 3 commits October 15, 2025 14:55

[SPARKNLP-1293] Adding extract entities to EntityRuler

f6d8d8e

[SPARKNLP-1293] Adding auto mode to DocumentNormalizer and EntityRuler

9f71051

[SPARKNLP-1293] Adding unit tests and python wrapper parameters for auto mode feature

6275a0b

danilojsl self-assigned this Oct 17, 2025

danilojsl added the enhancement label Oct 17, 2025

danilojsl marked this pull request as ready for review October 17, 2025 17:07

danilojsl requested a review from DevinTDHa October 17, 2025 17:07

DevinTDHa approved these changes Oct 20, 2025

[SPARKNLP-1293] Add FastTest tag to Scala tests

b08968f

DevinTDHa changed the base branch from master to release/620-release-candidate October 20, 2025 12:47

DevinTDHa merged commit 2e8e3f6 into release/620-release-candidate Oct 20, 2025
4 checks passed

DevinTDHa mentioned this pull request Oct 20, 2025

Spark NLP 6.2.0 Release #14676

Merged

10 tasks

4 participants