@danilojsl
Contributor

Description

Extend EntityRulerModel and DocumentNormalizer with Auto Modes and Preset Patterns

🚀 Summary of Changes

1️⃣ EntityRulerModel Enhancements

🆕 New Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| autoMode | String | None | Enables predefined regex groups such as "communication_entities" or "network_entities". |
| extractEntities | Array[String] | [] | Filters the entities to extract. If empty, all entities for the current autoMode or JSON are used. |

🧱 Available Built-in Auto Modes

| Auto Mode | Entities Included |
|---|---|
| network_entities | IPV4_PATTERN, IPV6_PATTERN, IP_ADDRESS_PATTERN, IP_ADDRESS_NAME_PATTERN |
| email_entities | EMAIL_ADDRESS_PATTERN, EMAIL_DATETIMETZ_PATTERN, MAPI_ID_PATTERN |
| communication_entities | EMAIL_ADDRESS_PATTERN, US_PHONE_NUMBERS_PATTERN |
| media_entities | IMAGE_URL_PATTERN |
| all_entities | All of the above |
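
To make the idea of a preset regex group concrete, here is a toy sketch in plain Python. It does not call Spark NLP, and the two regular expressions are simplified, hypothetical stand-ins keyed by entity names from the table; the patterns actually shipped in this PR may differ. The `extract` helper is also an illustration only.

```python
import re

# Simplified, hypothetical patterns keyed by entity names from the table above.
# They are NOT the patterns shipped in the PR.
TOY_PATTERNS = {
    "IPV4_PATTERN": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL_ADDRESS_PATTERN": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract(text, entity_names):
    """Run each named toy pattern over the text and return (label, match) pairs
    ordered by where the match starts."""
    hits = []
    for name in entity_names:
        for match in TOY_PATTERNS[name].finditer(text):
            hits.append((match.start(), name, match.group()))
    hits.sort()
    return [(name, value) for _, name, value in hits]


sample = "Mail admin@example.com from 192.168.0.1"
print(extract(sample, ["IPV4_PATTERN", "EMAIL_ADDRESS_PATTERN"]))
```

Each match is labeled with the name of the pattern that produced it, which mirrors how a preset group yields one entity type per pattern.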

⚙️ Behavior Updates

  • Automatically selects regex presets when autoMode is provided.
  • Supports case-insensitive mode names (e.g. "COMMUNICATION_ENTITIES" works).
  • Safely falls back to regex-only mode if no automaton is defined.
  • Maintains backward compatibility with regex JSONs and RocksDB rules.
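
The mode selection described above can be sketched as a small lookup. This is a minimal illustration in plain Python, not the Scala implementation from the PR: the mode table is copied from this description, while `resolve_entities` and its choice to raise on an unknown mode are assumptions made for the example.

```python
# Mode table copied from the PR description.
AUTO_MODES = {
    "network_entities": [
        "IPV4_PATTERN", "IPV6_PATTERN",
        "IP_ADDRESS_PATTERN", "IP_ADDRESS_NAME_PATTERN",
    ],
    "email_entities": [
        "EMAIL_ADDRESS_PATTERN", "EMAIL_DATETIMETZ_PATTERN", "MAPI_ID_PATTERN",
    ],
    "communication_entities": ["EMAIL_ADDRESS_PATTERN", "US_PHONE_NUMBERS_PATTERN"],
    "media_entities": ["IMAGE_URL_PATTERN"],
}


def resolve_entities(auto_mode, extract_entities=()):
    """Hypothetical resolver: map an auto mode to entity names, then filter."""
    key = auto_mode.strip().lower()  # mode names are case-insensitive
    if key == "all_entities":
        names = []
        for group in AUTO_MODES.values():
            for name in group:
                if name not in names:  # EMAIL_ADDRESS_PATTERN is in two groups
                    names.append(name)
    elif key in AUTO_MODES:
        names = list(AUTO_MODES[key])
    else:
        # Assumption: the PR text does not say how unknown modes are handled.
        raise ValueError(f"unknown auto mode: {auto_mode!r}")
    if extract_entities:  # an empty filter keeps every entity of the mode
        wanted = set(extract_entities)
        names = [name for name in names if name in wanted]
    return names


print(resolve_entities("COMMUNICATION_ENTITIES"))
print(resolve_entities("network_entities", ["IPV4_PATTERN"]))
```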

2️⃣ DocumentNormalizer Enhancements

🆕 New Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| presetPattern | String | None | Selects a single cleaning function preset (e.g. "CLEAN_BULLETS", "CLEAN_DASHES"). |
| autoMode | String | None | Enables predefined cleaning modes like "html_clean", "light_clean", or "full_auto". |

🧱 Available Auto Modes

| Auto Mode | Functions Applied |
|---|---|
| light_clean | cleanExtraWhitespace, cleanTrailingPunctuation |
| document_clean | cleanBullets, cleanOrderedBullets, cleanDashes, cleanExtraWhitespace |
| social_clean | removePunctuation, cleanDashes, cleanExtraWhitespace |
| html_clean | replaceUnicodeCharacters, cleanNonAsciiChars |
| full_auto | All functional presets combined |

⚙️ Behavior Updates

  • Normalizes autoMode (case-insensitive).
  • Prioritizes autoMode when both presetPattern and autoMode are set.
  • Ensures safe fallback to manual normalization if no mode is provided.
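
The precedence rules above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the PR's Scala code: the three cleaning functions are hypothetical stand-ins that only borrow their names from the table, only the `light_clean` mode and the `CLEAN_DASHES` preset are modeled, and an empty cleaner list stands for "fall back to manual normalization".

```python
import re


# Hypothetical stand-ins for cleaning functions named in the PR description.
def clean_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def clean_trailing_punctuation(text):
    return text.rstrip(".,:;")


def clean_dashes(text):
    return re.sub(r"[-\u2013\u2014]", " ", text)


AUTO_MODES = {"light_clean": [clean_extra_whitespace, clean_trailing_punctuation]}
PRESETS = {"CLEAN_DASHES": clean_dashes}


def select_cleaners(preset_pattern=None, auto_mode=None):
    """Pick the cleaning functions to apply; autoMode wins when both are set."""
    if auto_mode is not None:
        return AUTO_MODES[auto_mode.strip().lower()]  # case-insensitive mode name
    if preset_pattern is not None:
        return [PRESETS[preset_pattern]]
    return []  # nothing selected: the caller keeps its manual normalization


def normalize(text, preset_pattern=None, auto_mode=None):
    for cleaner in select_cleaners(preset_pattern, auto_mode):
        text = cleaner(text)
    return text


print(normalize("Hello   world.", auto_mode="LIGHT_CLEAN"))
print(normalize("a - b.", preset_pattern="CLEAN_DASHES", auto_mode="light_clean"))
```

In the second call both parameters are set, so only the `light_clean` functions run and the dash is left in place.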

Motivation and Context

Both EntityRulerModel and DocumentNormalizer were powerful but required manual configuration for common text extraction or cleaning patterns.
Users had to define or load pattern JSONs or manually build regex rules, even for well-known cases like emails, IP addresses, or HTML tag cleaning.

To improve usability, reproducibility, and out-of-the-box convenience, this PR introduces automatic preset modes and functional pattern selectors.
These allow users to easily apply common entity extraction or normalization behaviors without additional configuration.

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl self-assigned this Oct 17, 2025
@danilojsl danilojsl marked this pull request as ready for review October 17, 2025 17:07
@danilojsl danilojsl requested a review from DevinTDHa October 17, 2025 17:07
@DevinTDHa DevinTDHa changed the base branch from master to release/620-release-candidate October 20, 2025 12:47
@DevinTDHa DevinTDHa merged commit 2e8e3f6 into release/620-release-candidate Oct 20, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Oct 20, 2025
10 tasks
@coveralls

Pull Request Test Coverage Report for Build 18652478786

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 114 of 149 (76.51%) changed or added relevant lines in 3 files are covered.
  • 40 unchanged lines in 36 files lost coverage.
  • Overall coverage increased (+0.2%) to 55.25%
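
The two percentages above follow directly from the line counts given in this report. A quick check in Python, with all figures copied from this comment (`coverage_pct` is just a helper name chosen for the example):

```python
def coverage_pct(covered, relevant):
    """Return coverage as a percentage rounded to two decimals."""
    return round(100.0 * covered / relevant, 2)


changed = coverage_pct(114, 149)      # changed or added lines in this PR
overall = coverage_pct(11919, 21573)  # covered vs. relevant lines overall
print(changed, overall)
```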

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/main/scala/com/johnsnowlabs/nlp/annotators/er/EntityRulerModel.scala | 88 | 102 | 86.27% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DocumentNormalizer.scala | 25 | 46 | 54.35% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/main/scala/com/johnsnowlabs/nlp/annotators/common/Tagged.scala | 1 | 67.82% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/er/EntityRulerApproach.scala | 1 | 95.65% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfModel.scala | 1 | 69.23% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach.scala | 1 | 81.03% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/parser/dep/GreedyTransition/DependencyMaker.scala | 1 | 81.01% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/pos/perceptron/PerceptronApproachDistributed.scala | 1 | 82.24% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/spell/symmetric/SymmetricDeleteModel.scala | 1 | 89.55% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/TextSplitter.scala | 1 | 92.59% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/tokenizer/bpe/BpeSpecialTokens.scala | 1 | 59.22% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/tokenizer/moses/MosesTokenizer.scala | 1 | 95.98% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 18651502542: | 0.2% |
| Covered Lines: | 11919 |
| Relevant Lines: | 21573 |

💛 - Coveralls
