@danilojsl danilojsl commented Sep 30, 2025

Description

This change introduces fault-tolerant processing when encountering malformed XML inputs, allowing the XML reader to continue gracefully rather than failing entirely.

Key changes include:

  • Enhance the XML parsing logic to catch and skip over malformed segments (e.g. mismatched tags, invalid characters) instead of throwing fatal exceptions.
  • Augment unit tests to cover a variety of malformed-XML scenarios (broken tags, missing closures, embedded invalid characters).
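
The skip-and-continue behavior described above can be illustrated with a minimal sketch. This is not the Scala implementation from this PR: it uses Python's standard `xml.etree.ElementTree`, and the `parse_fragments` helper, the per-fragment granularity, and the sample inputs are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def parse_fragments(fragments):
    """Parse each XML fragment independently; skip the malformed ones.

    Hypothetical helper for illustration, not part of Spark NLP.
    """
    parsed = []   # (index, text content) for well-formed fragments
    skipped = []  # (index, parser error message) for malformed fragments
    for index, fragment in enumerate(fragments):
        try:
            root = ET.fromstring(fragment)
        except ET.ParseError as error:
            # Record the failure and keep going instead of aborting the batch.
            skipped.append((index, str(error)))
            continue
        # itertext() yields the text of the element and all its descendants.
        parsed.append((index, "".join(root.itertext()).strip()))
    return parsed, skipped


# Sample inputs (made up): two well-formed fragments, two malformed ones.
fragments = [
    "<note><to>Ana</to><body>Hello</body></note>",  # well-formed
    "<note><to>Ana</body></note>",                  # mismatched tag
    "<note><body>Unclosed",                         # missing closing tags
    "<note>ok</note>",                              # well-formed
]
parsed, skipped = parse_fragments(fragments)
print(parsed)
print([index for index, _ in skipped])
```

Collecting the skipped indices and error messages, rather than discarding them silently, keeps the job running while still leaving a trace of what was dropped. Whether the actual reader reports skipped segments this way is not stated in this PR.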

Motivation and Context

In real-world datasets, XML files are often imperfect: missing closing tags, unexpected characters, truncated fragments, or encoding issues. Without robust handling, a single malformed file or even a small malformed portion can break an entire ingestion pipeline, requiring manual cleanup or pre-validation steps.

With this enhancement, we aim to:

  • Improve resilience: Let pipelines ingest as much valid data as possible, skipping problematic parts without stopping the job.
  • Reduce manual preprocessing: Avoid the need for external validation or repair before feeding XML into Spark NLP.

Overall, this makes the XML reader component more robust, practical, and reliable in messy real-world settings.

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl self-assigned this Sep 30, 2025
@danilojsl danilojsl changed the title [SPARKNLP-1292] Adding fault-tolerance support when reading malformed… [SPARKNLP-1292] Adding fault-tolerance support for malformed XML Oct 7, 2025
@danilojsl danilojsl marked this pull request as ready for review October 7, 2025 20:53
@danilojsl danilojsl requested a review from DevinTDHa October 7, 2025 20:53
@DevinTDHa DevinTDHa changed the base branch from master to release/615-release-candidate October 8, 2025 11:42
…/SPARKNLP-1292-Enhance-Readers-Error-Handling-and-Robustness
@DevinTDHa DevinTDHa merged commit 308cf29 into release/615-release-candidate Oct 8, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Oct 8, 2025
10 tasks
@coveralls

Pull Request Test Coverage Report for Build 18345733030

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 9 of 9 (100.0%) changed or added relevant lines in 1 file are covered.
  • 4 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.02%) to 54.698%

Files with Coverage Reduction | New Missed Lines | %
src/main/scala/com/johnsnowlabs/storage/StorageHelper.scala | 1 | 55.88%
src/main/scala/com/johnsnowlabs/util/ZipArchiveUtil.scala | 1 | 88.0%
src/main/scala/com/johnsnowlabs/reader/util/ImageParser.scala | 2 | 58.06%

Totals
Change from base Build 18345128934: 0.02%
Covered Lines: 11607
Relevant Lines: 21220

💛 - Coveralls

DevinTDHa added a commit that referenced this pull request Oct 9, 2025