[SPARKNLP-1119] Adding XML reader #14598

danilojsl · 2025-06-09T20:16:16Z

Description

This pull request adds a reader that parses XML files into a structured Spark DataFrame, so that XML content can be processed with Spark and passed to Spark NLP for downstream natural language processing tasks.

Added sparknlp.read().xml(): this method accepts file paths to XML content.

Use in Partition:

partitioner = Partition(content_type="application/xml").partition(xml_directory)
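
To make "parsing XML into a structured form" concrete, here is a minimal sketch using only Python's standard library. It is not the Spark NLP implementation, and the row fields (`tag`, `text`, `attributes`) are hypothetical names chosen for illustration; the PR description does not state the reader's actual output schema.

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Flatten an XML string into one dict per element that carries text.

    Illustrative only: the field names are assumptions, not the
    schema produced by the Spark NLP XML reader.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():  # document-order walk over all elements
        text = (elem.text or "").strip()
        if text:  # skip pure container elements
            rows.append(
                {"tag": elem.tag, "text": text, "attributes": dict(elem.attrib)}
            )
    return rows


sample = """<note>
  <entry priority="high">
    <title lang="en">Release checklist</title>
    <body>Run the full test suite before tagging.</body>
  </entry>
</note>"""

rows = xml_to_rows(sample)
for row in rows:
    print(row)
```

Each dict corresponds to what could be one row of a tabular structure; once the content is in that shape, filtering by tag or attribute is a simple column operation.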

Motivation and Context

Structured Data Representation: Transforming raw XML content into a well-defined DataFrame structure lets it be used directly with Spark's analytical and data processing capabilities.
Scalability: Because the reader runs on Spark's distributed architecture, it can process large volumes of XML data, which matters for big data and enterprise-level NLP workflows.
Simplified Data Manipulation: A structured DataFrame representation of XML simplifies common data manipulation tasks such as filtering, aggregation, and transformation.
Enhanced Context for LLM Tasks: Converting XML documents into structured formats allows more precise content extraction and context-aware processing, which can improve prompt quality and relevance for large language models (LLMs) in downstream NLP applications such as information retrieval, question answering, and summarization.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

[SPARKNLP-1119] Adding XML reader

6e782e1

danilojsl self-assigned this Jun 9, 2025

danilojsl requested a review from DevinTDHa June 9, 2025 20:16

[SPARKNLP-1119] Adding documentation for XML reader [skip test]

9b45456

DevinTDHa approved these changes Jun 10, 2025

DevinTDHa merged commit 1605312 into release/603-release-candidate Jun 10, 2025
4 of 6 checks passed

DevinTDHa mentioned this pull request Jun 10, 2025

Spark NLP 6.0.3 #14600

Merged

10 tasks

3 participants
