[SPARKNLP-1119] Adding XML reader #14598
Merged
Description
This pull request adds a reader that parses XML files into a structured Spark DataFrame. The resulting DataFrame can be processed with Spark and passed to Spark NLP pipelines for downstream natural language processing tasks.
Added
sparknlp.read().xml(): This method accepts file paths of XML content.
Use in Partition:
Motivation and Context
Structured Data Representation: By transforming raw XML content into a well-defined DataFrame structure, we enable seamless integration with Spark's powerful analytical and data processing capabilities.
Scalability: Leveraging Spark’s distributed architecture, this feature supports the efficient processing of large volumes of XML data, critical for big data and enterprise-level NLP workflows.
Simplified Data Manipulation: A structured DataFrame representation of XML simplifies common data manipulation tasks such as filtering, aggregation, and transformation, thereby reducing complexity and improving productivity.
Enhanced Context for LLM Tasks: Converting XML documents into structured formats allows for more precise content extraction and context-aware processing, enhancing prompt quality and relevance for large language models (LLMs) in downstream NLP applications like information retrieval, question answering, and summarization.
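To make the idea of a structured representation concrete, here is a minimal sketch using only Python's standard library. It is not the Spark NLP reader and does not reflect its output schema: the function name `xml_to_rows`, the row keys (`path`, `tag`, `text`, `attributes`), and the sample document are all assumptions chosen for illustration. It flattens an XML string into one record per element, which is roughly the shape that makes filtering and aggregation straightforward.

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Flatten an XML string into one dict per element that has text or attributes.

    Hypothetical helper for illustration; the row keys are not Spark NLP's schema.
    """
    root = ET.fromstring(xml_text)
    rows = []

    def walk(element, parent_path):
        # Build a slash-separated path from the root down to this element.
        path = f"{parent_path}/{element.tag}"
        text = (element.text or "").strip()
        # Skip pure container elements that carry neither text nor attributes.
        if text or element.attrib:
            rows.append(
                {
                    "path": path,
                    "tag": element.tag,
                    "text": text,
                    "attributes": dict(element.attrib),
                }
            )
        for child in element:
            walk(child, path)

    walk(root, "")
    return rows


# Made-up sample document, used only to exercise the helper.
SAMPLE = """<library>
  <book id="b1"><title>Dune</title><author>Frank Herbert</author></book>
  <book id="b2"><title>Emma</title><author>Jane Austen</author></book>
</library>"""

rows = xml_to_rows(SAMPLE)

# Once the content is tabular, filtering is a simple predicate over rows.
titles = [row["text"] for row in rows if row["tag"] == "title"]
```

With rows in this shape, the same filter could be expressed as a DataFrame `where` clause once the data is loaded into Spark; the sketch only shows why a flat, per-element layout is convenient.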
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: