[SPARKNLP-1299] Add Hierarchical Element Identification to HTMLReader #14675
Description
This PR enhances HTMLReader by introducing unique element and parent identifiers (element_id and parent_id) to each parsed HTML element. The new metadata enables explicit hierarchical relationships between document structures (e.g., titles → paragraphs → links), unlocking new downstream applications such as hybrid retrieval, contextual analysis, and graph-based document reasoning.
In addition, this change passes metadata from previous pipeline stages through to Sentence Detectors.
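To illustrate how element_id and parent_id could be used downstream, here is a minimal plain-Python sketch that walks from a title to everything nested under it. It does not call Spark NLP. Only the element_id and parent_id field names come from this PR; the record layout, the ID values, the other keys, and the use of None as the parent of top-level elements are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records shaped like the metadata this PR describes.
# Only "element_id" and "parent_id" are named in the PR; the other keys,
# the ID values, and None for top-level parents are illustrative assumptions.
elements = [
    {"element_id": "t1", "parent_id": None, "type": "title", "text": "Installation"},
    {"element_id": "p1", "parent_id": "t1", "type": "paragraph", "text": "Run the installer."},
    {"element_id": "a1", "parent_id": "p1", "type": "link", "text": "Download page"},
    {"element_id": "t2", "parent_id": None, "type": "title", "text": "Usage"},
    {"element_id": "p2", "parent_id": "t2", "type": "paragraph", "text": "Load the reader."},
]


def build_children_index(records):
    """Group records by parent_id so children can be looked up directly."""
    children = defaultdict(list)
    for record in records:
        children[record["parent_id"]].append(record)
    return children


def descendants(element_id, children):
    """Return every element nested under element_id, depth-first, in document order."""
    result = []
    stack = list(reversed(children.get(element_id, [])))
    while stack:
        node = stack.pop()
        result.append(node)
        # Push this node's children so they are visited before its siblings.
        stack.extend(reversed(children.get(node["element_id"], [])))
    return result


index = build_children_index(elements)
section = descendants("t1", index)
print([e["element_id"] for e in section])
```

The same index also supports the reverse lookup (following parent_id upward from a link to its enclosing title), which is the kind of local context a retrieval step might attach to a matched element.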
Motivation and Context
Until now, the HTML reader extracted structural elements (titles, paragraphs, tables, links, etc.) independently, without explicit relationships between them.
While this was sufficient for content extraction and semantic embeddings, it limited the ability to:
This change introduces element_id and parent_id fields to every element’s metadata, enabling hierarchical queries, local context awareness, and multi-level retrieval strategies.
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: