[SPARKNLP-1299] Add Hierarchical Element Identification to HTMLReader #14675

danilojsl · 2025-10-18T01:34:36Z

Description

This PR enhances HTMLReader by introducing unique element and parent identifiers (element_id and parent_id) to each parsed HTML element.
The new metadata enables explicit hierarchical relationships between document structures (e.g., titles → paragraphs → links), unlocking new downstream applications such as hybrid retrieval, contextual analysis, and graph-based document reasoning.
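As a rough illustration of the title → paragraph → link relationship described above, the sketch below walks from an element up to its root by following `parent_id`. The record layout, the ID values, and the `type` labels are assumptions made for the example; the actual ID format produced by HTMLReader is not shown in this PR description.

```python
from typing import Dict, List, Optional

# Hypothetical element records: only element_id and parent_id come from the PR;
# the ID strings, "type" labels, and "content" key are illustrative assumptions.
elements: List[Dict[str, Optional[str]]] = [
    {"element_id": "e1", "parent_id": None, "type": "title", "content": "Installation"},
    {"element_id": "e2", "parent_id": "e1", "type": "paragraph", "content": "Install the package with pip."},
    {"element_id": "e3", "parent_id": "e2", "type": "link", "content": "pip documentation"},
]


def ancestor_chain(element_id: str, by_id: Dict[str, dict]) -> List[dict]:
    """Return the element followed by its ancestors, nearest parent first."""
    chain: List[dict] = []
    seen = set()  # guards against accidental cycles in the metadata
    current = by_id.get(element_id)
    while current is not None and current["element_id"] not in seen:
        seen.add(current["element_id"])
        chain.append(current)
        parent_id = current.get("parent_id")
        current = by_id.get(parent_id) if parent_id is not None else None
    return chain


by_id = {e["element_id"]: e for e in elements}
path = [e["type"] for e in ancestor_chain("e3", by_id)]
print(" <- ".join(path))
```

With real reader output, the same lookup would presumably be built from each element's metadata map rather than from hand-written dictionaries.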

In addition, this change passes metadata from previous pipeline stages through to Sentence Detectors.

Motivation and Context

Until now, the HTML reader extracted structural elements (titles, paragraphs, tables, links, etc.) independently, without explicit relationships between them.
While this was sufficient for content extraction and semantic embeddings, it limited the ability to:

Reconstruct document hierarchies (e.g., which paragraph belongs to which title).
Perform context-aware retrieval or narrative linking.
Build graph-based or hybrid search indexes that combine semantic embeddings with symbolic/document structure.

This change introduces element_id and parent_id fields to every element’s metadata, enabling hierarchical queries, local context awareness, and multi-level retrieval strategies.
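One way such a hierarchical query could look downstream is sketched here: elements are indexed by `parent_id`, and all text below a given title is collected in document order. This is a minimal sketch on hand-written sample data; the helper names (`children_index`, `section_text`), the ID values, and the `content` key are hypothetical and are not part of the Spark NLP API.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical sample: two titles, each with paragraphs that point to it via parent_id.
sample = [
    {"element_id": "t1", "parent_id": None, "content": "Setup"},
    {"element_id": "p1", "parent_id": "t1", "content": "Install Java."},
    {"element_id": "p2", "parent_id": "t1", "content": "Install Spark NLP."},
    {"element_id": "t2", "parent_id": None, "content": "Usage"},
    {"element_id": "p3", "parent_id": "t2", "content": "Load a pipeline."},
]


def children_index(elements: List[dict]) -> Dict[Optional[str], List[dict]]:
    """Group elements by parent_id, preserving their original order."""
    index: Dict[Optional[str], List[dict]] = defaultdict(list)
    for element in elements:
        index[element.get("parent_id")].append(element)
    return index


def section_text(root_id: str, index: Dict[Optional[str], List[dict]]) -> List[str]:
    """Collect the content of every descendant of root_id, depth-first."""
    parts: List[str] = []
    seen = set()  # avoids looping forever if the metadata contains a cycle
    stack = list(reversed(index.get(root_id, [])))
    while stack:
        node = stack.pop()
        if node["element_id"] in seen:
            continue
        seen.add(node["element_id"])
        parts.append(node["content"])
        stack.extend(reversed(index.get(node["element_id"], [])))
    return parts


index = children_index(sample)
setup_section = section_text("t1", index)
print(setup_section)
```

Retrieving a title together with its descendants in this way is what makes it possible to return local context alongside a semantic match, rather than an isolated paragraph.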

How Has This Been Tested?

Local Tests
Google Colab

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

danilojsl self-assigned this Oct 18, 2025

danilojsl added the enhancement label Oct 18, 2025

danilojsl force-pushed the SPARKNLP-1299-Add-Hierarchical-Element-Identification-to-HTMLReader branch from 94fba76 to 337d088 Compare October 18, 2025 01:36

[SPARKNLP-1299] Add Hierarchical Element Identification to HTMLReader

424e683

danilojsl force-pushed the SPARKNLP-1299-Add-Hierarchical-Element-Identification-to-HTMLReader branch from 337d088 to 424e683 Compare October 18, 2025 01:38

danilojsl marked this pull request as ready for review October 18, 2025 15:53

danilojsl requested a review from DevinTDHa October 18, 2025 15:53

danilojsl added 2 commits October 19, 2025 18:48

[SPARKNLP-1299] Include metadata to Sentence Detectors

6904911

[SPARKNLP-1299] Adding python test

b13d578

DevinTDHa approved these changes Oct 21, 2025

DevinTDHa changed the base branch from master to release/620-release-candidate October 21, 2025 13:24

DevinTDHa merged commit 67a4810 into release/620-release-candidate Oct 21, 2025
4 checks passed

DevinTDHa mentioned this pull request Oct 21, 2025

Spark NLP 6.2.0 Release #14676

Merged

10 tasks


3 participants
