@danilojsl danilojsl commented Oct 18, 2025

Description

This PR enhances HTMLReader by introducing unique element and parent identifiers (element_id and parent_id) to each parsed HTML element.
The new metadata enables explicit hierarchical relationships between document structures (e.g., titles → paragraphs → links), unlocking new downstream applications such as hybrid retrieval, contextual analysis, and graph-based document reasoning.

In addition, this change adds metadata information from previous pipeline stages to Sentence Detectors.

Motivation and Context

Until now, the HTML reader extracted structural elements (titles, paragraphs, tables, links, etc.) independently, without explicit relationships between them.
While this was sufficient for content extraction and semantic embeddings, it limited the ability to:

  • Reconstruct document hierarchies (e.g., which paragraph belongs to which title).
  • Perform context-aware retrieval or narrative linking.
  • Build graph-based or hybrid search indexes that combine semantic embeddings with symbolic/document structure.

This change introduces element_id and parent_id fields to every element’s metadata, enabling hierarchical queries, local context awareness, and multi-level retrieval strategies.
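As a rough illustration of the hierarchical queries this enables, the sketch below groups elements by `parent_id` and walks up the parent chain to collect context. It is plain Python and does not call the HTMLReader API. The record layout (`elementType`, `content`, `metadata`), the sample values, and the helper names `children_by_parent` and `ancestors` are assumptions made for illustration; only the `element_id` and `parent_id` metadata keys come from this PR.

```python
from collections import defaultdict

# Hypothetical parsed elements. The record layout and the id values are
# assumptions; only the element_id / parent_id keys come from this PR.
elements = [
    {"elementType": "Title", "content": "Introduction",
     "metadata": {"element_id": "t1"}},
    {"elementType": "NarrativeText", "content": "Spark NLP reads HTML.",
     "metadata": {"element_id": "p1", "parent_id": "t1"}},
    {"elementType": "Link", "content": "Documentation",
     "metadata": {"element_id": "l1", "parent_id": "p1"}},
    {"elementType": "NarrativeText", "content": "A second paragraph.",
     "metadata": {"element_id": "p2", "parent_id": "t1"}},
]

# Lookup table from element_id to the element itself.
by_id = {el["metadata"]["element_id"]: el for el in elements}


def children_by_parent(elements):
    """Group elements under the element_id of their parent."""
    index = defaultdict(list)
    for el in elements:
        parent_id = el["metadata"].get("parent_id")
        if parent_id is not None:
            index[parent_id].append(el)
    return index


def ancestors(element, by_id):
    """Return the chain of parents, nearest first, stopping on unknown ids or cycles."""
    chain = []
    seen = set()
    parent_id = element["metadata"].get("parent_id")
    while parent_id is not None and parent_id in by_id and parent_id not in seen:
        seen.add(parent_id)
        parent = by_id[parent_id]
        chain.append(parent)
        parent_id = parent["metadata"].get("parent_id")
    return chain


# Which paragraphs belong to the title "t1"?
paragraphs = [el["content"] for el in children_by_parent(elements)["t1"]]

# What is the surrounding context of the link "l1"?
context = [el["content"] for el in ancestors(by_id["l1"], by_id)]
```

A retrieval layer could use `ancestors` to attach the enclosing paragraph and title to a matched link before ranking, which is one way to combine embeddings with document structure.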

How Has This Been Tested?

  • Local Tests
  • Google Colab

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl self-assigned this Oct 18, 2025
@danilojsl danilojsl force-pushed the SPARKNLP-1299-Add-Hierarchical-Element-Identification-to-HTMLReader branch from 94fba76 to 337d088 Compare October 18, 2025 01:36
@danilojsl danilojsl force-pushed the SPARKNLP-1299-Add-Hierarchical-Element-Identification-to-HTMLReader branch from 337d088 to 424e683 Compare October 18, 2025 01:38
@danilojsl danilojsl marked this pull request as ready for review October 18, 2025 15:53
@danilojsl danilojsl requested a review from DevinTDHa October 18, 2025 15:53
@DevinTDHa DevinTDHa changed the base branch from master to release/620-release-candidate October 21, 2025 13:24
@DevinTDHa DevinTDHa merged commit 67a4810 into release/620-release-candidate Oct 21, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Oct 21, 2025
10 tasks