SPARKNLP-1259 Introducing Reader2Doc Annotator #14632

danilojsl · 2025-07-20T16:18:25Z

Description

This PR introduces the new Reader2Doc annotator to Spark NLP. It provides a single, streamlined interface for using Spark NLP readers and for plugging their output directly into Spark NLP pipelines.

Key Improvements:

- Simplifies integration with Spark NLP readers through a unified interface
- Adds flexibility by enabling more reader-specific configurations
- Enhances the maintainability and scalability of data loading workflows

Supported formats include:
- PDFs
- Plain text
- HTML
- Word (.doc/.docx)
- Excel (.xls/.xlsx)
- PowerPoint (.ppt/.pptx)
- Email files (.eml, .msg)
- Markdown (.md)
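
As a rough illustration of how the formats above could be routed through a single annotator, the sketch below maps file extensions to standard MIME types and builds one Reader2Doc stage from the result. The helper names (`CONTENT_TYPES`, `content_type_for`, `build_reader_stage`) are hypothetical and not part of this PR. The import path and the `setContentType` / `setContentPath` / `setOutputCol` setters are assumed from the annotator's documentation and should be checked against it, as should the exact content-type strings Reader2Doc accepts.

```python
from pathlib import Path

# Standard MIME types for the formats listed above.
# Assumption: Reader2Doc accepts these exact strings as content types.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".html": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".md": "text/markdown",
}


def content_type_for(path: str) -> str:
    """Return the MIME type for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


def build_reader_stage(path: str, output_col: str = "document"):
    """Build a Reader2Doc stage for one file or directory (assumed Python API)."""
    # Imported lazily so content_type_for() works without Spark NLP installed.
    from sparknlp.reader.reader2doc import Reader2Doc

    return (
        Reader2Doc()
        .setContentType(content_type_for(path))
        .setContentPath(path)
        .setOutputCol(output_col)
    )
```

Under these assumptions, switching data sources only changes the path passed to `build_reader_stage`; the returned stage would go into a `pyspark.ml.Pipeline` like any other annotator.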

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Reader2Doc component abstracts these complexities by:

- Unifying access patterns for multiple readers
- Reducing boilerplate code in reader configuration
- Making it easier to scale and switch between different data sources

This feature is part of the requirements described in issue #14624.

How Has This Been Tested?

Screenshots (if appropriate):

- Local tests
- Google Colab notebook
- Databricks notebook

Types of changes

- Bug fix (non-breaking change which fixes an issue)
- Code improvements with no or little impact
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

- My code follows the code style of this project.
- My change requires a change to the documentation.
- I have updated the documentation accordingly.
- I have read the CONTRIBUTING page.
- I have added tests to cover my changes.
- All new and existing tests passed.

danilojsl added 7 commits July 11, 2025 14:25

[SPARKNLP-1235] Adding CSV Reader

66eb2cd

[SPARKNLP-1259] Enhancing HTMLReader parsing capabilities

2059370

[SPARKNLP-1259] Adding sentence metadata to TextReader

8a597cc

[SPARKNLP-1259] Introducing Reader2Doc Annotator

7a06bd4

[SPARKNLP-1259] Adding XML support to Reader2Doc

d1135d5

[SPARKNLP-1259] Adding Reader2Doc documentation

4085485

[SPARKNLP-1259] Adding missing file for readers tests

05eddbe

danilojsl added the new-feature label Jul 20, 2025

danilojsl added 2 commits July 20, 2025 19:14

[SPARKNLP-1259] Adding Reader2Doc demo notebook

1ff58c0

[SPARKNLP-1259] Adding slow mark for URLs readers tests

e60f610

danilojsl requested review from DevinTDHa and mehmetbutgul July 21, 2025 13:20

danilojsl self-assigned this Jul 21, 2025

danilojsl marked this pull request as ready for review July 21, 2025 13:20

danilojsl changed the base branch from master to release/610-release-candidate July 23, 2025 11:44

[SPARKNLP-1259] Adjust doc

0fa5541

DevinTDHa approved these changes Jul 23, 2025

DevinTDHa merged commit 7e6e464 into release/610-release-candidate Jul 23, 2025
3 of 4 checks passed

DevinTDHa mentioned this pull request Jul 23, 2025

Spark NLP 6.1.0 Release #14634

Merged

10 tasks

3 participants
