Contributor

@danilojsl danilojsl commented Jul 20, 2025

Description

This PR introduces the new Reader2Doc annotator to Spark NLP. It provides a single, streamlined interface to the Spark NLP readers and lets their output feed directly into Spark NLP pipelines.

Key Improvements:

  • Simplifies integration with Spark NLP readers through a unified interface
  • Adds flexibility by enabling more reader-specific configurations
  • Enhances the maintainability and scalability of data loading workflows

Supported formats include:
  • PDFs
  • Plain text
  • HTML
  • Word (.doc/.docx)
  • Excel (.xls/.xlsx)
  • PowerPoint (.ppt/.pptx)
  • Email files (.eml, .msg)
  • Markdown (.md)
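
For reference, the formats listed above correspond to the following commonly used MIME types. The sketch below is only an illustration: the `content_type_for` helper and the `CONTENT_TYPES` table are hypothetical names, not part of this PR, and the `.txt`, `.html`, and `.htm` extensions are assumptions, since the list gives no extension for plain text or HTML. Whether Reader2Doc accepts exactly these content-type strings should be checked against its documentation.

```python
from pathlib import Path

# Assumption: widely used MIME types for the formats listed above.
# This table is illustrative and is not taken from the Reader2Doc source.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".html": "text/html",
    ".htm": "text/html",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".eml": "message/rfc822",
    ".msg": "application/vnd.ms-outlook",
    ".md": "text/markdown",
}


def content_type_for(path: str) -> str:
    """Return the MIME type for a file, chosen by its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        # Fail loudly rather than guessing a type for an unlisted format.
        raise ValueError(f"Unsupported file type: {suffix or path!r}") from None
```

A helper like this could be used to pick a content type per file before configuring a reader, for example `content_type_for("slides/intro.pptx")`.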

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Reader2Doc component abstracts these complexities by:

  • Unifying access patterns for multiple readers
  • Reducing boilerplate code in reader configuration
  • Making it easier to scale and switch between different data sources
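
The idea of one access pattern over several readers can be sketched in plain Python. This is not the Reader2Doc implementation: the registry, the `register` decorator, the two toy readers, and the `read` entry point are all hypothetical names chosen for illustration, and the readers work on in-memory strings rather than files.

```python
from typing import Callable, Dict, List

# Hypothetical reader signature: raw text in, list of document strings out.
Reader = Callable[[str], List[str]]

# Hypothetical registry mapping a content type to its reader function.
_READERS: Dict[str, Reader] = {}


def register(content_type: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader function under a content type."""
    def decorator(reader: Reader) -> Reader:
        _READERS[content_type] = reader
        return reader
    return decorator


@register("text/plain")
def read_plain(raw: str) -> List[str]:
    # Toy rule: one document per non-empty, blank-line-separated paragraph.
    return [p.strip() for p in raw.split("\n\n") if p.strip()]


@register("text/markdown")
def read_markdown(raw: str) -> List[str]:
    # Toy rule: strip leading '#' markers from each non-blank line.
    return [line.lstrip("#").strip() for line in raw.splitlines() if line.strip()]


def read(raw: str, content_type: str) -> List[str]:
    """Single entry point: dispatch to the reader registered for content_type."""
    try:
        reader = _READERS[content_type]
    except KeyError:
        raise ValueError(f"No reader registered for {content_type!r}") from None
    return reader(raw)
```

With this shape, switching data sources means changing only the content type passed to `read`, and supporting a new format means registering one more function; callers do not change.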

This feature is part of the requirements described in issue #14624.

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab notebook
  • Databricks notebook

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl added the new-feature Introducing a new feature label Jul 20, 2025
@danilojsl danilojsl self-assigned this Jul 21, 2025
@danilojsl danilojsl marked this pull request as ready for review July 21, 2025 13:20
@danilojsl danilojsl changed the base branch from master to release/610-release-candidate July 23, 2025 11:44
@DevinTDHa DevinTDHa merged commit 7e6e464 into release/610-release-candidate Jul 23, 2025
3 of 4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Jul 23, 2025
10 tasks