[SPARKNLP-1086] Introducing DataFrameOptimizer #14607

danilojsl · 2025-06-17T23:38:43Z

Description

This PR introduces a new Transformer named DataFrameOptimizer. It:

- Allows configurable repartitioning based on either executorCores × numWorkers or a fixed numPartitions
- Supports optional DataFrame caching (doCache)
- Adds support for persisting the DataFrame:
  - Supported formats: csv, json, parquet
  - Custom writer options via outputOptions
- Preserves the original schema, which makes it useful as a pipeline utility
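
As a rough illustration of the repartitioning rule described above, the target partition count could be resolved along these lines. This is a minimal sketch in plain Python, not the code from this PR: the function name `resolve_num_partitions` is hypothetical, and the precedence (an explicit numPartitions wins over executorCores × numWorkers) is an assumption, since the description only says "OR".

```python
from typing import Optional


def resolve_num_partitions(
    executor_cores: Optional[int] = None,
    num_workers: Optional[int] = None,
    num_partitions: Optional[int] = None,
) -> Optional[int]:
    """Return the target partition count, or None to leave partitioning unchanged."""
    # Assumption: a fixed numPartitions takes precedence over the computed product.
    if num_partitions is not None:
        if num_partitions <= 0:
            raise ValueError("numPartitions must be positive")
        return num_partitions
    # Assumption: the product is only used when both values are provided.
    if executor_cores is not None and num_workers is not None:
        if executor_cores <= 0 or num_workers <= 0:
            raise ValueError("executorCores and numWorkers must be positive")
        return executor_cores * num_workers
    # Nothing configured: signal that no repartitioning should happen.
    return None


print(resolve_num_partitions(executor_cores=4, num_workers=5))
print(resolve_num_partitions(executor_cores=4, num_workers=5, num_partitions=8))
```

With 4 cores per executor and 5 workers the sketch yields 20 partitions; passing a fixed value of 8 overrides that product under the stated assumption.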

Motivation and Context

This Transformer is meant for optimizing and exporting intermediate pipeline outputs in Spark NLP workflows.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
[x] My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

Copilot

Pull Request Overview

This PR adds a new DataFrameOptimizer Transformer to let users repartition, cache, and persist DataFrames in both Scala and Python APIs, and introduces corresponding unit tests.

- Implements DataFrameOptimizer with configurable executorCores, numWorkers, numPartitions, doCache, and persistence options (persistPath, persistFormat, outputOptions).
- Adds Scala and Python test specs covering repartitioning logic, caching behavior, and (Scala) persistence.
- Updates documentation comments to explain usage and parameters.
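
The persistence options listed above suggest a simple up-front check before anything is written. The following is a hedged sketch in plain Python: `validate_persist_settings` is a hypothetical helper, and the rules it enforces (persistFormat required whenever persistPath is set, case-insensitive format matching) are assumptions rather than behavior confirmed by this PR. Only the three format names come from the description.

```python
from typing import Dict, Optional

# Formats named in the PR description.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def validate_persist_settings(
    persist_path: Optional[str],
    persist_format: Optional[str],
    output_options: Optional[Dict[str, str]] = None,
) -> Optional[Dict[str, object]]:
    """Return normalized persistence settings, or None if persistence is off."""
    if persist_path is None:
        return None  # persistence was not requested
    if not persist_path.strip():
        raise ValueError("persistPath must not be empty")
    # Assumption: a format must accompany any persist path.
    if persist_format is None:
        raise ValueError("persistFormat is required when persistPath is set")
    fmt = persist_format.strip().lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported persistFormat '{persist_format}'; "
            f"expected one of {SUPPORTED_FORMATS}"
        )
    # Copy the options so later mutation by the caller has no effect here.
    return {"path": persist_path, "format": fmt, "options": dict(output_options or {})}


print(validate_persist_settings("/tmp/out", "Parquet", {"compression": "snappy"}))
```

Failing early on an unsupported format keeps the error close to the configuration mistake instead of surfacing later from the writer.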

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizer.scala | Core Transformer implementation for partitioning, caching, and persistence |
| src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala | Scala tests for partition count, caching, and persistence |
| python/sparknlp/annotator/dataframe_optimizer.py | Python Transformer wrapper mirroring Scala functionality |
| python/test/annotator/dataframe_optimizer_test.py | Python tests for partitioning and caching behavior |

Comments suppressed due to low confidence (4)

src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:26

This test does not include any assertions and only calls show(). To make it an effective automated test, add assertions for expected partition count or cache status after transformation.

  "DataFrameOptimizer" should "optimize DataFrame operations" taggedAs FastTest in {

python/test/annotator/dataframe_optimizer_test.py:1

The Python tests cover partitioning and caching but lack a test for persistence behavior (persistPath and persistFormat). Add a test case verifying that files are written and can be read back.

#  Copyright 2017-2025 John Snow Labs

src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:52

The variables documentAssembler and sentenceDetector are not defined or imported in this test, which will cause a compile error. Please import or instantiate these stages before using them in the pipeline.

      .setStages(Array(dataFrameOptimizer, documentAssembler, sentenceDetector))

python/test/annotator/dataframe_optimizer_test.py:38

The trailing backslash ends the line continuation without chaining to the next statement, causing a syntax error. Remove the backslash or continue the method chain properly.

            .setDoCache(True) \
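
If the chain does end in a dangling backslash as this comment suggests, one common way to avoid the problem is to wrap the whole chain in parentheses so no continuation characters are needed. The sketch below uses a hypothetical stand-in class, `StubOptimizer`, so that it runs without Spark; `setDoCache` appears in the quoted test line, while `setNumPartitions` is assumed from the numPartitions parameter name.

```python
class StubOptimizer:
    """Hypothetical stand-in that only records setter calls (not the Spark NLP class)."""

    def __init__(self):
        self.params = {}

    def _set(self, name, value):
        self.params[name] = value
        return self  # returning self is what makes method chaining possible

    def setNumPartitions(self, value):  # assumed setter name
        return self._set("numPartitions", value)

    def setDoCache(self, value):
        return self._set("doCache", value)


# Parentheses let the chain span several lines without backslashes,
# so the last call cannot leave a stray continuation behind.
optimizer = (
    StubOptimizer()
    .setNumPartitions(4)
    .setDoCache(True)
)

print(optimizer.params)
```

The same parenthesized layout applies unchanged to the real transformer's setter chain in the test file.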

[SPARKNLP-1086] Introducing DataFrameOptimizer

d310f73

danilojsl marked this pull request as ready for review June 18, 2025 02:25

danilojsl requested review from DevinTDHa and Copilot June 18, 2025 02:25

danilojsl self-assigned this Jun 18, 2025

danilojsl added the new-feature label Jun 18, 2025

Copilot AI reviewed Jun 18, 2025

[SPARKNLP-1086] Adding validations and demo notebook

b994ef5

DevinTDHa changed the base branch from master to release/604-release-candidate June 23, 2025 10:00

DevinTDHa approved these changes Jun 23, 2025

DevinTDHa merged commit a5bca6b into release/604-release-candidate Jun 23, 2025
4 of 6 checks passed

DevinTDHa mentioned this pull request Jun 24, 2025

Spark NLP 6.0.4 Release #14611

Merged

10 tasks

DevinTDHa pushed a commit that referenced this pull request Jun 30, 2025

[SPARKNLP-1086] Introducing DataFrameOptimizer (#14607)

7538f7e

* [SPARKNLP-1086] Introducing DataFrameOptimizer
* [SPARKNLP-1086] Adding validations and demo notebook

3 participants
