Contributor

@danilojsl danilojsl commented Jun 17, 2025

Description

This PR introduces a new Transformer named DataFrameOptimizer:

  • Allows configurable repartitioning based on:
    • executorCores × numWorkers OR fixed numPartitions
  • Supports optional DataFrame caching (doCache)
  • Adds support for persisting the DataFrame using:
    • Supported formats: csv, json, parquet
    • Custom writer options via outputOptions
  • Preserves the original schema, which makes it usable as a pipeline utility
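
The repartitioning rule in the list above can be sketched in plain Python. This is only an illustration, not the code in this PR: the helper name `resolve_num_partitions` is hypothetical, and the precedence (a fixed numPartitions wins over executorCores × numWorkers) is an assumption that the description does not confirm.

```python
from typing import Optional


def resolve_num_partitions(
    executor_cores: Optional[int] = None,
    num_workers: Optional[int] = None,
    num_partitions: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical helper, not part of Spark NLP.

    Returns the target partition count, or None when nothing is
    configured and the DataFrame would be left as it is.
    """
    # Assumption: an explicit numPartitions takes precedence.
    if num_partitions is not None:
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        return num_partitions
    # Otherwise use executorCores x numWorkers when both are set.
    if executor_cores is not None and num_workers is not None:
        if executor_cores <= 0 or num_workers <= 0:
            raise ValueError("executor_cores and num_workers must be positive")
        return executor_cores * num_workers
    return None


print(resolve_num_partitions(executor_cores=4, num_workers=3))
print(resolve_num_partitions(executor_cores=4, num_workers=3, num_partitions=8))
```

A count derived this way would typically be passed to `DataFrame.repartition`; whether the Transformer does exactly that is not stated here.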

Motivation and Context

This Transformer is meant for optimizing and exporting intermediate pipeline outputs in Spark NLP workflows.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • [x] My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl marked this pull request as ready for review June 18, 2025 02:25
@danilojsl danilojsl requested review from DevinTDHa and Copilot June 18, 2025 02:25
@danilojsl danilojsl self-assigned this Jun 18, 2025
@danilojsl danilojsl added the new-feature (Introducing a new feature) label Jun 18, 2025
Copilot AI left a comment

Pull Request Overview

This PR adds a new DataFrameOptimizer Transformer to let users repartition, cache, and persist DataFrames in both Scala and Python APIs, and introduces corresponding unit tests.

  • Implements DataFrameOptimizer with configurable executorCores, numWorkers, numPartitions, doCache, and persistence options (persistPath, persistFormat, outputOptions).
  • Adds Scala and Python test specs covering repartitioning logic, caching behavior, and (Scala) persistence.
  • Updates documentation comments to explain usage and parameters.
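
As a rough illustration of how the persistence options named above might be checked before writing, here is a minimal sketch. The helper `check_persist_settings` is hypothetical, and its rules are assumptions: persistence is skipped when no path is set, and the format is matched case-insensitively. Only the three format names come from the PR description.

```python
from typing import Dict, Optional

# Formats listed in the PR description.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def check_persist_settings(
    persist_path: Optional[str],
    persist_format: str,
    output_options: Optional[Dict[str, str]] = None,
) -> Optional[dict]:
    """Hypothetical helper, not part of Spark NLP."""
    # Assumption: no path means persistence is disabled.
    if not persist_path:
        return None
    # Assumption: format names are compared case-insensitively.
    fmt = persist_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported format {persist_format!r}; expected one of {SUPPORTED_FORMATS}"
        )
    # Copy the options so the caller's dict is never mutated.
    return {"path": persist_path, "format": fmt, "options": dict(output_options or {})}


print(check_persist_settings("/tmp/out", "CSV", {"header": "true"}))
```

In PySpark, settings like these would usually feed `df.write.format(...).options(**...).save(...)`; whether DataFrameOptimizer writes this way is an assumption.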

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizer.scala | Core Transformer implementation for partitioning, caching, and persistence |
| src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala | Scala tests for partition count, caching, and persistence |
| python/sparknlp/annotator/dataframe_optimizer.py | Python Transformer wrapper mirroring Scala functionality |
| python/test/annotator/dataframe_optimizer_test.py | Python tests for partitioning and caching behavior |

Comments suppressed due to low confidence (4)

src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:26

  • This test does not include any assertions and only calls show(). To make it an effective automated test, add assertions for expected partition count or cache status after transformation.
  "DataFrameOptimizer" should "optimize DataFrame operations" taggedAs FastTest in {

python/test/annotator/dataframe_optimizer_test.py:1

  • The Python tests cover partitioning and caching but lack a test for persistence behavior (persistPath and persistFormat). Add a test case verifying that files are written and can be read back.
#  Copyright 2017-2025 John Snow Labs

src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:52

  • The variables documentAssembler and sentenceDetector are not defined or imported in this test, which will cause a compile error. Please import or instantiate these stages before using them in the pipeline.
      .setStages(Array(dataFrameOptimizer, documentAssembler, sentenceDetector))

python/test/annotator/dataframe_optimizer_test.py:38

  • The trailing backslash ends the line continuation without chaining to the next statement, causing a syntax error. Remove the backslash or continue the method chain properly.
            .setDoCache(True) \
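
The line-continuation issue described in this comment can be reproduced with the standard library alone. The sketch below uses `str` methods as a stand-in for the setter chain, so the variable names and chain contents are illustrative and not taken from the test file. It shows a trailing backslash splicing the next statement onto the chain, and a parenthesized chain avoiding the problem.

```python
import ast


def parses(source: str) -> bool:
    """Return True if the source text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Stand-in for a setter chain; str methods replace the real Transformer calls.
# The trailing backslash joins the next statement onto the same logical line.
broken = "opt = 'x'.strip() \\\nresult = opt.upper()\n"

# Wrapping the chain in parentheses removes the need for backslashes.
fixed = "opt = (\n    'x'\n    .strip()\n)\nresult = opt.upper()\n"

print(parses(broken))
print(parses(fixed))
```

Whether the flagged line in the test file actually fails depends on what follows it, which this excerpt does not show.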

@DevinTDHa DevinTDHa changed the base branch from master to release/604-release-candidate June 23, 2025 10:00
@DevinTDHa DevinTDHa merged commit a5bca6b into release/604-release-candidate Jun 23, 2025
4 of 6 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Jun 24, 2025
10 tasks
DevinTDHa pushed a commit that referenced this pull request Jun 30, 2025
* [SPARKNLP-1086] Introducing DataFrameOptimizer

* [SPARKNLP-1086] Adding validations and demo notebook
3 participants