-
Notifications
You must be signed in to change notification settings - Fork 733
[SPARKNLP-1086] Introducing DataFrameOptimizer #14607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARKNLP-1086] Introducing DataFrameOptimizer #14607
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new DataFrameOptimizer Transformer to let users repartition, cache, and persist DataFrames in both Scala and Python APIs, and introduces corresponding unit tests.
- Implements
DataFrameOptimizerwith configurableexecutorCores,numWorkers,numPartitions,doCache, and persistence options (persistPath,persistFormat,outputOptions). - Adds Scala and Python test specs covering repartitioning logic, caching behavior, and (Scala) persistence.
- Updates documentation comments to explain usage and parameters.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/main/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizer.scala | Core Transformer implementation for partitioning, caching, and persistence |
| src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala | Scala tests for partition count, caching, and persistence |
| python/sparknlp/annotator/dataframe_optimizer.py | Python Transformer wrapper mirroring Scala functionality |
| python/test/annotator/dataframe_optimizer_test.py | Python tests for partitioning and caching behavior |
Comments suppressed due to low confidence (4)
src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:26
- This test does not include any assertions and only calls
show(). To make it an effective automated test, add assertions for expected partition count or cache status after transformation.
"DataFrameOptimizer" should "optimize DataFrame operations" taggedAs FastTest in {
python/test/annotator/dataframe_optimizer_test.py:1
- The Python tests cover partitioning and caching but lack a test for persistence behavior (
persistPathandpersistFormat). Add a test case verifying that files are written and can be read back.
# Copyright 2017-2025 John Snow Labs
src/test/scala/com/johnsnowlabs/nlp/annotators/DataFrameOptimizerTestSpec.scala:52
- The variables
documentAssemblerandsentenceDetectorare not defined or imported in this test, which will cause a compile error. Please import or instantiate these stages before using them in the pipeline.
.setStages(Array(dataFrameOptimizer, documentAssembler, sentenceDetector))
python/test/annotator/dataframe_optimizer_test.py:38
- The trailing backslash ends the line continuation without chaining to the next statement, causing a syntax error. Remove the backslash or continue the method chain properly.
.setDoCache(True) \
a5bca6b
into
release/604-release-candidate
* [SPARKNLP-1086] Introducing DataFrameOptimizer * [SPARKNLP-1086] Adding validations and demo notebook
Description
This PR introduces a new
TransformernamedDataFrameOptimizerexecutorCores×numWorkersOR fixednumPartitionsdoCache)csv,json,parquetoutputOptionsMotivation and Context
Meant for optimizing and exporting intermediate pipeline outputs in Spark NLP workflows.
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: