[SPARK-45386][SQL][3.5] Fix correctness issue with persist using StorageLevel.NONE on Dataset #43213

eejbyfeldt · 2023-10-04T10:32:41Z

What changes were proposed in this pull request?

Support for InMemoryTableScanExec in AQE was added in #39624, but that patch introduced a bug that appears when a Dataset is persisted using StorageLevel.NONE. Before that patch, a query like:

import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()

would correctly return 2, but after that patch it incorrectly returns 0. This is because AQE wrongly concludes, based on the runtime statistics that are collected here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

Line 294 in eac5a8c

rowCountStats.add(batch.numRows)

that the input is empty. The problem is that the action that should make sure the statistics are collected here

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala

Lines 285 to 291 in eac5a8c

    
    sparkContext.submitJob(
      rdd,
      (_: Iterator[CachedBatch]) => (),
      (0 until rdd.getNumPartitions).toSeq,
      (_: Int, _: Unit) => (),
      ()
    )

never uses the iterator, and with StorageLevel.NONE persisting does not use the iterator either, so the correct statistics are never gathered.
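The mechanism can be illustrated without Spark. The following is a minimal sketch in plain Scala, assuming only that the statistic is updated as a side effect of pulling batches from a lazy iterator. `LazyStatsSketch`, `Batch`, `countingIterator`, and `rowsSeen` are hypothetical names for illustration and are not part of Spark.

```scala
// Toy model, not Spark code: every name here is hypothetical.
object LazyStatsSketch {
  // Stand-in for a cached batch that knows its row count.
  final case class Batch(numRows: Int)

  // Wraps the batches in a lazy iterator that updates `rowCount` as a side
  // effect, so the statistic is recorded only while batches are pulled.
  def countingIterator(batches: Seq[Batch], rowCount: Array[Long]): Iterator[Batch] =
    batches.iterator.map { batch =>
      rowCount(0) += batch.numRows
      batch
    }

  // Runs `job` over the counting iterator and returns the row count observed.
  def rowsSeen(job: Iterator[Batch] => Unit): Long = {
    val rowCount = Array(0L)
    job(countingIterator(Seq(Batch(1), Batch(1)), rowCount))
    rowCount(0)
  }

  def main(args: Array[String]): Unit = {
    // A job that ignores its iterator, like `(_: Iterator[CachedBatch]) => ()`.
    println(rowsSeen(_ => ())) // prints 0
    // A job that drains its iterator.
    println(rowsSeen(it => it.foreach(_ => ()))) // prints 2
  }
}
```

In this toy model the first job reports zero rows even though two rows exist, which mirrors how an undrained iterator can make the input look empty.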

The proposed fix simply makes calling persist with StorageLevel.NONE a no-op. Changing the action so that it always "emptied" the iterator would also work, but that seems like unnecessary work in most normal circumstances.
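The shape of such a no-op guard can be sketched in plain Scala. This is a toy model under stated assumptions, not the code changed by this PR: `NoOpPersistSketch`, `CacheRegistry`, `NoStorage`, and `MemoryOnly` are hypothetical stand-ins, and the real change lives in Spark's cache handling (the commit list mentions moving it to CacheManager).

```scala
import scala.collection.mutable

// Toy model, not Spark's CacheManager: every name here is hypothetical.
object NoOpPersistSketch {
  sealed trait Level
  case object NoStorage extends Level  // stand-in for StorageLevel.NONE
  case object MemoryOnly extends Level // stand-in for a level that stores data

  // Minimal registry of "cached" plans, keyed by a plan name.
  final class CacheRegistry {
    private val cached = mutable.Map.empty[String, Level]

    // Registers the plan unless the level stores nothing, in which case the
    // call returns without touching the registry.
    def cacheQuery(plan: String, level: Level): Unit = level match {
      case NoStorage => () // nothing would be stored, so do nothing
      case other =>
        if (!cached.contains(plan)) cached(plan) = other
    }

    def isCached(plan: String): Boolean = cached.contains(plan)
  }
}
```

With a guard of this kind, a plan persisted with the "no storage" level is never registered as cached, so a later query would presumably run against the original plan and never consult statistics that were not collected.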

Why are the changes needed?

The current code has a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes the correctness issue.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

WeichenXu123

LGTM

dongjoon-hyun

+1, LGTM (Pending CIs). Thank you, @eejbyfeldt.

HyukjinKwon · 2023-10-05T00:40:36Z

The test failure was due to 2a9dd2b. I reverted it out of branch-3.5 (522af69).

HyukjinKwon · 2023-10-05T00:40:44Z

Merged to branch-3.5.

…ageLevel.NONE on Dataset

Closes #43213 from eejbyfeldt/SPARK-45386-branch-3.5.

Authored-by: Emil Ejbyfeldt
Signed-off-by: Hyukjin Kwon

[SPARK-45386][SQL]: Fix correctness issue with persist using StorageL…

75b7070

…evel.NONE on Dataset (apache#43188) * SPARK-45386: Fix correctness issue with StorageLevel.NONE * Move to CacheManager * Add comment

github-actions bot added the SQL label Oct 4, 2023

eejbyfeldt mentioned this pull request Oct 4, 2023

[SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset #43188

Merged

WeichenXu123 approved these changes Oct 4, 2023

WeichenXu123 requested a review from HyukjinKwon October 4, 2023 13:32

dongjoon-hyun changed the title ~~[SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset for branch 3.5~~ [SPARK-45386][SQL][3.5] Fix correctness issue with persist using StorageLevel.NONE on Dataset Oct 4, 2023

dongjoon-hyun approved these changes Oct 4, 2023

HyukjinKwon approved these changes Oct 5, 2023

HyukjinKwon closed this Oct 5, 2023
