[SPARK-45386][SQL][3.5] Fix correctness issue with persist using StorageLevel.NONE on Dataset #43213
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Support for InMememoryTableScanExec in AQE was added in #39624, but this patch contained a bug when a Dataset is persisted using
StorageLevel.NONE. Before that patch a query like:would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
Line 294 in eac5a8c
that the input is empty. The problem is that the action that should make sure the statistics are collected here
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala
Lines 285 to 291 in eac5a8c
never use the iterator and when we have
StorageLevel.NONEthe persisting will also not use the iterator and we will not gather the correct statistics.The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.
Why are the changes needed?
The current code has a correctness issue.
Does this PR introduce any user-facing change?
Yes, fixes the correctness issue.
How was this patch tested?
New and existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No