[SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate #37348
Changes from all commits
@@ -143,6 +143,48 @@ class SparkPlanSuite extends QueryTest with SharedSparkSession

      }
    }
  }

  test("SPARK-39854: replaceWithAliases should keep the order of Generate children") {
Member: Thank you for providing the end-to-end test. Can we have a test case in […]?

Author: Hi @dongjoon-hyun, thanks for taking the time to look into this! The test is put in the […]

Member: If we add an end-to-end test, the Apache Spark test time increases prohibitively. We prefer to narrow down the issue and have isolated unit tests. So, in this PR, […]

Author: I understand. I will try to create a test case in […].
    import org.apache.spark.sql.functions.{explode, struct}
    import org.apache.spark.sql.SparkSession

Member, on lines +148 to +149: We usually put imports at the beginning.
    val ss: SparkSession = spark
    import ss.implicits._
    val testJson =
      """{
        | "b": {
        |   "id": "id00",
        |   "data": [{
        |     "b1": "vb1",
        |     "b2": 101,
        |     "ex2": [
        |       { "fb1": false, "fb2": 11, "fb3": "t1" },
        |       { "fb1": true, "fb2": 12, "fb3": "t2" }
        |     ]}, {
        |     "b1": "vb2",
        |     "b2": 102,
        |     "ex2": [
        |       { "fb1": false, "fb2": 13, "fb3": "t3" },
        |       { "fb1": true, "fb2": 14, "fb3": "t4" }
        |     ]}
        |   ],
        |   "fa": "tes",
        |   "v": "1.5"
        | }
        |}
        |""".stripMargin
    // Two nested explodes over b.data.ex2 (2 data entries x 2 ex2 entries = 4 rows),
    // plus a struct built from other nested fields of b; b and ex_b are then dropped.
    val df = spark.read.json((testJson :: Nil).toDS())
      .withColumn("ex_b", explode($"b.data.ex2"))
      .withColumn("ex_b2", explode($"ex_b"))
    val df1 = df
      .withColumn("rt", struct(
        $"b.fa".alias("rt_fa"),
        $"b.v".alias("rt_v")
      ))
      .drop("b", "ex_b")

    val result = df1.collect()
    assert(result.length == 4)
  }
}

case class ColumnarOp(child: SparkPlan) extends UnaryExecNode {
Hmm, we intend to replace some attributes with their nested fields when only those fields are accessed on top of the plan, so that unused fields can be pruned later. If we keep the original outputs, I think pruning will not actually work.
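To make that concrete, here is a hand-written illustration of the rewrite the rule is after, expressed at the DataFrame level against the test query above. It is only a sketch: it reuses `spark`, `testJson`, and the imports from that test, the intermediate column names `_ex2`, `_fa`, `_v` are made up, and this is not the optimizer's actual output.

```scala
// Unpruned shape: the whole struct column `b` is carried through both explodes
// and its leaves are only extracted at the very top.
val unpruned = spark.read.json((testJson :: Nil).toDS())
  .withColumn("ex_b", explode($"b.data.ex2"))
  .withColumn("ex_b2", explode($"ex_b"))
  .withColumn("rt", struct($"b.fa".alias("rt_fa"), $"b.v".alias("rt_v")))
  .drop("b", "ex_b")

// Aliased shape: extract only the needed leaves of `b` below the explodes, so the
// operators above see three small columns and the rest of `b` can be pruned away.
val aliased = spark.read.json((testJson :: Nil).toDS())
  .select($"b.data.ex2".alias("_ex2"), $"b.fa".alias("_fa"), $"b.v".alias("_v"))
  .withColumn("ex_b", explode($"_ex2"))
  .withColumn("ex_b2", explode($"ex_b"))
  .withColumn("rt", struct($"_fa".alias("rt_fa"), $"_v".alias("rt_v")))
  .drop("_ex2", "_fa", "_v", "ex_b")
```

Both DataFrames end up with the same columns (ex_b2, rt); the difference is that in the second shape nothing above the bottom projection ever references the full struct `b`.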
We may need to add a pruning test case to make sure pruning still works.
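For reference, a rough sketch of what such a pruning regression test could look like. This is hypothetical and not part of this PR: it assumes a suite mixing in QueryTest with SharedSparkSession (so withTempPath, checkAnswer, testImplicits and explode are in scope) plus imports of org.apache.spark.sql.Row and org.apache.spark.sql.execution.FileSourceScanExec, and the data and column names are made up.

```scala
test("nested fields that are not referenced above a Generate are still pruned") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // One row with a struct column s = { items: [{a, b}], unused }.
    spark.range(1)
      .selectExpr(
        "named_struct('items', array(named_struct('a', 1, 'b', 'x')), 'unused', 'u') AS s")
      .write.parquet(path)

    // Only s.items (and then item.a) is referenced above the explode.
    val q = spark.read.parquet(path)
      .select(explode($"s.items").as("item"))
      .select($"item.a")

    // If nested column pruning still works, the file scan should not read s.unused.
    val scan = q.queryExecution.executedPlan.collectFirst {
      case f: FileSourceScanExec => f
    }.get
    assert(!scan.requiredSchema.catalogString.contains("unused"))
    checkAnswer(q, Row(1) :: Nil)
  }
}
```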
I think this change does not address the real issue. The real issue is that Generate contains a list, unrequiredChildIndex, of child output indices that are not needed in the Generate output. This list has to be adjusted to match the Project node inserted by NestedColumnAliasing. Here it only fits accidentally, because the original child output is included in full at the beginning of the new Project's list. But that may retain unnecessary outputs that ColumnPruning is trying to avoid. I have a different proposal that adjusts the list of indices to point to the new positions after attribute aliasing: #49061
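For readers following along, here is a heavily simplified, standalone sketch of the kind of index adjustment that proposal describes. It uses a made-up Attr stand-in instead of Catalyst's Attribute/ExprId types and is not the actual implementation in #49061: each entry of unrequiredChildIndex indexes the old child output and has to be translated, by attribute identity rather than by position, into an index into the newly inserted Project's output.

```scala
object UnrequiredIndexRemapSketch {
  // Simplified stand-in for a Catalyst attribute: identified by a stable id
  // (Catalyst uses ExprId for this), not by its position in the output.
  final case class Attr(id: Long, name: String)

  // Hypothetical helper: translate Generate.unrequiredChildIndex, which indexes into
  // the old child output, into indices into the newly inserted Project's output.
  def remapUnrequiredChildIndex(
      unrequiredChildIndex: Seq[Int],
      oldChildOutput: Seq[Attr],
      newProjectOutput: Seq[Attr]): Seq[Int] = {
    val newPositionById = newProjectOutput.map(_.id).zipWithIndex.toMap
    unrequiredChildIndex.flatMap { oldIdx =>
      // If the unrequired attribute no longer appears in the new project list
      // (e.g. it was aliased away), it simply has no index to carry over.
      newPositionById.get(oldChildOutput(oldIdx).id)
    }
  }

  def main(args: Array[String]): Unit = {
    // Old child output (a, b, c) with c unrequired; the new Project reorders to (c, a),
    // so the unrequired index 2 must be rewritten to 0.
    val oldOut = Seq(Attr(1, "a"), Attr(2, "b"), Attr(3, "c"))
    val newOut = Seq(Attr(3, "c"), Attr(1, "a"))
    assert(remapUnrequiredChildIndex(Seq(2), oldOut, newOut) == Seq(0))
    println("remapped: " + remapUnrequiredChildIndex(Seq(2), oldOut, newOut))
  }
}
```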