Commit 29da8d6

c21 authored and dongjoon-hyun committed
[SPARK-34796][SQL][3.1] Initialize counter variable for LIMIT code-gen in doProduce()
This PR fixes the LIMIT code-gen bug in https://issues.apache.org/jira/browse/SPARK-34796, where the counter variable from `BaseLimitExec` is used in generated code without being initialized. The limit counter variable is used by upstream operators (LIMIT's child plan, e.g. the `ColumnarToRowExec` operator, for early termination), but in the same stage there can be operators that take a shortcut and do not call `BaseLimitExec`'s `doConsume()`, e.g. [HashJoin.codegenInner](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L402). So if a query has `LocalLimit - BroadcastHashJoin - FileScan` in the same stage, whole-stage code-gen compilation fails.

Here is an example:

```
test("failed limit query") {
  withTable("left_table", "empty_right_table", "output_table") {
    spark.range(5).toDF("k").write.saveAsTable("left_table")
    spark.range(0).toDF("k").write.saveAsTable("empty_right_table")

    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
      spark.sql("CREATE TABLE output_table (k INT) USING parquet")
      spark.sql(
        s"""
           |INSERT INTO TABLE output_table
           |SELECT t1.k FROM left_table t1
           |JOIN empty_right_table t2
           |ON t1.k = t2.k
           |LIMIT 3
           |""".stripMargin)
    }
  }
}
```

Query plan:

```
Execute InsertIntoHadoopFsRelationCommand file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table, false, Parquet, Map(path -> file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table), Append, CatalogTable(
Database: default
Table: output_table
Created Time: Thu Mar 18 21:46:26 PDT 2021
Last Access: UNKNOWN
Created By: Spark 3.2.0-SNAPSHOT
Type: MANAGED
Provider: parquet
Location: file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQuerySuite/output_table
Schema: root
 |-- k: integer (nullable = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndexb25d08b, [k]
+- *(3) Project [ansi_cast(k#228L as int) AS k#231]
   +- *(3) GlobalLimit 3
      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=apache#179]
         +- *(2) LocalLimit 3
            +- *(2) Project [k#228L]
               +- *(2) BroadcastHashJoin [k#228L], [k#229L], Inner, BuildRight, false
                  :- *(2) Filter isnotnull(k#228L)
                  :  +- *(2) ColumnarToRow
                  :     +- FileScan parquet default.left_table[k#228L] Batched: true, DataFilters: [isnotnull(k#228L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:bigint>
                  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=apache#173]
                     +- *(1) Filter isnotnull(k#229L)
                        +- *(1) ColumnarToRow
                           +- FileScan parquet default.empty_right_table[k#229L] Batched: true, DataFilters: [isnotnull(k#229L)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/chengsu/spark/sql/core/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:bigint>
```

Codegen failure - https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 .

The uninitialized variable `_limit_counter_1` from `LocalLimitExec` is referenced in `ColumnarToRowExec`, but `BroadcastHashJoinExec` does not call `LocalLimitExec.doConsume()` to initialize the counter variable.

The fix is to move the counter variable initialization to `doProduce()`, because in the whole-stage code-gen framework an operator's `doProduce()` is always called if any upstream operator's `doProduce()`/`doConsume()` is called.

Note: this only happens when AQE is disabled. With AQE enabled, the optimization rule [EliminateUnnecessaryJoin](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateUnnecessaryJoin.scala#L69) changes the whole query to an empty `LocalRelation` if the broadcast side of an inner join is empty.

This fixes the query failure. There is no user-facing change.

Added a unit test in `SQLQuerySuite.scala`.

Closes apache#31911 from c21/limit-fix-3.1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
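The failure mode described in the commit message can be illustrated with a toy model of the produce/consume protocol. This is only a sketch under stated assumptions: `Ctx`, `Op`, `Scan`, `Join`, `Limit`, and `compiles` are hypothetical names invented for illustration and are not Spark classes, and the join is assumed to skip its parent's consume step when its build side is empty, as in the example query.

```scala
import scala.collection.mutable

// Toy model only: none of these types exist in Spark.
object LimitCodegenSketch {
  // Stand-in for a code-gen context that tracks declared member variables.
  final class Ctx {
    val declared = mutable.LinkedHashSet.empty[String]
    def addMutableState(name: String): Unit = declared += name
  }

  sealed trait Op {
    def produce(ctx: Ctx, parent: Option[Op]): String
    def consume(ctx: Ctx): String
  }

  val counter = "_limit_counter_1"

  // Leaf operator: references the limit counter for early termination.
  final class Scan extends Op {
    def produce(ctx: Ctx, parent: Option[Op]): String =
      s"while ($counter < 3) { ${parent.map(_.consume(ctx)).getOrElse("")} }"
    def consume(ctx: Ctx): String = ""
  }

  // Join that takes a shortcut (assumption: when the build side is empty)
  // and never calls its parent's consume().
  final class Join(child: Op, buildSideEmpty: Boolean) extends Op {
    private var parent: Option[Op] = None
    def produce(ctx: Ctx, p: Option[Op]): String = {
      parent = p
      child.produce(ctx, Some(this))
    }
    def consume(ctx: Ctx): String =
      if (buildSideEmpty) "/* empty build side: emit nothing */"
      else parent.map(_.consume(ctx)).getOrElse("")
  }

  // Limit that declares its counter either in produce (fixed) or consume (buggy).
  final class Limit(child: Op, declareInProduce: Boolean) extends Op {
    def produce(ctx: Ctx, p: Option[Op]): String = {
      if (declareInProduce) ctx.addMutableState(counter)
      child.produce(ctx, Some(this))
    }
    def consume(ctx: Ctx): String = {
      if (!declareInProduce) ctx.addMutableState(counter)
      s"$counter += 1;"
    }
  }

  // True when the counter is declared whenever the generated body references it.
  def compiles(declareInProduce: Boolean, buildSideEmpty: Boolean): Boolean = {
    val ctx = new Ctx
    val plan = new Limit(new Join(new Scan, buildSideEmpty), declareInProduce)
    val body = plan.produce(ctx, None)
    !body.contains(counter) || ctx.declared.contains(counter)
  }
}
```

In this model, `compiles(declareInProduce = false, buildSideEmpty = true)` is false because the scan loop references a counter that was never declared, while declaring the counter in `produce` makes the check pass regardless of what the join does.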
1 parent e640212 commit 29da8d6

2 files changed

Lines changed: 27 additions & 4 deletions

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

Lines changed: 8 additions & 4 deletions
@@ -122,14 +122,18 @@ trait BaseLimitExec extends LimitExec with CodegenSupport {
   }
 
   protected override def doProduce(ctx: CodegenContext): String = {
-    child.asInstanceOf[CodegenSupport].produce(ctx, this)
-  }
-
-  override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
     // The counter name is already obtained by the upstream operators via `limitNotReachedChecks`.
     // Here we have to inline it to not change its name. This is fine as we won't have many limit
     // operators in one query.
+    //
+    // Note: create counter variable here instead of `doConsume()` to avoid compilation error,
+    // because upstream operators might not call `doConsume()` here
+    // (e.g. `HashJoin.codegenInner()`).
     ctx.addMutableState(CodeGenerator.JAVA_INT, countTerm, forceInline = true, useFreshName = false)
+    child.asInstanceOf[CodegenSupport].produce(ctx, this)
+  }
+
+  override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
     s"""
        | if ($countTerm < $limit) {
        |   $countTerm += 1;

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Lines changed: 19 additions & 0 deletions
@@ -3950,6 +3950,25 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       }
     }
   }
+
+  test("SPARK-34796: Avoid code-gen compilation error for LIMIT query") {
+    withTable("left_table", "empty_right_table", "output_table") {
+      spark.range(5).toDF("k").write.saveAsTable("left_table")
+      spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
+
+      withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
+        spark.sql("CREATE TABLE output_table (k INT) USING parquet")
+        spark.sql(
+          """
+            |INSERT INTO TABLE output_table
+            |SELECT t1.k FROM left_table t1
+            |JOIN empty_right_table t2
+            |ON t1.k = t2.k
+            |LIMIT 3
+          """.stripMargin)
+      }
+    }
+  }
 }
 
 case class Foo(bar: Option[String])
