[SPARK-28213][SQL] Replace ColumnarBatchScan with equivalent from Columnar #25008
Conversation
add to whitelist
ok to test
Test build #107027 has finished for PR 25008 at commit
Retest this please.
}
- override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(inputRDD)
+ // override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(inputRDD)
Shall we remove this cleanly?
val df = sql("SELECT * FROM a WHERE p <= (SELECT MIN(id) FROM b)")
checkAnswer(df, Seq(Row(0, 0), Row(2, 0)))
// need to execute the query before we can examine fs.inputRDDs()
df.explain
Shall we clean this up?
// Push predicate to the cached table.
val df2 = df1.where("y = 3")
logWarning(s"ORIG QUERY PLAN:\n${df2.queryExecution.executedPlan}")
Ur, shall we clean up these three logWarnings in this test suite? They will never be read.
Test build #107147 has finished for PR 25008 at commit
Test build #107367 has finished for PR 25008 at commit
Test build #107407 has finished for PR 25008 at commit
abellina left a comment
Overall it makes sense. I just had some high-level questions.
buffers
  .map(createAndDecompressColumn(_, offHeapColumnVectorEnabled))
  .map(b => {
    numOutputRows += b.numRows()
should numOutputRows be max(numRows)?
No, because b is a ColumnarBatch, so we are iterating over possibly multiple batches, not over individual columns.
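To make that concrete, here is a minimal sketch (a hypothetical helper, not the actual InMemoryTableScanExec code) of why `+=` rather than `max` is the right aggregation:

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

// Each element of the iterator is a whole ColumnarBatch; numRows() is the
// row count of that one batch regardless of its column count, so summing
// across batches yields the total number of output rows.
def countOutputRows(batches: Iterator[ColumnarBatch]): Long = {
  var numOutputRows = 0L
  batches.foreach { b =>
    numOutputRows += b.numRows()
  }
  numOutputRows
}
```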
ah right, missed that. Thanks
// need to execute the query before we can examine fs.inputRDDs()
assert(df.queryExecution.executedPlan match {
- case WholeStageCodegenExec(fs @ FileSourceScanExec(_, _, _, partitionFilters, _, _, _)) =>
+ case WholeStageCodegenExec(ColumnarToRowExec(InputAdapter(
OK, so ColumnarToRowExec is here because InputAdapter may support columnar, right? Why is InputAdapter getting added here?
WholeStageCodegenExec marks the end of a code generation stage. InputAdapter marks the beginning of a code generation stage. So what we had before was a WholeStageCodegenExec whose first entry was a FileSourceScanExec, because before this change FileSourceScanExec supported code generation to convert ColumnarBatches into rows. The InputAdapter would logically have been a child of FileSourceScanExec, but FileSourceScanExec has no children, so it is not there.
After this change ColumnarToRowExec is the only thing in the code generation stage, so it is flanked by the WholeStageCodegenExec and the InputAdapter. FileSourceScanExec returns batches and no longer does code gen because it is not needed any longer.
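To illustrate, a hedged sketch of a matcher for the new plan shape (the helper name is made up; the shape follows the updated test above):

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, FileSourceScanExec,
  InputAdapter, SparkPlan, WholeStageCodegenExec}

// After this change the scan emits batches outside the codegen stage, so the
// stage reads: WholeStageCodegenExec(ColumnarToRowExec(InputAdapter(scan))).
def isNewScanShape(plan: SparkPlan): Boolean = plan match {
  case WholeStageCodegenExec(
      ColumnarToRowExec(InputAdapter(_: FileSourceScanExec))) => true
  case _ => false
}
```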
assert(ds.collect() === Array(("a", 10.0), ("b", 3.0), ("c", 1.0)))
}

test("cache for primitive type should be in WholeStageCodegen with InMemoryTableScanExec")
Not clear to me why this test doesn't apply anymore.
It was verifying that InMemoryTableScanExec was inside of a WholeStageCodegen phase, but after this change InMemoryTableScanExec no longer supports codegen, so the test is invalid. ColumnarToRowExec is what will be in the codegen section instead.
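A sketch of what an updated assertion could check instead (illustrative only, not necessarily the replacement test):

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, InputAdapter, WholeStageCodegenExec}
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

val df = spark.range(10).toDF("id").cache()
df.count() // materialize the cache
// The codegen stage now starts at ColumnarToRowExec; InMemoryTableScanExec
// feeds it columnar batches from outside the stage.
val hasExpectedShape = df.queryExecution.executedPlan.collectFirst {
  case WholeStageCodegenExec(
      ColumnarToRowExec(InputAdapter(_: InMemoryTableScanExec))) => true
}.isDefined
assert(hasExpectedShape)
```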
tgravescs left a comment
Changes look good.
dataFilters: Seq[Expression],
override val tableIdentifier: Option[TableIdentifier])
- extends DataSourceScanExec with ColumnarBatchScan {
+ extends DataSourceScanExec {
This change makes all DataSourceScanExec nodes not support codegen, right?
Correct, but there were only two things that the code gen was doing: converting ColumnarBatches into UnsafeRows, or converting whatever other rows were being returned by the DataSourceScanExec into UnsafeRows. The ColumnarBatch conversion is now covered by ColumnarToRowExec. The row-to-row conversion is covered by UnsafeProjections that are either inserted as a part of this patch or were already in the code, so previously we ended up doing a double conversion.
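For reference, a tiny sketch of that row-to-row path — an UnsafeProjection turning arbitrary InternalRows into UnsafeRows (the schema here is made up):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Illustrative schema; the real projections are built from the scan's output.
val schema = StructType(Seq(StructField("id", IntegerType)))

def toUnsafeRows(rows: Iterator[InternalRow]): Iterator[UnsafeRow] = {
  val proj = UnsafeProjection.create(schema) // one instance per partition
  rows.map(r => proj(r)) // the projection reuses an internal row buffer
}
```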
})
}

/**
ColumnarToRowExec's comment also mentions ColumnarBatchScan. If you would like to remove all references to ColumnarBatchScan...
Great catch. I thought I got rid of all of them. Will grep through again.
import org.apache.spark.sql.types._
import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
It would be good to avoid unnecessary changes like this.
Test build #107467 has finished for PR 25008 at commit
tgravescs left a comment
+1
…umnar

What changes were proposed in this pull request?
This is the second part of https://issues.apache.org/jira/browse/SPARK-27396 and a follow-on to apache#24795.

How was this patch tested?
I did some manual tests and ran/updated the automated tests. I did some simple performance tests on a single node to try to verify that there is no performance impact, and I was not able to measure anything beyond noise.

Closes apache#25008 from revans2/columnar-remove-batch-scan.
Authored-by: Robert (Bobby) Evans <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
protected def stripSparkFilter(df: DataFrame): DataFrame = {
  val schema = df.schema
- val withoutFilters = df.queryExecution.sparkPlan.transform {
+ val withoutFilters = df.queryExecution.executedPlan.transform {
Why make this change in this PR?
It seems the existing columnar logic within each plan has moved to RowToColumnarExec and ColumnarToRowExec to deduplicate it, but it now depends on the ApplyColumnarRulesAndInsertTransitions rule, which requires execution preparation (QueryExecution.preparations).
However, per the doc, executedPlan should ideally be used only for execution. It would have been best to avoid this. @revans2, even though it's too late, can you please describe what this PR fixes in the PR description (presumably by listing each item)?
I have no idea what this PR fixes from reading the PR description and JIRA.
This PR replaces ColumnarBatchScan with ColumnarToRowExec. This involved changes to all subclasses of ColumnarBatchScan and to AdaptiveSparkPlanExec so it would also execute the needed columnar transition rules. I also made any fixes needed for tests. I preferred to keep the changes as small as possible for the tests, which is why I made a small change here in a test utility class.
The issue was that some tests were directly executing the plan returned by this function, which used to work for some very limited use cases but did not work in all cases. I am happy to try to fix issues with this approach for the tests; just let me know the correct way to do it.
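For context, a hedged reconstruction of the utility after this change, based on the diff above (internalCreateDataFrame is a private[sql] helper, so this compiles only inside the org.apache.spark.sql package):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.FilterExec

// Strip FilterExec nodes from the *executed* plan, so the transitions added
// by ApplyColumnarRulesAndInsertTransitions during QueryExecution.preparations
// are already in place and the stripped plan can be executed directly.
def stripSparkFilter(spark: SparkSession, df: DataFrame): DataFrame = {
  val schema = df.schema
  val withoutFilters = df.queryExecution.executedPlan.transform {
    case FilterExec(_, child) => child
  }
  spark.internalCreateDataFrame(withoutFilters.execute(), schema)
}
```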
I don't know the correct way to fix it for now - it needs some investigation. Can you clarify why it didn't work well?
The problem seems to come from the internal behaviour changes - we now rely on ApplyColumnarRulesAndInsertTransitions. Can we investigate to confirm that it doesn't affect anything, and list the changes made in this PR description?
Yes that is exactly what it is.
// and ColumnarToRow transformations in the middle of it, but they will not have the tag
// we want, so skip them if they are the first thing we see
private def isScanPlanTree(plan: SparkPlan, first: Boolean): Boolean = plan match {
  case i: InputAdapter if !first => isScanPlanTree(i.child, false)
Why does InputAdapter pop up here?
Prior to this PR, a subclass of ColumnarBatchScan would be in a plan that looked like:
INPUT (subclass of ColumnarBatchScan) -> (code gen supported nodes) -> WholeStageCodegenExec -> ...
After this change, it now looks like:
INPUT (not a subclass of ColumnarBatchScan) -> InputAdapter -> ColumnarToRowExec -> (code gen supported nodes) -> WholeStageCodegenExec -> ...
Because the INPUT class no longer supports code generation, the code generation rule will insert an InputAdapter after it and before the ColumnarToRowExec, which does support code generation.
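A hedged sketch of that recursion: the InputAdapter case mirrors the diff above, while the other cases are assumptions for illustration:

```scala
import org.apache.spark.sql.execution.{ColumnarToRowExec, InputAdapter,
  LeafExecNode, SparkPlan}

// InputAdapter and ColumnarToRowExec may now sit between the codegen stage
// boundary and the scan, so step over them when they are not the first node.
def isScanPlanTree(plan: SparkPlan, first: Boolean): Boolean = plan match {
  case i: InputAdapter if !first => isScanPlanTree(i.child, first = false)
  case c: ColumnarToRowExec if !first => isScanPlanTree(c.child, first = false)
  case _: LeafExecNode => true // assumed terminal case
  case _ => false
}
```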
Hey, I see that the core logic itself is deduplicated fine without changing the existing code, but it seems to change other things too. Let's be clear about what a PR fixes next time - I thought this was just a simple refactoring but now realise that it is actually pretty invasive.
What changes were proposed in this pull request?
This is the second part of https://issues.apache.org/jira/browse/SPARK-27396 and a follow-on to #24795.
How was this patch tested?
I did some manual tests and ran/updated the automated tests.
I did some simple performance tests on a single node to try to verify that there is no performance impact, and I was not able to measure anything beyond noise.