[SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables #30403
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
// CACHE TABLE ... AS SELECT creates a temp view with the input query.
// Thus, use the identifier in UnresolvedTableOrView directly.
case CacheTable(u: UnresolvedTableOrView, plan, isLazy, options) if plan.isDefined =>
  CacheTableCommand(u.multipartIdentifier.asTableIdentifier, plan, isLazy, options)
@cloud-fan Please let me know what you think about having UnresolvedTableOrView here to eagerly use the identifier if plan is defined. Another approach is to have a separate rule to handle CacheTable(u: UnresolvedTableOrView, ...).
This is more like CTAS, and the table should be just a Seq[String], not a LogicalPlan.
How about we have both CacheTable(table: LogicalPlan, ...) and CacheTableAsSelect(tempViewName: String, ...)?
Good idea
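A minimal Scala sketch of the split being discussed (the class shapes and field names here are illustrative assumptions, not the merged API):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// Plain CACHE TABLE references an existing table/view, so the analyzer can
// resolve it; the AS SELECT variant is CTAS-like and only carries a name.
case class CacheTable(
    table: LogicalPlan,             // resolved by the analyzer
    isLazy: Boolean,
    options: Map[String, String]) extends Command

case class CacheTableAsSelect(
    tempViewName: String,           // the view does not exist yet
    plan: LogicalPlan,              // the input query to cache
    isLazy: Boolean,
    options: Map[String, String]) extends Command
```

Keeping the AS SELECT name unresolved mirrors CTAS, which avoids resolving a table that does not exist yet.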
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131246 has finished for PR 30403 at commit
sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
Kubernetes integration test starting
Kubernetes integration test status success
 * The logical plan for the no-op command handling a non-existing table.
 */
-case class NoopDropTable(multipartIdentifier: Seq[String]) extends Command
+case class NoopCommand(multipartIdentifier: Seq[String]) extends Command
We should probably add a commandName: String property, which can be DROP TABLE, REFRESH TABLE, etc., so that we can see the original command name in the EXPLAIN result.
Added commandName. Now EXPLAIN EXTENDED DROP TABLE looks like the following:
== Parsed Logical Plan ==
'DropTable true, false
+- 'UnresolvedTableOrView [testcat, ns1, ns2, tbl], true
== Analyzed Logical Plan ==
NoopCommand DROP TABLE, [testcat, ns1, ns2, tbl]
== Optimized Logical Plan ==
NoopCommand DROP TABLE, [testcat, ns1, ns2, tbl]
== Physical Plan ==
LocalTableScan <empty>
Btw, do we want to introduce NoopCommandExec for physical plan as well?
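A sketch of NoopCommand with the suggested commandName field (illustrative; the merged code may differ):

```scala
import org.apache.spark.sql.catalyst.plans.logical.Command

// A no-op logical plan for commands on non-existing tables (e.g. with
// IF EXISTS). commandName ("DROP TABLE", "REFRESH TABLE", ...) makes the
// original command visible in EXPLAIN output.
case class NoopCommand(
    commandName: String,
    multipartIdentifier: Seq[String]) extends Command
```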
throw new AnalysisException("SHOW CREATE TABLE is not supported for v2 tables.")

case CacheTable(_: ResolvedTable, _, _, _) =>
  throw new AnalysisException("CACHE TABLE is not supported for v2 tables.")
We don't need new v2 APIs to support it. This command touches CacheManager which is Spark internal.
Ah OK. An existing bug, I guess? (it only supported temp view / v1 tables).
Does it make sense to match case CacheTable(_: ResolvedTable, _, _, _) in ResolveSessionCatalog (seems weird) or should we match it in DataSourceV2Strategy with a new CacheTableExec similar to CacheTableCommand?
We can add a CacheTableExec as a v2 version of CacheTableCommand.
Added v2 version.
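A hedged sketch of what such a CacheTableExec might look like; it assumes access to Spark-internal APIs (V2CommandExec, CacheManager, Dataset.ofRows), and the actual field list may differ:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.storage.StorageLevel

case class CacheTableExec(
    relation: LogicalPlan,
    tableName: Option[String],
    isLazy: Boolean,
    storageLevel: StorageLevel) extends V2CommandExec {

  override def run(): Seq[InternalRow] = {
    val session = sqlContext.sparkSession
    val df = Dataset.ofRows(session, relation)
    session.sharedState.cacheManager.cacheQuery(df, tableName, storageLevel)
    if (!isLazy) {
      // Materialize the cache eagerly, matching CacheTableCommand's behavior.
      df.count()
    }
    Seq.empty
  }

  override def output: Seq[Attribute] = Nil
}
```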
  parseTempViewOrV1Table(tbl, "CACHE TABLE")
}
CacheTableCommand(name.asTableIdentifier, plan, isLazy, options)
// CACHE TABLE ... AS SELECT creates a temp view with the input query.
What's the behavior if the temp view already exists? Overwrite?
It would fail with:
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary view 't' already exists;
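For example, assuming a live SparkSession named spark:

```scala
spark.sql("CREATE TEMPORARY VIEW t AS SELECT 1 AS id")

// The CTAS-style cache tries to create temp view 't' again and fails:
spark.sql("CACHE TABLE t AS SELECT 2 AS id")
// org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException:
// Temporary view 't' already exists;
```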
Test build #131261 has finished for PR 30403 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131460 has finished for PR 30403 at commit
Kubernetes integration test starting
Test build #131461 has finished for PR 30403 at commit
Kubernetes integration test status success
cc @sunchao
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Test build #131793 has finished for PR 30403 at commit
sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
Test build #131799 has finished for PR 30403 at commit
Test build #131804 has finished for PR 30403 at commit
Test build #131818 has finished for PR 30403 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropTableExec.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala
val multipartIdentifier =
  sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName)
val cascade = (multipartIdentifier.length <= 2) &&
  !sessionCatalog.isTemporaryTable(multipartIdentifier.asTableIdentifier)
can we add an overload of isTemporaryTable that takes Seq[String]?
Looks like the overload already exists as isTempView. :)
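With that overload, the cascade computation above could drop the conversion (a sketch against the snippet shown earlier):

```scala
// isTempView accepts the multi-part name directly, so no
// asTableIdentifier conversion is needed:
val cascade = (multipartIdentifier.length <= 2) &&
  !sessionCatalog.isTempView(multipartIdentifier)
```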
Test build #131825 has finished for PR 30403 at commit
Test build #131828 has finished for PR 30403 at commit
Test build #131888 has finished for PR 30403 at commit
}

-private def testNotSupportedV2Command(sqlCommand: String, sqlParams: String): Unit = {
+private def testNotSupportedV2Command(
unnecessary change. This is minor and let's fix it in your next PR.
Ok, will fix.
thanks, merging to master!
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
import org.apache.spark.storage.StorageLevel

case class CacheTableCommand(
The next thing we can do is to refactor it using the v2 framework (not adding a v2 version). The benefits are: 1. moving the logical plan to catalyst; 2. resolving the table in the analyzer. e.g.

CacheTable(UnresolvedRelation(...), ...)
...
case class CacheTableExec(relation: LogicalPlan) {
  def run() {
    val df = Dataset.ofRows(spark, relation)
    ...
  }
}
OK, will do.
One issue I am encountering by moving to the v2 framework (for v2 tables) is the following.
When CACHE TABLE testcat.tbl is run, tbl is changed from DataSourceV2Relation to DataSourceV2ScanRelation by the V2ScanRelationPushDown rule, now that the plan goes through the analyzer, optimizer, etc. But if I run spark.table("testcat.tbl"), the query execution has tbl as DataSourceV2Relation, so the cache is not applied.
ah, one solution is to follow InsertIntoStatement and not make the table a child. Then we resolve the UnresolvedRelation inside CacheTable manually in ResolveTempViews and the other resolution rules.
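A sketch of that leaf-style approach (names illustrative): the table plan is held as a field but not exposed as a child, so optimizer rules such as V2ScanRelationPushDown never rewrite it, and resolution rules must handle it explicitly:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

case class CacheTable(
    table: LogicalPlan,             // e.g. UnresolvedRelation, resolved manually
    isLazy: Boolean,
    options: Map[String, String]) extends Command {
  // Mirroring InsertIntoStatement: `table` is not a child, so generic
  // analyzer/optimizer transforms skip it.
  override def children: Seq[LogicalPlan] = Nil
}
```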
### What changes were proposed in this pull request?
This changes DSv2 refresh table semantics to also recache the target table itself.

### Why are the changes needed?
Currently "REFRESH TABLE" in DSv2 only invalidates all caches referencing the table. With #30403 merged, which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1.

### Does this PR introduce _any_ user-facing change?
Yes, refreshing a table in DSv2 now also recaches the target table itself.

### How was this patch tested?
Added coverage of this new behavior in the existing UT for the v2 refresh table command.

Closes #30742 from sunchao/SPARK-33653.
Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
This is a backport of #30742 for branch-3.1.

### What changes were proposed in this pull request?
This changes DSv2 refresh table semantics to also recache the target table itself.

### Why are the changes needed?
Currently "REFRESH TABLE" in DSv2 only invalidates all caches referencing the table. With #30403 merged, which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1.

### Does this PR introduce _any_ user-facing change?
Yes, refreshing a table in DSv2 now also recaches the target table itself.

### How was this patch tested?
Added coverage of this new behavior in the existing UT for the v2 refresh table command.

Closes #30769 from sunchao/SPARK-33653-branch-3.1.
Authored-by: Chao Sun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to support CACHE/UNCACHE TABLE commands for v2 tables.

In addition, this PR proposes to migrate CACHE/UNCACHE TABLE to use UnresolvedTableOrView to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in the JIRA or the proposal doc.

### Why are the changes needed?
To support CACHE/UNCACHE TABLE commands for v2 tables.

Note that CACHE/UNCACHE TABLE for v1 tables/views goes through SparkSession.table to resolve the identifier, which resolves temp views first, so there is no change in behavior by moving to the new framework.

### Does this PR introduce any user-facing change?
Yes. Now the user can run CACHE/UNCACHE TABLE commands on v2 tables.

### How was this patch tested?
Added/updated existing tests.