[SPARK-17729] [SQL] Enable creating hive bucketed tables #17644

tejasapatil · 2017-04-15T23:45:34Z

What changes were proposed in this pull request?

Hive allows inserting data to bucketed table without guaranteeing bucketed and sorted-ness based on these two configs : hive.enforce.bucketing and hive.enforce.sorting.

What does this PR achieve ?

Spark will disallow users from writing outputs to hive bucketed tables by default (given that output won't adhere with Hive's semantics).
IF user still wants to write to hive bucketed table, the only resort is to use hive.enforce.bucketing=false and hive.enforce.sorting=false which means user does NOT care about bucketing guarantees.

Changes done in this PR:

Extract table's bucketing information in HiveClientImpl
While writing table info to metastore, HiveClientImpl now populates the bucketing information in the hive Table object
InsertIntoHiveTable allows inserts to bucketed table only if both hive.enforce.bucketing and hive.enforce.sorting are false

Ability to create bucketed tables will enable adding test cases to Spark while I add more changes related to hive bucketing support. Design doc for hive hive bucketing support : https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit#

How was this patch tested?

Added test for creating bucketed and sorted table.
Added test to ensure that INSERTs fail if strict bucket / sort is enforced
Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced
Added test to validate that bucketing information shows up in output of DESC FORMATTED
Added test to ensure that SHOW CREATE TABLE works for hive bucketed tables

SparkQA · 2017-04-16T01:10:49Z

Test build #75829 has finished for PR 17644 at commit 0348a96.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-04-16T19:13:48Z

Test build #75838 has finished for PR 17644 at commit fc04991.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-04-16T19:45:39Z

cc @cloud-fan @hvanhovell @sameeragarwal for review

cloud-fan · 2017-04-17T02:38:37Z

I'll review it after branch 2.2 is cut

SparkQA · 2017-04-29T06:22:31Z

Test build #76298 has started for PR 17644 at commit 1b78141.

tejasapatil · 2017-04-30T00:19:43Z

Jenkins test this please

SparkQA · 2017-04-30T02:42:47Z

Test build #76312 has finished for PR 17644 at commit 1b78141.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-05-01T14:47:35Z

@cloud-fan : ping !!

cloud-fan · 2017-05-04T14:15:30Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala

I don't think we need to do this, according to the document of ExternalCatalog.alterTableSchema, the caller will guarantee that the new schema still contains partition columns and bucket columns.

I have to do this because one of the unit tests was failing :

spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

Line 246 in 703c42c

test("alter table schema") {

In that test case, the table has a bucketing and alter removes the bucketing column. The way I interpret things: if the alter command is not removing the bucketing / sorting columns, we should keep the original bucketing spec. If any one of the bucketing / sorting columns are removed in alter, then the bucketing spec needs to be reset to empty (else we might refer to dangling columns).

does hive allow this? i.e. alter schema without bucket/sort columns.

Unfortunately YES. Below I alter a bucketed table to remove the bucketing column from its schema. Modified table still points to old bucket spec. This is bad.

hive> desc formatted bucketed_partitioned_1; # col_name data_type comment user_id bigint name string # Storage Information .... Num Buckets: 8 Bucket Columns: [user_id] Sort Columns: [Order(col:user_id, order:1)] hive> ALTER TABLE bucketed_partitioned_1 REPLACE COLUMNS ( eid INT, name STRING); OK hive> desc formatted bucketed_partitioned_1; # col_name data_type comment eid int name string # Storage Information ..... Num Buckets: 8 Bucket Columns: [user_id] Sort Columns: [Order(col:user_id, order:1)]

do you mean hive has a wrong behavior about this? Maybe we can diverge from hive and disallow altering schema without bucket/sort columns.

Yeah hive behavior is absurd. If the alter table statement removes bucketing columns, we should either disallow such operation (seems harsh) OR remove the bucketing spec (which is what I am doing in this PR).

... disallow altering schema without bucket/sort columns.

I am not sure if I understand this right. Does the current approach in this PR match this or you are proposing something else ?

let's disallow removing bucketing columns in SessionCatalog.alterTableSchema, to make the logic simplier, we can change the behavior later if we wanna support this.

Updated the PR with this change

cloud-fan · 2017-05-04T14:17:59Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

is it safe to do so? I think this PR only covers the write path?

I dont think this could cause problems (you can correct me if I am missing anything). Even if bucketing spec is set over here, this PR does not make use of that information in scan operators.

here we are reading hive table as a data source table, so the scan operator is actually FileSourceScanExec, which recognizes bucketing.

You are right. I have corrected this. Thanks for pointing this out !!

cloud-fan · 2017-05-04T14:20:20Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

why do we need to check this? Doesn't hive metastore already guarantees this?

I had added this for sanity. As you mentioned, hive metastore already guarantees this. I will remove this.

cloud-fan · 2017-05-04T14:23:05Z

A high-level question: if we propagate the bucketing information of a hive table, can we correctly read this table with Spark SQL? Shall we have a config to change Spark's shuffle hash function to use hive hash?

tejasapatil · 2017-05-04T18:10:32Z

@cloud-fan : Thanks for reviewing !!

if we propagate the bucketing information of a hive table, can we correctly read this table with Spark SQL?

With this PR, HiveTableScanExec still doesn't advertise the table to be bucketed so optimizer wont treat the data as bucketed. tldr: It would be read just like a regular table (as it happens now in trunk).

Shall we have a config to change Spark's shuffle hash function to use hive hash?

Yeah. We need to add that but I will do that in a separate PR. I had a PR from last year but given that there are many changes in SQL land, I couldn't rebase it. Am working on creating a new PR instead after studying the new changes since last year.

SparkQA · 2017-05-04T20:32:32Z

Test build #76462 has finished for PR 17644 at commit 6e6e767.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-04T21:28:31Z

Test build #76460 has finished for PR 17644 at commit e3c41bf.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

tejasapatil · 2017-05-05T23:33:15Z

Jenkins test this please

SparkQA · 2017-05-06T03:47:39Z

Test build #76506 has finished for PR 17644 at commit 6e6e767.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-05-06T05:59:00Z

Jenkins test this please

SparkQA · 2017-05-06T06:02:35Z

Test build #76515 has started for PR 17644 at commit 6e6e767.

tejasapatil · 2017-05-06T21:42:37Z

Jenkins test this please

SparkQA · 2017-05-07T00:08:18Z

Test build #76529 has finished for PR 17644 at commit 6e6e767.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-08T20:20:13Z

Test build #76588 has finished for PR 17644 at commit ab12943.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-05-08T22:04:45Z

Jenkins test this please

SparkQA · 2017-05-09T00:31:14Z

Test build #76595 has finished for PR 17644 at commit 7a0be86.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-12T22:44:52Z

Test build #76882 has finished for PR 17644 at commit 10a9fd0.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

…umns is not allowed

SparkQA · 2017-05-13T02:08:06Z

Test build #76884 has finished for PR 17644 at commit d6ce8b5.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-05-13T02:47:15Z

Test build #76887 has finished for PR 17644 at commit 0aa8539.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-13T14:40:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala


 package org.apache.spark.sql.catalyst.catalog

+import org.apache.spark.sql.AnalysisException


please remove these unnecessary changes in this PR.

cloud-fan · 2017-05-13T14:47:22Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

-        // We can not populate bucketing information for Hive tables as Spark SQL has a different
-        // implementation of hash function from Hive.
-        bucketSpec = None,
+        bucketSpec = bucketSpec,


please add a comment to say that, for data source tables, we will always overwrite the bucket spec in HiveExternalCatalog with the bucketing information in table properties.

cloud-fan · 2017-05-13T14:53:29Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala

+          throw new AnalysisException(message)
+        } else {
+          logWarning(message + s" Inserting data anyways since both $enforceBucketingConfig and " +
+            s"$enforceSortingConfig are set to false.")


shall we remove the bucket properties of the table in this case? what does hive do?

In Hive 1.x:

If enforcing configs are set: data is populated respecting the tables' bucketing spec

If enforcing configs are NOT set: data is populated and not conformant with the tables' bucketing spec

In Hive 2.x, these configs are not there and hive would always populate data conformant with table's bucketing.

With Spark, currently the data would be written out in a non conformant way despite of that config being set or not. This PR will go to the model of Hive 1.x. I am working on a next PR which would make things like Hive 2.0.

so after insertion(if not enforcing), the table is still a buckted table but read it will cause wrong result?

In hive: It would lead to wrong result.

In spark (over master and also after this PR): the table scan operation does not take bucketing into account so it would be read as a regular table. So, it won't be read "wrong", its just that we wont take advantage of bucketing.

SparkQA · 2017-05-14T18:23:38Z

Test build #76909 has finished for PR 17644 at commit efc8c8b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-14T18:27:35Z

Test build #76910 has finished for PR 17644 at commit bf306da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-15T08:11:51Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

-        // We can not populate bucketing information for Hive tables as Spark SQL has a different
-        // implementation of hash function from Hive.
-        bucketSpec = None,
+        // For data source tables, we will always overwrite the bucket spec in


Sorry I made a mistake. This is true even for hive tables.

If the table is written by Spark, we will put bucketing information in table properties, and will always overwrite the bucket spec in hive metastore by the bucketing information in table properties. This means, if we have bucket spec in both hive metastore and table properties, we will trust the one in table properties.

let's update this document.

cloud-fan · 2017-05-15T08:14:16Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

    }
+
+    table.bucketSpec match {
+      case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>


After thinking it more, I think this one is less important.

Our main goal is to allow spark to read bucketed tables written by hive, but not to allow hive to read bucketed tables written by spark.

How about we remove it for now and add it later after more discussion?

but not to allow hive to read bucketed tables written by spark.

If Hive is NOT able to read the datasource tables which are bucketed, thats OK as its not compatible with hive. But for hive native tables, the interoperability amongst spark and hive is what I want.

ok makes sense

cloud-fan · 2017-05-15T08:17:51Z

LGTM except 2 comments

SparkQA · 2017-05-15T17:01:59Z

Test build #76941 has finished for PR 17644 at commit 865711f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-05-15T17:47:38Z

thanks, merging to master!

## What changes were proposed in this pull request? Hive allows inserting data to bucketed table without guaranteeing bucketed and sorted-ness based on these two configs : `hive.enforce.bucketing` and `hive.enforce.sorting`. What does this PR achieve ? - Spark will disallow users from writing outputs to hive bucketed tables by default (given that output won't adhere with Hive's semantics). - IF user still wants to write to hive bucketed table, the only resort is to use `hive.enforce.bucketing=false` and `hive.enforce.sorting=false` which means user does NOT care about bucketing guarantees. Changes done in this PR: - Extract table's bucketing information in `HiveClientImpl` - While writing table info to metastore, `HiveClientImpl` now populates the bucketing information in the hive `Table` object - `InsertIntoHiveTable` allows inserts to bucketed table only if both `hive.enforce.bucketing` and `hive.enforce.sorting` are `false` Ability to create bucketed tables will enable adding test cases to Spark while I add more changes related to hive bucketing support. Design doc for hive hive bucketing support : https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit# ## How was this patch tested? - Added test for creating bucketed and sorted table. - Added test to ensure that INSERTs fail if strict bucket / sort is enforced - Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced - Added test to validate that bucketing information shows up in output of DESC FORMATTED - Added test to ensure that `SHOW CREATE TABLE` works for hive bucketed tables Author: Tejas Patil <[email protected]> Closes apache#17644 from tejasapatil/SPARK-17729_create_bucketed_table.

cloud-fan · 2017-11-02T11:43:10Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala

    val tbl1 = catalog.getTable("db2", "tbl1")
    val newSchema = StructType(Seq(
-      StructField("new_field_1", IntegerType),
+      StructField("col1", IntegerType),


@tejasapatil do you still remember why we update it?

yes. This was done because before this PR the test case was removing bucketed columns in the alter operation. We decided to disallow removing bucketing columns and support this if needed in future.

Here is the discussion that we both had about this : #17644 (comment)

Huh. If I revert the change to this test case, it does not fail anymore. This is bad because the table properties still say that its bucketed over col1 but col1 is not in the modified table schema. I am taking a look to see what changed.

tejasapatil mentioned this pull request Apr 15, 2017

[SPARK-17729] [SQL] Enable creating hive bucketed tables #15300

Closed

cloud-fan reviewed May 4, 2017

View reviewed changes

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch 2 times, most recently from e3c41bf to 6e6e767 Compare May 4, 2017 18:12

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from 6e6e767 to ab12943 Compare May 8, 2017 18:57

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from ab12943 to 7a0be86 Compare May 8, 2017 22:03

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from 7a0be86 to 49040e8 Compare May 9, 2017 03:43

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from 239beee to 10a9fd0 Compare May 12, 2017 20:26

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from 10a9fd0 to d6ce8b5 Compare May 12, 2017 23:41

tejasapatil added 6 commits May 12, 2017 17:17

[SPARK-17729] [SQL] Enable creating hive bucketed tables

3799d18

fixing failed test cases : HiveDDLSuite and HiveExternalCatalogSuite

6315dda

review comment apache#1

303f442

nonEmpty => isDefined

df3fd7a

check explicitly for hive tables

b2784ba

remove check as alterTableSchema ensures that dropping existing col…

0aa8539

…umns is not allowed

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from d6ce8b5 to 0aa8539 Compare May 13, 2017 00:17

cloud-fan reviewed May 13, 2017

View reviewed changes

review comment

bf306da

tejasapatil force-pushed the SPARK-17729_create_bucketed_table branch from efc8c8b to bf306da Compare May 14, 2017 15:57

cloud-fan reviewed May 15, 2017

View reviewed changes

fix documentation

865711f

asfgit closed this in d241692 May 15, 2017

tejasapatil deleted the SPARK-17729_create_bucketed_table branch May 15, 2017 22:29

cloud-fan reviewed Nov 2, 2017

View reviewed changes


		package org.apache.spark.sql.catalyst.catalog

		import org.apache.spark.sql.AnalysisException

[SPARK-17729] [SQL] Enable creating hive bucketed tables #17644

[SPARK-17729] [SQL] Enable creating hive bucketed tables #17644

Uh oh!

Conversation

tejasapatil commented Apr 15, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Apr 16, 2017

Uh oh!

SparkQA commented Apr 16, 2017

Uh oh!

tejasapatil commented Apr 16, 2017

Uh oh!

cloud-fan commented Apr 17, 2017

Uh oh!

SparkQA commented Apr 29, 2017

Uh oh!

tejasapatil commented Apr 30, 2017

Uh oh!

SparkQA commented Apr 30, 2017

Uh oh!

tejasapatil commented May 1, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented May 4, 2017

Uh oh!

tejasapatil commented May 4, 2017

Uh oh!

SparkQA commented May 4, 2017

Uh oh!

SparkQA commented May 4, 2017

Uh oh!

tejasapatil commented May 5, 2017

Uh oh!

SparkQA commented May 6, 2017

Uh oh!

tejasapatil commented May 6, 2017

Uh oh!

SparkQA commented May 6, 2017

Uh oh!

tejasapatil commented May 6, 2017

Uh oh!

SparkQA commented May 7, 2017

Uh oh!

SparkQA commented May 8, 2017

Uh oh!

tejasapatil commented May 8, 2017

Uh oh!

SparkQA commented May 9, 2017

tejasapatil commented Apr 15, 2017 •

edited

Loading

cloud-fan May 8, 2017 •

edited

Loading

cloud-fan May 15, 2017 •

edited

Loading