
Conversation

@wangyum wangyum commented Apr 29, 2019

What changes were proposed in this pull request?

Hive uses an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's Parquet bucketed data source tables.
Spark side:

spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN  HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC FORMATTED t;
c1	int	NULL
c2	int	NULL

# Detailed Table Information
Database	default
Table	t
Owner	yumwang
Created Time	Mon Apr 29 17:52:05 CST 2019
Last Access	Thu Jan 01 08:00:00 CST 1970
Created By	Spark 2.4.0
Type	MANAGED
Provider	parquet
Num Buckets	2
Bucket Columns	[`c1`]
Sort Columns	[`c1`]
Table Properties	[transient_lastDdlTime=1556531525]
Location	file:/user/hive/warehouse/t
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat	org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties	[serialization.format=1]

Hive side:

hive> DESC FORMATTED t;
OK
# col_name            	data_type           	comment

c1                  	int
c2                  	int

# Detailed Table Information
Database:           	default
Owner:              	root
CreateTime:         	Wed May 08 03:38:46 GMT-07:00 2019
LastAccessTime:     	UNKNOWN
Retention:          	0
Location:           	file:/user/hive/warehouse/t
Table Type:         	MANAGED_TABLE
Table Parameters:
	bucketing_version   	spark
	spark.sql.create.version	3.0.0-SNAPSHOT
	spark.sql.sources.provider	parquet
	spark.sql.sources.schema.bucketCol.0	c1
	spark.sql.sources.schema.numBucketCols	1
	spark.sql.sources.schema.numBuckets	2
	spark.sql.sources.schema.numParts	1
	spark.sql.sources.schema.numSortCols	1
	spark.sql.sources.schema.part.0	{\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c2\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
	spark.sql.sources.schema.sortCol.0	c1
	transient_lastDdlTime	1557311926

# Storage Information
SerDe Library:      	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:        	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:       	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:         	No
Num Buckets:        	-1
Bucket Columns:     	[]
Sort Columns:       	[]
Storage Desc Params:
	path                	file:/user/hive/warehouse/t
	serialization.format	1

So it is a non-bucketed table on the Hive side. This PR sets the SerDe correctly so that Hive can read these tables.

Related code:

```scala
table.bucketSpec match {
  case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>
    hiveTable.setNumBuckets(bucketSpec.numBuckets)
    hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
    if (bucketSpec.sortColumnNames.nonEmpty) {
      hiveTable.setSortCols(
        bucketSpec.sortColumnNames
          .map(col => new Order(col, HIVE_COLUMN_ORDER_ASC))
          .toList
          .asJava
      )
    }
  case _ =>
}
```

```scala
if (bucketSpec.isDefined) {
  val BucketSpec(numBuckets, bucketColumnNames, sortColumnNames) = bucketSpec.get
  properties.put(DATASOURCE_SCHEMA_NUMBUCKETS, numBuckets.toString)
  properties.put(DATASOURCE_SCHEMA_NUMBUCKETCOLS, bucketColumnNames.length.toString)
  bucketColumnNames.zipWithIndex.foreach { case (bucketCol, index) =>
    properties.put(s"$DATASOURCE_SCHEMA_BUCKETCOL_PREFIX$index", bucketCol)
  }
  if (sortColumnNames.nonEmpty) {
    properties.put(DATASOURCE_SCHEMA_NUMSORTCOLS, sortColumnNames.length.toString)
    sortColumnNames.zipWithIndex.foreach { case (sortCol, index) =>
      properties.put(s"$DATASOURCE_SCHEMA_SORTCOL_PREFIX$index", sortCol)
    }
  }
}
```
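For context, Spark already maps a data source provider name to a Hive-compatible storage descriptor via HiveSerDe.sourceToSerDe. A minimal sketch of that lookup (illustrative only; whether this PR routes through exactly this helper is not shown in the excerpt above):

```scala
import org.apache.spark.sql.internal.HiveSerDe

// For "parquet" this yields the ParquetHiveSerDe and the Mapred Parquet
// input/output formats shown in the Hive-side DESC FORMATTED output above,
// instead of the LazySimpleSerDe/SequenceFile placeholder.
HiveSerDe.sourceToSerDe("parquet").foreach { s =>
  println(s"serde=${s.serde}")
  println(s"inputFormat=${s.inputFormat}")
  println(s"outputFormat=${s.outputFormat}")
}
```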

How was this patch tested?

Unit tests.

SparkQA commented Apr 29, 2019

Test build #104987 has finished for PR 24486 at commit 6ce0a32.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum commented Apr 29, 2019

retest this please

SparkQA commented Apr 29, 2019

Test build #104994 has finished for PR 24486 at commit 6ce0a32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-27592][SQL] Write the data of table write information to metadata [SPARK-27592][SQL] Set the bucketed data source table SerDe correctly May 4, 2019
SparkQA commented May 4, 2019

Test build #105125 has finished for PR 24486 at commit 2fdc3a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum commented May 4, 2019

retest this please

SparkQA commented May 4, 2019

Test build #105126 has finished for PR 24486 at commit 2fdc3a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// Our bucketing is incompatible with Hive (different hash function),
// but downstream systems (Hive/Presto) can still read it as a non-bucketed table.
// We set the SerDe correctly and bucketing_version to spark.
// The downstream systems decide how to read it themselves; a similar implementation:
```
Review comment (Member):
What is the impact if the downstream makes a wrong decision?

@wangyum (Member, Author) replied:
Sorry. It's not a bucketed table on the Hive side. Related code:

```scala
if (bucketSpec.isDefined) {
  val BucketSpec(numBuckets, bucketColumnNames, sortColumnNames) = bucketSpec.get
  properties.put(DATASOURCE_SCHEMA_NUMBUCKETS, numBuckets.toString)
  properties.put(DATASOURCE_SCHEMA_NUMBUCKETCOLS, bucketColumnNames.length.toString)
  bucketColumnNames.zipWithIndex.foreach { case (bucketCol, index) =>
    properties.put(s"$DATASOURCE_SCHEMA_BUCKETCOL_PREFIX$index", bucketCol)
  }
  if (sortColumnNames.nonEmpty) {
    properties.put(DATASOURCE_SCHEMA_NUMSORTCOLS, sortColumnNames.length.toString)
    sortColumnNames.zipWithIndex.foreach { case (sortCol, index) =>
      properties.put(s"$DATASOURCE_SCHEMA_SORTCOL_PREFIX$index", sortCol)
    }
  }
}
```

```scala
table.bucketSpec match {
  case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>
    hiveTable.setNumBuckets(bucketSpec.numBuckets)
    hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
    if (bucketSpec.sortColumnNames.nonEmpty) {
      hiveTable.setSortCols(
        bucketSpec.sortColumnNames
          .map(col => new Order(col, HIVE_COLUMN_ORDER_ASC))
          .toList
          .asJava
      )
    }
  case _ =>
}
```

SparkQA commented May 8, 2019

Test build #105253 has finished for PR 24486 at commit 4e1dd5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) left a comment:
Have you manually tested it, i.e. reading the Spark bucketed table on the Hive side as a non-bucketed table?

wangyum commented May 9, 2019

> Have you manually tested it, i.e. reading the Spark bucketed table on the Hive side as a non-bucketed table?

Yes, I have tested it. Note that we should set spark.sql.parquet.writeLegacyFormat to true.
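For illustration, a minimal sketch of that manual test setup (reusing the table t from the PR description; spark.sql.parquet.writeLegacyFormat is the standard Spark config):

```scala
// Write Parquet in the legacy format so Hive's Parquet reader can consume
// the files, then insert a row into the bucketed table from the description.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.sql("INSERT INTO t SELECT 1, 2")
```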

```scala
(None, message)
"Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. " +
"But Hive can read it as not bucketed table."
(Some(newHiveCompatibleMetastoreTable(serde)), message)
```
Review comment (Member):
Should we set bucketSpec = None?

@wangyum (Member, Author) replied:
Not necessary: for a data source table, DDLUtils.isHiveTable(table) is false, so the match below falls through to the empty case _ branch and never sets the Hive-side bucket fields:

```scala
table.bucketSpec match {
  case Some(bucketSpec) if DDLUtils.isHiveTable(table) =>
    hiveTable.setNumBuckets(bucketSpec.numBuckets)
    hiveTable.setBucketCols(bucketSpec.bucketColumnNames.toList.asJava)
    if (bucketSpec.sortColumnNames.nonEmpty) {
      hiveTable.setSortCols(
        bucketSpec.sortColumnNames
          .map(col => new Order(col, HIVE_COLUMN_ORDER_ASC))
          .toList
          .asJava
      )
    }
  case _ =>
}
```

```scala
// It's not a bucketed table on the Hive side.
val client =
  spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client
val hiveSide = client.runSqlHive("DESC FORMATTED t")
```
@gatorsmile (Member) commented May 22, 2019:
Also check the results of the read path.
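A hedged sketch of what that read-path check could look like, reusing the client from the quoted test code above (the tab-separated row layout asserted here is an assumption, not taken from the PR):

```scala
// Verify Hive can actually read the data back, not just DESC the table.
val hiveReadback = client.runSqlHive("SELECT c1, c2 FROM t")
// runSqlHive returns rows as strings; the exact formatting is assumed.
assert(hiveReadback === Seq("1\t2"))
```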

SparkQA commented May 23, 2019

Test build #105723 has finished for PR 24486 at commit 26c895a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 23, 2019

Test build #105725 has finished for PR 24486 at commit 87b302c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum commented Aug 15, 2019

ping @cloud-fan


```scala
}

test("Set the bucketed data source table SerDe correctly") {
```
Review comment (Contributor):
Let's include the JIRA id in the test name.

@cloud-fan (Contributor):
LGTM

SparkQA commented Aug 15, 2019

Test build #109147 has finished for PR 24486 at commit 842bd3e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum commented Aug 15, 2019

retest this please

SparkQA commented Aug 15, 2019

Test build #109149 has finished for PR 24486 at commit 842bd3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
thanks, merging to master!

@cloud-fan cloud-fan closed this in 1b416a0 Aug 15, 2019
@wangyum wangyum deleted the SPARK-27592 branch August 15, 2019 09:39
```scala
|CLUSTERED BY (c1)
|SORTED BY (c1)
|INTO 2 BUCKETS
|AS SELECT 1 AS c1, 2 AS c2
```
Review comment (Member):
Only one row makes it hard to prove that Hive can read it correctly. Could you improve the tests?

In addition, try creating a partitioned and bucketed table and see whether it is readable by Hive (see the sketch below).

You can create a separate test suite for it.
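A hedged sketch of the suggested extra case (the table name t2 and the inserted values are hypothetical):

```scala
// A partitioned AND bucketed data source table, to check that Hive can
// read this combination too.
spark.sql(
  """
    |CREATE TABLE t2 (c1 INT, c2 INT, p STRING) USING parquet
    |PARTITIONED BY (p)
    |CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS
  """.stripMargin)
spark.sql("INSERT INTO t2 PARTITION (p = 'a') SELECT 1, 2")
// Hive side: SELECT c1, c2 FROM t2 WHERE p = 'a'
```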

