
Conversation

@wangyum
Member

@wangyum wangyum commented Nov 18, 2020

What changes were proposed in this pull request?

Hive Metastore supports strings and integral types in filters. It could also support dates. Please see HIVE-5679 for more details.

This PR adds support for it.
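For context, a minimal sketch of what the conversion looks like: the metastore filter language takes a string in which string literals are quoted while integral (and, with this change, date) literals are rendered bare. The names below are illustrative stand-ins, not the actual HiveShim code.

```scala
import java.sql.Date

// Toy literal model standing in for Catalyst expressions (illustrative only;
// the real conversion lives in HiveShim.scala).
sealed trait Lit
case class StrLit(v: String) extends Lit
case class IntLit(v: Int) extends Lit
case class DateLit(v: Date) extends Lit

// Strings are quoted in the metastore filter string; integral and date
// literals are rendered bare (dates in yyyy-MM-dd form, per HIVE-5679).
def toFilterLiteral(l: Lit): String = l match {
  case StrLit(v)  => "\"" + v + "\""
  case IntLit(v)  => v.toString
  case DateLit(v) => v.toString
}

def eqFilter(col: String, value: Lit): String =
  s"${toFilterLiteral(value)} = $col"

println(eqFilter("intcol", IntLit(1)))                            // 1 = intcol
println(eqFilter("datecol", DateLit(Date.valueOf("2019-01-01")))) // 2019-01-01 = datecol
```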

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Nov 18, 2020
@SparkQA

SparkQA commented Nov 18, 2020

Test build #131274 has finished for PR 30408 at commit ba2f553.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35878/

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35878/

(Literal(1) === a("intcol", IntegerType)) :: (Literal("a") === a("strcol", StringType)) :: Nil,
"1 = intcol and \"a\" = strcol")

filterTest("date filter",
Contributor

Do we run these tests with different Hive versions?

Member Author

Different Hive versions are tested by HivePartitionFilteringSuite:

test("getPartitionsByFilter: date type pruning by metastore") {
  val table = CatalogTable(
    identifier = TableIdentifier("test_date", Some("default")),
    tableType = CatalogTableType.MANAGED,
    schema = new StructType().add("value", "int").add("part", "date"),
    partitionColumnNames = Seq("part"),
    storage = storageFormat)
  client.createTable(table, ignoreIfExists = false)
  val partitions =
    for {
      date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
    } yield CatalogTablePartition(Map(
      "part" -> date
    ), storageFormat)
  assert(partitions.size == 4)
  client.createPartitions("default", "test_date", partitions, ignoreIfExists = false)

  def testDataTypeFiltering(
      filterExprs: Seq[Expression],
      expectedPartitionCubes: Seq[Seq[Date]]): Unit = {
    val filteredPartitions = client.getPartitionsByFilter(
      client.getTable("default", "test_date"),
      filterExprs,
      SQLConf.get.sessionLocalTimeZone)
    val expectedPartitions = expectedPartitionCubes.map {
      expectedDt =>
        for {
          dt <- expectedDt
        } yield Set(
          "part" -> dt.toString
        )
    }.reduce(_ ++ _)
    assert(filteredPartitions.map(_.spec.toSet).toSet == expectedPartitions.toSet)
  }

  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
    Seq("2019-01-01").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() > Date.valueOf("2019-01-02")),
    Seq("2019-01-03", "2019-01-04").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(In(AttributeReference("part", DateType)(),
      Seq("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d))))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(InSet(AttributeReference("part", DateType)(),
      Set("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d)).eval(EmptyRow)))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
}
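As a side note, the In/InSet cases above end up flattened into an or-chain in the metastore filter string (compare the string-typed expectation `(stringcol = "1" or stringcol = "2")` later in this thread). A minimal sketch of that flattening, with illustrative names rather than the actual HiveShim helpers:

```scala
import java.sql.Date

// Illustrative sketch (not the real HiveShim code) of flattening an InSet
// over date values into the or-chain form the metastore filter accepts;
// Date.toString already yields the yyyy-MM-dd form used in the filter.
def inSetToFilter(col: String, dates: Seq[Date]): String =
  dates.map(d => s"$col = $d").mkString("(", " or ", ")")

println(inSetToFilter("part", Seq("2019-01-01", "2019-01-02").map(Date.valueOf)))
// (part = 2019-01-01 or part = 2019-01-02)
```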

@cloud-fan
Contributor

@wangyum can you resolve the conflicts? thanks!

# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
@cloud-fan
Contributor

Last question about correctness: does Hive execute the partition predicate as a date comparison or a string comparison? The latter could be problematic.

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35961/

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35961/

@SparkQA

SparkQA commented Nov 19, 2020

Test build #131357 has finished for PR 30408 at commit ce5f0d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExecutorSource(
  • implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn])

@wangyum
Member Author

wangyum commented Nov 19, 2020

"2019-01-01 = datecol and \"a\" = strcol")

filterTest("date filter with null",
(a("datecol", DateType) === Literal(null)) :: Nil,
Contributor

Not related to this PR, but we could also push down the col is null predicate to Hive for this case.

}

testDataTypeFiltering(
Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
Contributor

Can we create an attr method to get the AttributeReference from the table, to follow the other tests?

Contributor

@cloud-fan cloud-fan left a comment


LGTM except one comment for test

Member

@maropu maropu left a comment


The other parts look fine.


def unapply(values: Set[Any]): Option[Seq[String]] = {
val extractables = values.toSeq.map(valueToLiteralString.lift)
if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
Member

Why do we need forall here? Can InSet have mixed values, e.g. ints and other types?

Member Author

@wangyum wangyum Nov 24, 2020


Otherwise this test will fail:

  filterTest("string filter with InSet predicate",
    (InSet(a("stringcol", StringType),
      Range(1, 3).map(d => UTF8String.fromString(d.toString)).toSet)) :: Nil,
    "(stringcol = \"1\" or stringcol = \"2\")")
None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableDateValues$1$.$anonfun$unapply$7(HiveShim.scala:720)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)

Member

Ah, ok. Thanks.
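The exchange above can be reproduced with a self-contained sketch of the lift-and-forall pattern (valueToLiteralString here is a toy stand-in for the partial function in HiveShim.scala; only the guard logic is the point):

```scala
import java.sql.Date

// Toy stand-in for HiveShim's valueToLiteralString: defined for dates only.
val valueToLiteralString: PartialFunction[Any, String] = {
  case d: Date => d.toString
}

object ExtractableDateValues {
  def unapply(values: Set[Any]): Option[Seq[String]] = {
    // `lift` turns the partial function into Any => Option[String], so values
    // it cannot handle (e.g. strings from a string-typed InSet) map to None.
    val extractables = values.toSeq.map(valueToLiteralString.lift)
    // Without forall(_.isDefined), the map(_.get) below would throw
    // java.util.NoSuchElementException: None.get on unsupported value sets.
    if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
      Some(extractables.map(_.get))
    } else {
      None
    }
  }
}

println(ExtractableDateValues.unapply(Set[Any](Date.valueOf("2019-01-01"))))
println(ExtractableDateValues.unapply(Set[Any]("1", "2")))
```

With the guard in place, a set the partial function cannot fully convert simply fails to match the extractor instead of crashing, which is exactly what the string-typed InSet test relies on.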

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131624 has finished for PR 30408 at commit 752eb8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131631 has finished for PR 30408 at commit 752eb8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 24, 2020

@shaneknapp Did you set export LANG=en_US.UTF-8?

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/hive/target/tmp/hive_execution_test_group/warehouse-1355e680-268f-4224-b549-eaddcadcf136/DaTaBaSe_I.db/tab_ı);

This issue should be fixed if we set export LANG=en_US.UTF-8. More details: https://issues.apache.org/jira/browse/SPARK-27177

@maropu
Member

maropu commented Nov 24, 2020

@wangyum How about asking it in the spark-dev thread so that Shane can notice it quickly? http://apache-spark-developers-list.1001551.n3.nabble.com/jenkins-downtime-tomorrow-evening-weekend-tt30405.html

@wangyum
Member Author

wangyum commented Nov 24, 2020

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131689 has finished for PR 30408 at commit 752eb8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-33477][SQL] Hive Metastore should support filter by date type [SPARK-33477][SQL] Hive Metastore support filter by date type Nov 25, 2020
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131740 has finished for PR 30408 at commit 29c489a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 25, 2020

retest this please.

@HyukjinKwon
Member

Merged to master.

@wangyum wangyum deleted the SPARK-33477 branch November 25, 2020 08:16
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131751 has finished for PR 30408 at commit 29c489a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val partitions =
for {
date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
Member

How about NULL?

@gatorsmile
Member

@wangyum Could you add more test cases to check the NULL handling cases? For example,

  • Include NULL values in the data set
  • Include NULL values in the predicates
  • Include null-safe equals

Please check https://spark.apache.org/docs/3.0.1/sql-ref-null-semantics.html#comp-operators
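For reference, the comparison semantics that page describes can be modeled in a few lines (a sketch only, with Option standing in for nullable values): standard = yields NULL when either operand is NULL, while null-safe <=> always yields true or false.

```scala
// NULL modeled as None; standard equality is three-valued, so a NULL operand
// makes the whole comparison unknown (None).
def eq(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for { x <- a; y <- b } yield x == y

// Null-safe equals (<=>) never returns unknown: NULL <=> NULL is true.
def nullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

println(eq(Some(1), Some(1)))   // Some(true)
println(eq(None, Some(1)))      // None
println(nullSafeEq(None, None)) // true
```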

@wangyum
Member Author

wangyum commented Jan 3, 2021

OK


7 participants