[SPARK-33477][SQL] Hive Metastore support filter by date type #30408
Conversation
Test build #131274 has finished for PR 30408 at commit

Kubernetes integration test starting

Kubernetes integration test status success
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
```scala
(Literal(1) === a("intcol", IntegerType)) :: (Literal("a") === a("strcol", IntegerType)) :: Nil,
  "1 = intcol and \"a\" = strcol")
// ...
filterTest("date filter",
```
Do we run these tests with different Hive versions?
Different Hive versions are tested by HivePartitionFilteringSuite:
spark/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
Lines 290 to 343 in ba2f553
```scala
test("getPartitionsByFilter: date type pruning by metastore") {
  val table = CatalogTable(
    identifier = TableIdentifier("test_date", Some("default")),
    tableType = CatalogTableType.MANAGED,
    schema = new StructType().add("value", "int").add("part", "date"),
    partitionColumnNames = Seq("part"),
    storage = storageFormat)
  client.createTable(table, ignoreIfExists = false)
  val partitions =
    for {
      date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
    } yield CatalogTablePartition(Map(
      "part" -> date
    ), storageFormat)
  assert(partitions.size == 4)
  client.createPartitions("default", "test_date", partitions, ignoreIfExists = false)

  def testDataTypeFiltering(
      filterExprs: Seq[Expression],
      expectedPartitionCubes: Seq[Seq[Date]]): Unit = {
    val filteredPartitions = client.getPartitionsByFilter(
      client.getTable("default", "test_date"),
      filterExprs,
      SQLConf.get.sessionLocalTimeZone)
    val expectedPartitions = expectedPartitionCubes.map { expectedDt =>
      for {
        dt <- expectedDt
      } yield Set(
        "part" -> dt.toString
      )
    }.reduce(_ ++ _)
    assert(filteredPartitions.map(_.spec.toSet).toSet == expectedPartitions.toSet)
  }

  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
    Seq("2019-01-01").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() > Date.valueOf("2019-01-02")),
    Seq("2019-01-03", "2019-01-04").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(In(AttributeReference("part", DateType)(),
      Seq("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d))))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(InSet(AttributeReference("part", DateType)(),
      Set("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d)).eval(EmptyRow)))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
}
```
@wangyum can you resolve the conflicts? thanks!
# Conflicts:
#   sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
#   sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
#   sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
Last question about correctness: does Hive execute the partition predicate as a date comparison or a string comparison? The latter could be problematic.

Kubernetes integration test starting

Kubernetes integration test status success

Test build #131357 has finished for PR 30408 at commit
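As an aside on the date-vs-string comparison question above: for zero-padded ISO-8601 date strings (yyyy-MM-dd), lexicographic order happens to coincide with date order, but the coincidence breaks as soon as the values are not uniformly padded. A minimal sketch (hypothetical object name, not Spark code):

```scala
object DateStringOrdering {
  // Zero-padded ISO-8601 dates: lexicographic order matches date order.
  def isoMatches: Boolean = "2019-01-02" < "2019-01-10" // true

  // Unpadded month/day: as dates 2019-1-2 precedes 2019-01-10, but as
  // strings '1' > '0' at the month position, so the comparison flips.
  def unpaddedBreaks: Boolean = "2019-1-2" < "2019-01-10" // false
}
```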
```scala
  "2019-01-01 = datecol and \"a\" = strcol")
// ...
filterTest("date filter with null",
  (a("datecol", DateType) === Literal(null)) :: Nil,
```
Not related to this PR, but we could also push down the `col is null` predicate to Hive for this case.
```scala
}
// ...
testDataTypeFiltering(
  Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
```
Can we create an `attr` method to get the AttributeReference from the table, to follow the other tests?
cloud-fan left a comment

LGTM except one comment for the test.
maropu left a comment

The other parts look fine.
```scala
def unapply(values: Set[Any]): Option[Seq[String]] = {
  val extractables = values.toSeq.map(valueToLiteralString.lift)
  if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
```
Why do we need `forall` here? Can `InSet` have mixed values, e.g. ints and other types?
Otherwise this test will fail:

```scala
filterTest("string filter with InSet predicate",
  (InSet(a("stringcol", StringType),
    Range(1, 3).map(d => UTF8String.fromString(d.toString)).toSet)) :: Nil,
  "(stringcol = \"1\" or stringcol = \"2\")")
```

```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableDateValues$1$.$anonfun$unapply$7(HiveShim.scala:720)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.Iterator.foreach(Iterator.scala:941)
  at scala.collection.Iterator.foreach$(Iterator.scala:941)
```
Ah, ok. Thanks.
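The guard discussed above can be illustrated standalone. The following is a minimal sketch (hypothetical object name, with a simplified `valueToLiteralString` that only handles `Int`) of how lifting the partial function plus the `forall` check rejects mixed-type sets instead of throwing `None.get`:

```scala
object ExtractableValuesSketch {
  // Simplified partial conversion: only Ints are convertible here.
  val valueToLiteralString: PartialFunction[Any, String] = {
    case i: Int => i.toString
  }

  def unapply(values: Set[Any]): Option[Seq[String]] = {
    // Lifting the partial function yields an Option per value; the
    // forall guard bails out when any value is not convertible, so we
    // never call .get on a None produced by a mixed-type set.
    val extractables = values.toSeq.map(valueToLiteralString.lift)
    if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
      Some(extractables.map(_.get))
    } else {
      None
    }
  }
}
```

With the guard, `ExtractableValuesSketch.unapply(Set[Any](1, "a"))` simply returns `None`, so the caller falls back instead of crashing.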
Test build #131624 has finished for PR 30408 at commit

retest this please

Test build #131631 has finished for PR 30408 at commit
@shaneknapp Did you set : This issue should be fixed if we set
@wangyum How about asking it on the spark-dev thread so that Shane can notice it quickly? http://apache-spark-developers-list.1001551.n3.nabble.com/jenkins-downtime-tomorrow-evening-weekend-tt30405.html

retest this please

Test build #131689 has finished for PR 30408 at commit
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala
Test build #131740 has finished for PR 30408 at commit

retest this please

Merged to master.

Test build #131751 has finished for PR 30408 at commit
```scala
val partitions =
  for {
    date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
```
How about NULL?
@wangyum Could you add more test cases to check the NULL handling cases? For example,
please check https://spark.apache.org/docs/3.0.1/sql-ref-null-semantics.html#comp-operators

OK
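As a rough illustration of the comparison semantics in the linked doc (not Spark code): modeling NULL as `None` in Scala shows why any comparison involving NULL yields NULL rather than true or false, which is what the extra test cases need to cover:

```scala
object NullSemantics {
  // NULL modeled as None. The for-comprehension short-circuits on the
  // first None, so any comparison involving NULL yields NULL (None),
  // never Some(true) or Some(false).
  def sqlEquals[A](a: Option[A], b: Option[A]): Option[Boolean] =
    for { x <- a; y <- b } yield x == y
}
```

In particular, `NullSemantics.sqlEquals(None, None)` is `None`: in SQL, `NULL = NULL` is NULL, not true, which is why an equality filter never matches a NULL partition value.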
What changes were proposed in this pull request?
Hive Metastore supports strings and integral types in filters. It could also support dates. Please see HIVE-5679 for more details.
This PR adds support for it.
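A hedged sketch of the kind of conversion this enables (hypothetical mini-AST and function names, not Spark's actual HiveShim code): a date predicate is rendered as an unquoted yyyy-MM-dd metastore filter string, matching the expected strings in FiltersSuite such as `"2019-01-01 = datecol"`:

```scala
import java.sql.Date

// Hypothetical mini-AST standing in for Catalyst predicates.
sealed trait DateFilter
case class DateEquals(col: String, value: Date) extends DateFilter
case class DateGreaterThan(col: String, value: Date) extends DateFilter

object MetastoreFilterSketch {
  // java.sql.Date.toString already yields zero-padded yyyy-MM-dd,
  // so the value can be emitted unquoted in the filter string.
  def toMetastoreFilter(f: DateFilter): String = f match {
    case DateEquals(col, v)      => s"$col = ${v.toString}"
    case DateGreaterThan(col, v) => s"$col > ${v.toString}"
  }
}
```

For example, `toMetastoreFilter(DateEquals("part", Date.valueOf("2019-01-01")))` yields `part = 2019-01-01`, the form the metastore can evaluate for partition pruning.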
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test.