
Conversation

@wangyum
Member

@wangyum wangyum commented Nov 18, 2020

What changes were proposed in this pull request?

Hive Metastore supports strings and integral types in filters. It could also support dates. Please see HIVE-5679 for more details.

This PR adds support for it.
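For context, a minimal sketch of what the conversion looks like: the metastore filter language takes a string in which string literals are quoted while integral (and, with this change, date) literals are rendered bare. The names below are illustrative stand-ins, not the actual HiveShim code.

```scala
import java.sql.Date

// Toy literal model standing in for Catalyst expressions (illustrative only;
// the real conversion lives in HiveShim.scala).
sealed trait Lit
case class StrLit(v: String) extends Lit
case class IntLit(v: Int) extends Lit
case class DateLit(v: Date) extends Lit

// Strings are quoted in the metastore filter string; integral and date
// literals are rendered bare (dates in yyyy-MM-dd form, per HIVE-5679).
def toFilterLiteral(l: Lit): String = l match {
  case StrLit(v)  => "\"" + v + "\""
  case IntLit(v)  => v.toString
  case DateLit(v) => v.toString
}

def eqFilter(col: String, value: Lit): String =
  s"${toFilterLiteral(value)} = $col"

println(eqFilter("intcol", IntLit(1)))                            // 1 = intcol
println(eqFilter("datecol", DateLit(Date.valueOf("2019-01-01")))) // 2019-01-01 = datecol
```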

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Nov 18, 2020
@SparkQA

SparkQA commented Nov 18, 2020

Test build #131274 has finished for PR 30408 at commit ba2f553.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35878/

@SparkQA

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35878/

(Literal(1) === a("intcol", IntegerType)) :: (Literal("a") === a("strcol", StringType)) :: Nil,
"1 = intcol and \"a\" = strcol")

filterTest("date filter",
Contributor

Do we run these tests with different Hive versions?

Member Author

Different Hive versions are tested by HivePartitionFilteringSuite:

test("getPartitionsByFilter: date type pruning by metastore") {
  val table = CatalogTable(
    identifier = TableIdentifier("test_date", Some("default")),
    tableType = CatalogTableType.MANAGED,
    schema = new StructType().add("value", "int").add("part", "date"),
    partitionColumnNames = Seq("part"),
    storage = storageFormat)
  client.createTable(table, ignoreIfExists = false)
  val partitions =
    for {
      date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
    } yield CatalogTablePartition(Map(
      "part" -> date
    ), storageFormat)
  assert(partitions.size == 4)
  client.createPartitions("default", "test_date", partitions, ignoreIfExists = false)

  def testDataTypeFiltering(
      filterExprs: Seq[Expression],
      expectedPartitionCubes: Seq[Seq[Date]]): Unit = {
    val filteredPartitions = client.getPartitionsByFilter(
      client.getTable("default", "test_date"),
      filterExprs,
      SQLConf.get.sessionLocalTimeZone)
    val expectedPartitions = expectedPartitionCubes.map {
      expectedDt =>
        for {
          dt <- expectedDt
        } yield Set(
          "part" -> dt.toString
        )
    }.reduce(_ ++ _)
    assert(filteredPartitions.map(_.spec.toSet).toSet == expectedPartitions.toSet)
  }

  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
    Seq("2019-01-01").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(AttributeReference("part", DateType)() > Date.valueOf("2019-01-02")),
    Seq("2019-01-03", "2019-01-04").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(In(AttributeReference("part", DateType)(),
      Seq("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d))))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
  testDataTypeFiltering(
    Seq(InSet(AttributeReference("part", DateType)(),
      Set("2019-01-01", "2019-01-02").map(d => Literal(Date.valueOf(d)).eval(EmptyRow)))),
    Seq("2019-01-01", "2019-01-02").map(Date.valueOf) :: Nil)
}
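As a side note, the In/InSet cases above end up flattened into an or-chain in the metastore filter string (compare the string-typed expectation `(stringcol = "1" or stringcol = "2")` later in this thread). A minimal sketch of that flattening, with illustrative names rather than the actual HiveShim helpers:

```scala
import java.sql.Date

// Illustrative sketch (not the real HiveShim code) of flattening an InSet
// over date values into the or-chain form the metastore filter accepts;
// Date.toString already yields the yyyy-MM-dd form used in the filter.
def inSetToFilter(col: String, dates: Seq[Date]): String =
  dates.map(d => s"$col = $d").mkString("(", " or ", ")")

println(inSetToFilter("part", Seq("2019-01-01", "2019-01-02").map(Date.valueOf)))
// (part = 2019-01-01 or part = 2019-01-02)
```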

@cloud-fan
Contributor

@wangyum can you resolve the conflicts? thanks!

# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala
@cloud-fan
Contributor

Last question about correctness: does Hive execute the partition predicate as a date comparison or a string comparison? The latter could be problematic.

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35961/

@SparkQA

SparkQA commented Nov 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35961/

@SparkQA

SparkQA commented Nov 19, 2020

Test build #131357 has finished for PR 30408 at commit ce5f0d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExecutorSource(
  • implicit class MetadataColumnsHelper(metadata: Array[MetadataColumn])

@wangyum
Member Author

wangyum commented Nov 19, 2020

"2019-01-01 = datecol and \"a\" = strcol")

filterTest("date filter with null",
(a("datecol", DateType) === Literal(null)) :: Nil,
Contributor

Not related to this PR, but we could also push down the col is null predicate to Hive for this case.

}

testDataTypeFiltering(
Seq(AttributeReference("part", DateType)() === Date.valueOf("2019-01-01")),
Contributor

Can we create an attr method to get the AttributeReference from the table, to follow the other tests?

Contributor

@cloud-fan cloud-fan left a comment


LGTM except one comment for test

Member

@maropu maropu left a comment


The other parts look fine.


def unapply(values: Set[Any]): Option[Seq[String]] = {
val extractables = values.toSeq.map(valueToLiteralString.lift)
if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
Member

Why do we need forall here? Can InSet have mixed values, e.g. ints and other types?

Member Author

@wangyum wangyum Nov 24, 2020


Otherwise this test will fail:

  filterTest("string filter with InSet predicate",
    (InSet(a("stringcol", StringType),
      Range(1, 3).map(d => UTF8String.fromString(d.toString)).toSet)) :: Nil,
    "(stringcol = \"1\" or stringcol = \"2\")")
None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableDateValues$1$.$anonfun$unapply$7(HiveShim.scala:720)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)

Member

Ah, ok. Thanks.
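The exchange above can be reproduced with a self-contained sketch of the lift-and-forall pattern (valueToLiteralString here is a toy stand-in for the partial function in HiveShim.scala; only the guard logic is the point):

```scala
import java.sql.Date

// Toy stand-in for HiveShim's valueToLiteralString: defined for dates only.
val valueToLiteralString: PartialFunction[Any, String] = {
  case d: Date => d.toString
}

object ExtractableDateValues {
  def unapply(values: Set[Any]): Option[Seq[String]] = {
    // `lift` turns the partial function into Any => Option[String], so values
    // it cannot handle (e.g. strings from a string-typed InSet) map to None.
    val extractables = values.toSeq.map(valueToLiteralString.lift)
    // Without forall(_.isDefined), the map(_.get) below would throw
    // java.util.NoSuchElementException: None.get on unsupported value sets.
    if (extractables.nonEmpty && extractables.forall(_.isDefined)) {
      Some(extractables.map(_.get))
    } else {
      None
    }
  }
}

println(ExtractableDateValues.unapply(Set[Any](Date.valueOf("2019-01-01"))))
println(ExtractableDateValues.unapply(Set[Any]("1", "2")))
```

With the guard in place, a set the partial function cannot fully convert simply fails to match the extractor instead of crashing, which is exactly what the string-typed InSet test relies on.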

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131624 has finished for PR 30408 at commit 752eb8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131631 has finished for PR 30408 at commit 752eb8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 24, 2020

@shaneknapp Did you set export LANG=en_US.UTF-8?

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: /home/jenkins/workspace/SparkPullRequestBuilder@3/sql/hive/target/tmp/hive_execution_test_group/warehouse-1355e680-268f-4224-b549-eaddcadcf136/DaTaBaSe_I.db/tab_ı);

This issue should be fixed if we set export LANG=en_US.UTF-8. More details: https://issues.apache.org/jira/browse/SPARK-27177

@maropu
Member

maropu commented Nov 24, 2020

@wangyum How about asking it in the spark-dev thread so that Shane can notice it quickly? http://apache-spark-developers-list.1001551.n3.nabble.com/jenkins-downtime-tomorrow-evening-weekend-tt30405.html

@wangyum
Member Author

wangyum commented Nov 24, 2020

retest this please

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131689 has finished for PR 30408 at commit 752eb8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-33477][SQL] Hive Metastore should support filter by date type [SPARK-33477][SQL] Hive Metastore support filter by date type Nov 25, 2020
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131740 has finished for PR 30408 at commit 29c489a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 25, 2020

retest this please.

@HyukjinKwon
Member

Merged to master.

@wangyum wangyum deleted the SPARK-33477 branch November 25, 2020 08:16
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131751 has finished for PR 30408 at commit 29c489a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val partitions =
for {
date <- Seq("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04")
Member

How about NULL?

@gatorsmile
Member

@wangyum Could you add more test cases to check the NULL handling cases? For example,

  • Include NULL values in the data set
  • Include NULL values in the predicates
  • Include null-safe equals

Please check https://spark.apache.org/docs/3.0.1/sql-ref-null-semantics.html#comp-operators
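For reference, the comparison semantics that page describes can be modeled in a few lines (a sketch only, with Option standing in for nullable values): standard = yields NULL when either operand is NULL, while null-safe <=> always yields true or false.

```scala
// NULL modeled as None; standard equality is three-valued, so a NULL operand
// makes the whole comparison unknown (None).
def eq(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for { x <- a; y <- b } yield x == y

// Null-safe equals (<=>) never returns unknown: NULL <=> NULL is true.
def nullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

println(eq(Some(1), Some(1)))   // Some(true)
println(eq(None, Some(1)))      // None
println(nullSafeEq(None, None)) // true
```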

@wangyum
Member Author

wangyum commented Jan 3, 2021

OK


7 participants