docs/sql-data-sources-parquet.md: 12 changes (0 additions, 12 deletions)
@@ -295,18 +295,6 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession`
</p>
</td>
</tr>
<tr>
Comment (Member Author): The configuration is not only about Parquet. Since it is disabled, and to avoid confusing users, I think we can just remove the doc here.

<td><code>spark.sql.optimizer.metadataOnly</code></td>
<td>true</td>
<td>
<p>
When true, enable the metadata-only query optimization that use the table's metadata to
produce the partition columns instead of table scans. It applies when all the columns scanned
are partition columns and the query has an aggregate operator that satisfies distinct
semantics.
</p>
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.writeLegacyFormat</code></td>
<td>false</td>
@@ -585,9 +585,10 @@ object SQLConf {
.doc("When true, enable the metadata-only query optimization that use the table's metadata " +
"to produce the partition columns instead of table scans. It applies when all the columns " +
"scanned are partition columns and the query has an aggregate operator that satisfies " +
"distinct semantics.")
"distinct semantics. By default the optimization is disabled, since it may return " +
"incorrect results with empty tables.")
Comment (Member): may return incorrect results when the files are empty.

.booleanConf
.createWithDefault(true)
.createWithDefault(false)
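The `.doc` string above states the rule's applicability condition. A minimal, hypothetical sketch of that predicate, with invented type names and no Spark dependency (Spark's actual rule inspects the logical plan, not strings): the rewrite may fire only when every scanned column is a partition column and each aggregate's result is unchanged by deduplicating its input.

```scala
// Hypothetical, simplified model of the condition in the .doc string:
// all scanned columns are partition columns, and the aggregate satisfies
// distinct semantics (e.g. MAX/MIN, or an explicitly DISTINCT input).
final case class ScanInfo(scanned: Set[String], partitionCols: Set[String])

object MetadataOnlyApplicability {
  // Aggregates whose result is unchanged by deduplicated input.
  private val distinctSemantics = Set("max", "min")

  def applicable(scan: ScanInfo, aggregates: Seq[String], isDistinct: Boolean): Boolean =
    scan.scanned.subsetOf(scan.partitionCols) &&
      (isDistinct || aggregates.forall(a => distinctSemantics(a.toLowerCase)))
}
```

Under this model, `MAX(p1)` over only partition columns qualifies, while any query touching a data column such as `col1` does not.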
Comment (Member): How about making this conf internal and always printing a warning message when this flag is enabled and the rule applied?

val COLUMN_NAME_OF_CORRUPT_RECORD = buildConf("spark.sql.columnNameOfCorruptRecord")
.doc("The name of internal column for storing raw/un-parsed JSON and CSV records that fail " +
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: 13 changes (12 additions, 1 deletion)
@@ -2422,7 +2422,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
Row(s"$expected") :: Nil)
}

test("SPARK-15752 optimize metadata only query for datasource table") {
ignore("SPARK-15752 optimize metadata only query for datasource table") {
Comment (Member Author): This is kind of aggressive. I can revert this one if anyone objects.

Comment (Contributor): I object. You are removing integration testing for whoever is using the feature.

Comment (Contributor): We should either fix the rule, or remove it completely if it has fundamental issues. I don't think the Hive approach is the right thing to do: we should not leave a problematic rule that users can enable.

Comment (Member Author): @hvanhovell @cloud-fan I see. As per the comments, the most appropriate solution is to delete the whole rule/configuration. I will update this PR.

withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
withTable("srcpart_15752") {
val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) "a" else "b"))
@@ -2966,6 +2966,17 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}
}

test("SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly") {
Comment (Contributor): Also add a case that illustrates that OptimizeMetadataOnlyQuery can produce wrong results.
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "false") {
withTable("t") {
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
checkAnswer(sql("SELECT MAX(p1) FROM t"), Row(null))
checkAnswer(sql("SELECT MAX(col1) FROM t"), Row(null))
}
}
}
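The wrong answer the reviewers describe can be illustrated without a Spark cluster. In the test above, `range(1, 1)` is empty, so partition `p1 = 5` exists in the catalog but its data file holds zero rows; a metadata-only plan that answers `MAX(p1)` from the partition listing still sees the value 5. A dependency-free sketch with invented names, contrasting the two plans:

```scala
// Hypothetical model of SPARK-26709: a partition registered in the catalog
// whose data file contains zero rows.
final case class Partition(p1: Int, col1Values: Seq[Int])

object MetadataOnlyDemo {
  val table: Seq[Partition] = Seq(Partition(p1 = 5, col1Values = Seq.empty))

  // Correct plan: scan the rows; MAX over zero rows is NULL (None here).
  def maxP1ByScan: Option[Int] =
    table.flatMap(p => p.col1Values.map(_ => p.p1)).reduceOption(_ max _)

  // Metadata-only plan: answer from the partition listing without touching
  // the files, so the empty partition still contributes its value.
  def maxP1ByMetadata: Option[Int] =
    table.map(_.p1).reduceOption(_ max _)
}

// maxP1ByScan     returns None    -- matches checkAnswer(..., Row(null))
// maxP1ByMetadata returns Some(5) -- the incorrect answer
```

This is the case the Contributor asked for: the two plans disagree exactly when a partition's files are empty.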
}

case class Foo(bar: Option[String])
@@ -58,7 +58,7 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest with SharedSQLContext {
}

private def testMetadataOnly(name: String, sqls: String*): Unit = {
test(name) {
ignore(name) {
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
sqls.foreach { case q => assertMetadataOnlyQuery(sql(q)) }
}
@@ -69,7 +69,7 @@ class OptimizeMetadataOnlyQuerySuite extends QueryTest with SharedSQLContext {
}

private def testNotMetadataOnly(name: String, sqls: String*): Unit = {
test(name) {
ignore(name) {
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
sqls.foreach { case q => assertNotMetadataOnlyQuery(sql(q)) }
}
@@ -86,6 +86,17 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
assert(message.contains("Table or view not found"))
}

test("SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly") {
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "false") {
withTable("t") {
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
checkAnswer(sql("SELECT MAX(p1) FROM t"), Row(null))
checkAnswer(sql("SELECT MAX(col1) FROM t"), Row(null))
}
}
}

test("script") {
assume(TestUtils.testCommandAvailable("/bin/bash"))
assume(TestUtils.testCommandAvailable("echo | sed"))
@@ -1770,7 +1781,7 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}

test("SPARK-15752 optimize metadata only query for hive table") {
ignore("SPARK-15752 optimize metadata only query for hive table") {
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
withTable("data_15752", "srcpart_15752", "srctext_15752") {
val df = Seq((1, "2"), (3, "4")).toDF("key", "value")