
Commit bd83f7b

[SPARK-23421][SPARK-22356][SQL] Document the behavior change in
## What changes were proposed in this pull request?

#19579 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?

Also update the HiveExternalCatalogVersionsSuite to verify it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20606 from gatorsmile/addMigrationGuide.

(cherry picked from commit a77ebb0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
1 parent: a5a8a86

2 files changed: 4 additions & 2 deletions


docs/sql-programming-guide.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -1963,6 +1963,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.1 to 2.2
 
 - Spark 2.1.1 introduced a new configuration key: `spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of `NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 changes this setting's default value to `INFER_AND_SAVE` to restore compatibility with reading Hive metastore tables whose underlying file schema have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on first access Spark will perform schema inference on any Hive metastore table for which it has not already saved an inferred schema. Note that schema inference can be a very time consuming operation for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to avoid the initial overhead of schema inference. Note that with the new default `INFER_AND_SAVE` setting, the results of the schema inference are saved as a metastore key for future use. Therefore, the initial schema inference occurs only at a table's first access.
+
+- Since Spark 2.2.1 and 2.3.0, the schema is always inferred at runtime when the data source tables have the columns that exist in both partition schema and data schema. The inferred schema does not have the partitioned columns. When reading the table, Spark respects the partition values of these overlapping columns instead of the values stored in the data source files. In 2.2.0 and 2.1.x release, the inferred schema is partitioned but the data of the table is invisible to users (i.e., the result set is empty).
 
 ## Upgrading From Spark SQL 2.0 to 2.1
 
```

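The new bullet is easier to follow with a small end-to-end example. Below is a minimal sketch, not part of the commit, assuming a local `SparkSession`; the table name `tbl_with_col_overlap`, the path, and the sample values are made up. It writes Parquet files that themselves contain the partition column `p`, so `p` appears in both the data schema and the partition schema, which is exactly the case the migration note describes:

```scala
import org.apache.spark.sql.SparkSession

object OverlapSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("overlap-schema-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical location; the files written below also contain the column `p`.
    val path = "/tmp/tbl_with_col_overlap"
    Seq((1, 1, 1)).toDF("i", "p", "j")
      .write.mode("overwrite").parquet(s"$path/p=1")

    spark.sql(s"CREATE TABLE tbl_with_col_overlap USING parquet OPTIONS (path '$path')")

    // On 2.2.1 and 2.3.0+, the schema is inferred at runtime and the overlapping
    // column is dropped from the data schema, so the table should expose i, j, p,
    // with p taken from the partition directory rather than from the files.
    println(spark.table("tbl_with_col_overlap").columns.mkString(", "))
    spark.stop()
  }
}
```

On 2.2.0 and 2.1.x the same table would come back with a partitioned schema but an empty result set, which is the regression the test change below guards against.
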
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala

Lines changed: 2 additions & 2 deletions
```diff
@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
 
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
 
   protected var spark: SparkSession = _
 
@@ -249,7 +249,7 @@ object PROCESS_TABLES extends QueryTest with SQLTestUtils {
 
       // SPARK-22356: overlapped columns between data and partition schema in data source tables
       val tbl_with_col_overlap = s"tbl_with_col_overlap_$index"
-      // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0.
+      // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0, 2.2.1, 2.3+
       if (testingVersions(index).startsWith("2.1") || testingVersions(index) == "2.2.0") {
         spark.sql("msck repair table " + tbl_with_col_overlap)
         assert(spark.table(tbl_with_col_overlap).columns === Array("i", "j", "p"))
```

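The branch shown in the diff covers only the legacy releases, which run `msck repair table` before checking the columns. As a complement, here is a hedged sketch of the behavior the migration note promises for 2.2.1 and 2.3.0+: the partition value should win over the value stored in the data files. The table name, path, and values are illustrative and not taken from the suite:

```scala
import org.apache.spark.sql.SparkSession

object PartitionValueWinsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-value-wins-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/tbl_partition_value_wins"
    // The Parquet file stores p = 0, but it is written under the directory p=1.
    Seq((1, 0, 2)).toDF("i", "p", "j")
      .write.mode("overwrite").parquet(s"$path/p=1")

    spark.sql(s"CREATE TABLE tbl_partition_value_wins USING parquet OPTIONS (path '$path')")

    // Per the migration note, 2.2.1 / 2.3.0+ should return p = 1 (the partition
    // value) instead of the 0 stored inside the file.
    spark.sql("SELECT i, j, p FROM tbl_partition_value_wins").show()
    spark.stop()
  }
}
```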