[SPARK-23313][DOC] Add a migration guide for ORC #20484
@@ -1776,6 +1776,42 @@ working with timestamps in `pandas_udf`s to get the best performance, see

## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values (see the configuration sketch after the two tables below).
**Review comment (Member):** Shall we separate newly added configurations and changed ones?

**Reply (Author):** Yep. Now, we have two tables.
- New configurations

<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
<tr>
<td><code>spark.sql.orc.impl</code></td>
<td><code>native</code></td>
<td>The name of the ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. <code>hive</code> means the ORC library in Hive 1.2.1, which was used prior to Spark 2.3.</td>
</tr>
<tr>
<td><code>spark.sql.orc.enableVectorizedReader</code></td>
<td><code>true</code></td>
<td>Enables vectorized ORC decoding in the <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in the <code>native</code> implementation. This option is ignored for the <code>hive</code> implementation.</td>
</tr>
</table>

- Changed configurations

<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
<tr>
<td><code>spark.sql.orc.filterPushdown</code></td>
<td><code>true</code></td>
<td>Enables filter pushdown for ORC files. It was <code>false</code> by default prior to Spark 2.3.</td>
</tr>
<tr>
<td><code>spark.sql.hive.convertMetastoreOrc</code></td>
<td><code>true</code></td>
<td>Enables Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It was <code>false</code> by default prior to Spark 2.3.</td>
</tr>
</table>
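
As an illustration only (not part of the PR diff), the sketch below shows one way to set the configurations from the two tables above on a `SparkSession`. It is a minimal sketch under assumptions: the application name is hypothetical, the values shown are the Spark 2.3 defaults, and falling back to the pre-2.3 behavior at runtime is just one possible use case.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: pin the ORC-related configurations explicitly at start-up.
// The values below are the Spark 2.3 defaults listed in the tables above.
val spark = SparkSession.builder()
  .appName("orc-migration-example")                       // hypothetical application name
  .config("spark.sql.orc.impl", "native")                 // "hive" keeps the pre-2.3 ORC library
  .config("spark.sql.orc.enableVectorizedReader", "true") // honored only by the native implementation
  .config("spark.sql.orc.filterPushdown", "true")         // default was false before Spark 2.3
  .config("spark.sql.hive.convertMetastoreOrc", "true")   // default was false before Spark 2.3
  .getOrCreate()

// These are runtime SQL configurations, so they can also be changed on a
// running session, e.g. to fall back to the pre-2.3 behavior temporarily:
spark.conf.set("spark.sql.orc.impl", "hive")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```

Since these are ordinary Spark configurations, they can equally be supplied with `--conf` on `spark-submit` instead of in code.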

- Since Apache ORC 1.4.1 is a standalone library that provides a subset of the Hive ORC related configurations, you can use either the ORC configuration names or the Hive configuration names. To see a full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.

- Since Spark 2.3, queries on raw JSON/CSV files are disallowed when the referenced columns include only the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()` are disallowed. Instead, you can cache or save the parsed results and then run the same query, e.g. `val df = spark.read.schema(schema).json(file).cache()` followed by `df.filter($"_corrupt_record".isNotNull).count()` (see the first sketch below).
- The `percentile_approx` function previously accepted only numeric input and returned results of double type. Now it also supports date and timestamp types as input, and the result type matches the input type, which is more reasonable for percentiles (see the second sketch below).
- Since Spark 2.3, the deterministic predicates of a Join/Filter that appear after the first non-deterministic predicate are also pushed down/through the child operators, if possible. In prior Spark versions, these filters were not eligible for predicate pushdown (see the third sketch below).
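
To make the first item above concrete, here is a minimal sketch (not part of the PR diff) of the cache-then-query workaround; the schema fields and the input path are hypothetical placeholders, and `spark` is assumed to be an existing `SparkSession`.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import spark.implicits._   // `spark` is an existing SparkSession

// Hypothetical schema; `_corrupt_record` is the default corrupt-record column.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
  .add("_corrupt_record", StringType)

// Since Spark 2.3 this style of query is disallowed, because only the
// internal corrupt record column is referenced:
//   spark.read.schema(schema).json("/path/to/data.json")
//     .filter($"_corrupt_record".isNotNull).count()

// Workaround from the migration note: parse once, cache, then run the query.
val df = spark.read.schema(schema).json("/path/to/data.json").cache()
df.filter($"_corrupt_record".isNotNull).count()
df.select("_corrupt_record").show()
```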
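
A minimal sketch for the second item (again not part of the PR diff), using hypothetical sample data: since the result type of `percentile_approx` now matches the input type, the query below returns a timestamp rather than a double.

```scala
import java.sql.Timestamp
import spark.implicits._   // `spark` is an existing SparkSession

// Hypothetical sample data with a timestamp column.
val events = Seq(
  (1, Timestamp.valueOf("2018-01-01 00:00:00")),
  (2, Timestamp.valueOf("2018-01-02 00:00:00")),
  (3, Timestamp.valueOf("2018-01-03 00:00:00"))
).toDF("id", "event_time")
events.createOrReplaceTempView("events")

// In Spark 2.3 the result type matches the input type, so this returns a
// timestamp; prior versions accepted only numeric input and returned a double.
spark.sql(
  "SELECT percentile_approx(event_time, 0.5) AS median_event_time FROM events"
).show()
```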
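
And a minimal sketch for the third item (not part of the PR diff): a filter mixing a non-deterministic predicate with a deterministic one. The input path and column name are hypothetical; the point is only that the deterministic `id > 10` part remains eligible for pushdown where possible, even though it appears after the non-deterministic `rand()` predicate.

```scala
import org.apache.spark.sql.functions.rand
import spark.implicits._   // `spark` is an existing SparkSession

// Hypothetical Parquet input; any data source supporting filter pushdown works.
val ordersDF = spark.read.parquet("/path/to/orders")

// The deterministic predicate (id > 10) follows a non-deterministic one
// (rand() < 0.5); since Spark 2.3 it can still be pushed down where possible.
val sampled = ordersDF.filter(rand() < 0.5 && $"id" > 10)

// Inspect the plan to see where the filters end up.
sampled.explain(true)
```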
**Review comment:** Re: "the following configurations are newly added or change their default values." These are all new, right?

**Reply:** The last two are existing ones~