36 changes: 36 additions & 0 deletions docs/sql-programming-guide.md
@@ -1776,6 +1776,42 @@ working with timestamps in `pandas_udf`s to get the best performance, see

## Upgrading From Spark SQL 2.2 to 2.3

- Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To support this, the following configurations are newly added or have their default values changed.
Member

re: "the following configurations are newly added or change their default values"
These are all new, right?

Member Author

The last two are existing ones.


Member

Shall we separate newly added configurations and changed ones?

Member Author

Yep. Now, we have two tables.

- New configurations

<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
<tr>
<td><code>spark.sql.orc.impl</code></td>
<td><code>native</code></td>
<td>The name of the ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. <code>hive</code> means the ORC library in Hive 1.2.1 which was used prior to Spark 2.3.</td>
</tr>
<tr>
<td><code>spark.sql.orc.enableVectorizedReader</code></td>
<td><code>true</code></td>
<td>Enables vectorized ORC decoding in the <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in the <code>native</code> implementation. This configuration is ignored for the <code>hive</code> implementation.</td>
</tr>
</table>

- Changed configurations

<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
<tr>
<td><code>spark.sql.orc.filterPushdown</code></td>
<td><code>true</code></td>
<td>Enables filter pushdown for ORC files. It was <code>false</code> by default prior to Spark 2.3.</td>
</tr>
<tr>
<td><code>spark.sql.hive.convertMetastoreOrc</code></td>
<td><code>true</code></td>
<td>Enables Spark's ORC support, which can be configured via <code>spark.sql.orc.impl</code>, instead of the Hive SerDe when reading from and writing to Hive ORC tables. It was <code>false</code> by default prior to Spark 2.3.</td>
Contributor

This isn't entirely clear to me. I assume this has to be true for spark.sql.orc.impl to take effect? If so, perhaps we should mention it above under spark.sql.orc.impl. If this is false, what happens? Can it not read the ORC format at all, or does it just fall back to spark.sql.orc.impl=hive?

Member Author (@dongjoon-hyun, Feb 7, 2018)

Yes. This has to be true only for Hive ORC tables.
For other Spark tables created with 'USING ORC', it is irrelevant.

spark.sql.orc.impl and spark.sql.hive.convertMetastoreOrc are orthogonal.
With spark.sql.orc.impl=hive and spark.sql.hive.convertMetastoreOrc=true, Hive ORC tables are converted to the legacy OrcFileFormat based on Hive 1.2.1.

</tr>
</table>
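
A minimal sketch of how these settings interact, assuming a `SparkSession` named `spark`, a hypothetical ORC file at `/tmp/data.orc`, and a hypothetical Hive ORC table `some_hive_orc_table`; it illustrates the tables and the reply above and is not code from this PR:

```scala
// Choose the ORC implementation used by ORC data sources.
spark.conf.set("spark.sql.orc.impl", "native")                  // reader built on Apache ORC 1.4.1
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")  // vectorized decoding, native only
val nativeDf = spark.read.orc("/tmp/data.orc")

spark.conf.set("spark.sql.orc.impl", "hive")                    // legacy reader based on Hive 1.2.1
val hiveDf = spark.read.orc("/tmp/data.orc")

// For Hive ORC tables only, spark.sql.hive.convertMetastoreOrc controls whether the
// table is read through Spark's ORC support above or through the Hive SerDe.
// Tables created with `USING ORC` are unaffected by this flag.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
val tableDf = spark.sql("SELECT * FROM some_hive_orc_table")    // hypothetical table name
```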

- Since Apache ORC 1.4.1 is a standalone library providing a subset of the Hive ORC-related configurations, you can use either the ORC configuration names or the Hive configuration names. To see a full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.
Member Author

@gatorsmile, I added this note.

Member

We might need to explicitly mention that users need to specify the corresponding ORC configuration names when they explicitly or implicitly use the native readers.

Member Author

For supported confs, OrcConf provides a pair of ORC/Hive key names. The ORC keys are recommended but not required.

  STRIPE_SIZE("orc.stripe.size", "hive.exec.orc.default.stripe.size",
      64L * 1024 * 1024, "Define the default ORC stripe size, in bytes."),
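
For illustration only, a sketch of what using either key name from such a pair might look like when writing with the ORC data source; the paths are hypothetical, and whether a given option is actually honored is exactly what the follow-up tests are meant to verify:

```scala
// The same stripe-size setting expressed with each key name of the OrcConf pair above.
val df = spark.range(100).toDF("id")
df.write.mode("overwrite").option("orc.stripe.size", "10000").orc("/tmp/orc_by_orc_key")
df.write.mode("overwrite")
  .option("hive.exec.orc.default.stripe.size", "10000")
  .orc("/tmp/orc_by_hive_key")
```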

Member

You mean these Hive confs work for our native readers? Could you add test cases for them?

Member Author

It's possible in another PR. BTW, about the test coverage:

- Do you want to see specifically orc.stripe.size and hive.exec.orc.default.stripe.size only?
- Do we have existing test coverage for the old Hive ORC code path?

Member

You can do a search. We need to improve our ORC test coverage for sure.

If possible, please add test cases to see whether both orc.stripe.size and hive.exec.orc.default.stripe.size work for both of Spark's ORC readers. We also need the same tests to check whether hive.exec.orc.default.stripe.size works for Hive SerDe tables.

To ensure the correctness of the documentation, I hope we can at least submit a PR for testing them before merging this PR.

Member Author (@dongjoon-hyun, Feb 3, 2018)

Yep. +1. I'll make another PR for that today, @gatorsmile.
(I was wondering if I need to do this for all the other Hive/ORC configurations.)

Member (@gatorsmile, Feb 3, 2018)

Yes. We can check whether some important confs work.

For example,

create table if not exists vectororc (s1 string, s2 string)
stored as ORC tblproperties(
  "orc.row.index.stride"="1000",
  "hive.exec.orc.default.stripe.size"="100000",
  "orc.compress.size"="10000");

After auto conversion, are the confs in tblproperties still being used by our native readers?

We also need to check whether the confs set in the configuration file are also recognized by our native readers.
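
As an illustration of the question (an assumption, not a verified test), the read side of that scenario might look like this:

```scala
// With conversion enabled, reading the Hive ORC table above goes through Spark's
// native reader; the open question is whether the tblproperties are still honored there.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.table("vectororc").show()
```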

Member

Any update on this? @dongjoon-hyun

Member Author

Sorry for the late response, @gatorsmile.

Here, your example is a mixed scenario. First of all, I made a PR, #20517, "Add ORC configuration tests for ORC data source". It adds test coverage for ORC and Hive configuration names in both the native and hive OrcFileFormat. That PR focuses on name compatibility for those important confs.

For convertMetastoreOrc, the table properties are retained when we check them with spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName)). However, they seem to be ignored in some cases. I guess the same happens for Parquet. I'm working on that separately.
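
A minimal sketch of the metadata check described above, assuming the `vectororc` table from the earlier example exists in the current catalog; note that `sessionState` is an internal, unstable API:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Table properties as recorded in the catalog; they are retained after table creation.
val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("vectororc"))
println(meta.properties.get("orc.compress.size"))
```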


- Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
- The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
- Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters were not eligible for predicate pushdown.