Commit 057ccb1

dongjoon-hyun authored and fli committed
[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of them is a `Timestamp` bug (ORC-306) that occurs when the `native` ORC vectorized reader reads an ORC column vector's sub-vectors `times` and `nanos`. ORC-306 fixes the reader according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46), and this PR includes the updated interpretation of ORC column vectors. Note that the `hive` ORC reader and the ORC MR reader are not affected.

Before the fix, the `native` reader returns a timestamp that is one second off:

```scala
scala> spark.version
res0: String = 2.3.0

scala> spark.sql("set spark.sql.orc.impl=native")

scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|value                     |
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
```

This PR updates Apache Spark to use ORC 1.4.4.

**FULL LIST**

| ID | TITLE |
| -- | -- |
| ORC-281 | Fix compiler warnings from clang 5.0 |
| ORC-301 | `extractFileTail` should open a file in `try` statement |
| ORC-304 | Fix TestRecordReaderImpl to not fail with new storage-api |
| ORC-306 | Fix incorrect workaround for bug in java.sql.Timestamp |
| ORC-324 | Add support for ARM and PPC arch |
| ORC-330 | Remove unnecessary Hive artifacts from root pom |
| ORC-332 | Add syntax version to orc_proto.proto |
| ORC-336 | Remove avro and parquet dependency management entries |
| ORC-360 | Implement error checking on subtype fields in Java |

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#21372 from dongjoon-hyun/SPARK_ORC144.
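The one-line fix in the two Java readers below follows from the column-vector layout the PR references: `time` already holds milliseconds since epoch (including the millisecond fraction), while `nanos` holds the full nanosecond-of-second, so the old conversion counted the millisecond fraction twice. A minimal standalone sketch of the arithmetic (class and method names are illustrative, not Spark's):

```java
// Illustrative sketch of the micros-since-epoch conversion fixed by this PR.
// Assumption (from the TimestampColumnVector definition linked above):
//   timeMillis = milliseconds since epoch, including the millisecond fraction
//   nanos      = full nanosecond-of-second, which repeats that fraction
public class TimestampMicros {

    // Old formula: nanos / 1000 yields all microseconds of the second,
    // so the millisecond part already in timeMillis is added twice.
    static long oldMicros(long timeMillis, int nanos) {
        return timeMillis * 1000L + nanos / 1000L;
    }

    // Fixed formula: "% 1000" keeps only the sub-millisecond microseconds.
    static long newMicros(long timeMillis, int nanos) {
        return timeMillis * 1000L + nanos / 1000L % 1000L;
    }

    public static void main(String[] args) {
        long timeMillis = 56_123L;  // 56.123 s, millisecond fraction = 123
        int nanos = 123_456_000;    // 0.123456 s as nanosecond-of-second
        System.out.println(oldMicros(timeMillis, nanos)); // 56246456 (123 ms double-counted)
        System.out.println(newMicros(timeMillis, nanos)); // 56123456 (correct)
    }
}
```

For a sub-millisecond-only fraction such as `.000789` the two formulas agree, which is why the reported symptom surfaces through the writer's pre-epoch `java.sql.Timestamp` workaround (ORC-306) rather than for every value.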
1 parent 20ed1f0 commit 057ccb1

6 files changed: 16 additions & 8 deletions


dev/deps/spark-deps-hadoop-2.6 (2 additions & 2 deletions)

```diff
@@ -156,8 +156,8 @@ objenesis-2.1.jar
 okhttp-3.8.1.jar
 okio-1.13.0.jar
 opencsv-2.3.jar
-orc-core-1.4.3-nohive.jar
-orc-mapreduce-1.4.3-nohive.jar
+orc-core-1.4.4-nohive.jar
+orc-mapreduce-1.4.4-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
```

dev/deps/spark-deps-hadoop-2.7 (2 additions & 2 deletions)

```diff
@@ -157,8 +157,8 @@ objenesis-2.1.jar
 okhttp-3.8.1.jar
 okio-1.13.0.jar
 opencsv-2.3.jar
-orc-core-1.4.3-nohive.jar
-orc-mapreduce-1.4.3-nohive.jar
+orc-core-1.4.4-nohive.jar
+orc-mapreduce-1.4.4-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
```

pom.xml (1 addition & 2 deletions)

```diff
@@ -130,8 +130,7 @@
     <!-- Version used for internal directory structure -->
     <hive.version.short>1.2.1</hive.version.short>
     <derby.version>10.12.1.1</derby.version>
-    <parquet.version>1.8.2</parquet.version>
-    <orc.version>1.4.3</orc.version>
+    <orc.version>1.4.4</orc.version>
     <orc.classifier>nohive</orc.classifier>
     <hive.parquet.version>1.6.0</hive.parquet.version>
     <jetty.version>9.3.20.v20170531</jetty.version>
```

sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnVector.java (1 addition & 1 deletion)

```diff
@@ -136,7 +136,7 @@ public int getInt(int rowId) {
   public long getLong(int rowId) {
     int index = getRowIndex(rowId);
     if (isTimestamp) {
-      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000;
+      return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
    } else {
      return longData.vector[index];
    }
```
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java (1 addition & 1 deletion)

```diff
@@ -495,7 +495,7 @@ private void putValues(
    * Returns the number of micros since epoch from an element of TimestampColumnVector.
    */
   private static long fromTimestampColumnVector(TimestampColumnVector vector, int index) {
-    return vector.time[index] * 1000L + vector.nanos[index] / 1000L;
+    return vector.time[index] * 1000 + (vector.nanos[index] / 1000 % 1000);
   }

   /**
```

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (9 additions & 0 deletions)

```diff
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.execution.datasources.orc

 import java.io.File
+import java.sql.Timestamp
 import java.util.Locale

 import org.apache.orc.OrcConf.COMPRESS
@@ -169,6 +170,14 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
       }
     }
   }
+
+  test("SPARK-24322 Fix incorrect workaround for bug in java.sql.Timestamp") {
+    withTempPath { path =>
+      val ts = Timestamp.valueOf("1900-05-05 12:34:56.000789")
+      Seq(ts).toDF.write.orc(path.getCanonicalPath)
+      checkAnswer(spark.read.orc(path.getCanonicalPath), Row(ts))
+    }
+  }
 }

 class OrcSourceSuite extends OrcSuite with SharedSQLContext {
```
