Skip to content

Commit d65f534

Browse files
yaooqinncloud-fan
authored andcommitted
[SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
### What changes were proposed in this pull request? With benchmark original, where the timestamp values are valid to the new parser the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5781 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 44764 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 93764 ms [info] Running case: from_json(timestamp) [info] Stopped after 3 iterations, 59021 ms ``` When we modify the benchmark to ```scala def timestampStr: Dataset[String] = { spark.range(0, rowsNum, 1, 1).mapPartitions { iter => iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""") }.select($"value".as("timestamp")).as[String] } readBench.addCase("timestamp strings", numIters) { _ => timestampStr.noop() } readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ => spark.read.schema(tsSchema).json(timestampStr).noop() } readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ => spark.read.json(timestampStr).noop() } ``` where the timestamp values are invalid for the new parser which causes a fallback to legacy parser(2.4). the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` About 10x perf-regression BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]` which make all timestamp values valid for the new parser to prohibit fallback, the result is ```scala [info] Running benchmark: Read dates and timestamps [info] Running case: timestamp strings [info] Stopped after 3 iterations, 5623 ms [info] Running case: parse timestamps from Dataset[String] [info] Stopped after 3 iterations, 506637 ms [info] Running case: infer timestamps from Dataset[String] [info] Stopped after 3 iterations, 509076 ms ``` ### Why are the changes needed? Fix performance regression. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? new tests added. Closes #28181 from yaooqinn/SPARK-31414. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
1 parent 1b87015 commit d65f534

10 files changed

Lines changed: 261 additions & 224 deletions

File tree

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, Unescape
2626
import org.apache.spark.internal.Logging
2727
import org.apache.spark.sql.catalyst.util._
2828
import org.apache.spark.sql.internal.SQLConf
29+
import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy
2930

3031
class CSVOptions(
3132
@transient val parameters: CaseInsensitiveMap[String],
@@ -148,8 +149,12 @@ class CSVOptions(
148149

149150
val dateFormat: String = parameters.getOrElse("dateFormat", DateFormatter.defaultPattern)
150151

151-
val timestampFormat: String =
152-
parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")
152+
val timestampFormat: String = parameters.getOrElse("timestampFormat",
153+
if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
154+
s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"
155+
} else {
156+
s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
157+
})
153158

154159
val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
155160

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@ import com.fasterxml.jackson.core.json.JsonReadFeature
2727
import org.apache.spark.internal.Logging
2828
import org.apache.spark.sql.catalyst.util._
2929
import org.apache.spark.sql.internal.SQLConf
30+
import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy
3031

3132
/**
3233
* Options for parsing JSON data into Spark SQL rows.
@@ -90,8 +91,12 @@ private[sql] class JSONOptions(
9091

9192
val dateFormat: String = parameters.getOrElse("dateFormat", DateFormatter.defaultPattern)
9293

93-
val timestampFormat: String =
94-
parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")
94+
val timestampFormat: String = parameters.getOrElse("timestampFormat",
95+
if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
96+
s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"
97+
} else {
98+
s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
99+
})
95100

96101
val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
97102

sql/core/benchmarks/CSVBenchmark-jdk11-results.txt

Lines changed: 44 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -2,66 +2,66 @@
22
Benchmark to measure CSV read/write performance
33
================================================================================================
44

5-
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
6-
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
5+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
6+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
77
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
88
------------------------------------------------------------------------------------------------------------------------
9-
One quoted string 44297 44515 373 0.0 885948.7 1.0X
9+
One quoted string 24907 29374 NaN 0.0 498130.5 1.0X
1010

11-
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
12-
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
11+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
12+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
1313
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
1414
------------------------------------------------------------------------------------------------------------------------
15-
Select 1000 columns 196720 197783 1560 0.0 196719.8 1.0X
16-
Select 100 columns 46691 46861 219 0.0 46691.4 4.2X
17-
Select one column 36811 36922 111 0.0 36811.3 5.3X
18-
count() 8520 8610 106 0.1 8520.5 23.1X
19-
Select 100 columns, one bad input field 67914 67994 136 0.0 67914.0 2.9X
20-
Select 100 columns, corrupt record field 77272 77445 214 0.0 77272.0 2.5X
15+
Select 1000 columns 62811 63690 1416 0.0 62811.4 1.0X
16+
Select 100 columns 23839 24064 230 0.0 23839.5 2.6X
17+
Select one column 19936 20641 827 0.1 19936.4 3.2X
18+
count() 4174 4380 206 0.2 4174.4 15.0X
19+
Select 100 columns, one bad input field 41015 42380 1688 0.0 41015.4 1.5X
20+
Select 100 columns, corrupt record field 46281 46338 93 0.0 46280.5 1.4X
2121

22-
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
23-
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
22+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
23+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
2424
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
2525
------------------------------------------------------------------------------------------------------------------------
26-
Select 10 columns + count() 25965 26054 103 0.4 2596.5 1.0X
27-
Select 1 column + count() 18591 18666 91 0.5 1859.1 1.4X
28-
count() 6102 6119 18 1.6 610.2 4.3X
26+
Select 10 columns + count() 10810 10997 163 0.9 1081.0 1.0X
27+
Select 1 column + count() 7608 7641 47 1.3 760.8 1.4X
28+
count() 2415 2462 77 4.1 241.5 4.5X
2929

30-
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
31-
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
30+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
31+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
3232
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
3333
------------------------------------------------------------------------------------------------------------------------
34-
Create a dataset of timestamps 2142 2161 17 4.7 214.2 1.0X
35-
to_csv(timestamp) 14744 14950 182 0.7 1474.4 0.1X
36-
write timestamps to files 12078 12202 175 0.8 1207.8 0.2X
37-
Create a dataset of dates 2275 2291 18 4.4 227.5 0.9X
38-
to_csv(date) 11407 11464 51 0.9 1140.7 0.2X
39-
write dates to files 7638 7702 90 1.3 763.8 0.3X
34+
Create a dataset of timestamps 874 914 37 11.4 87.4 1.0X
35+
to_csv(timestamp) 7051 7223 250 1.4 705.1 0.1X
36+
write timestamps to files 6712 6741 31 1.5 671.2 0.1X
37+
Create a dataset of dates 909 945 35 11.0 90.9 1.0X
38+
to_csv(date) 4222 4231 8 2.4 422.2 0.2X
39+
write dates to files 3799 3813 14 2.6 379.9 0.2X
4040

41-
OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
42-
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
41+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
42+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
4343
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
4444
------------------------------------------------------------------------------------------------------------------------
45-
read timestamp text from files 2578 2590 10 3.9 257.8 1.0X
46-
read timestamps from files 60103 60694 512 0.2 6010.3 0.0X
47-
infer timestamps from files 107871 108268 351 0.1 10787.1 0.0X
48-
read date text from files 2306 2310 4 4.3 230.6 1.1X
49-
read date from files 47415 47657 367 0.2 4741.5 0.1X
50-
infer date from files 35261 35447 164 0.3 3526.1 0.1X
51-
timestamp strings 3045 3056 11 3.3 304.5 0.8X
52-
parse timestamps from Dataset[String] 62221 63173 849 0.2 6222.1 0.0X
53-
infer timestamps from Dataset[String] 118838 119629 697 0.1 11883.8 0.0X
54-
date strings 3459 3481 19 2.9 345.9 0.7X
55-
parse dates from Dataset[String] 51026 51447 503 0.2 5102.6 0.1X
56-
from_csv(timestamp) 60738 61818 936 0.2 6073.8 0.0X
57-
from_csv(date) 46012 46278 370 0.2 4601.2 0.1X
45+
read timestamp text from files 1342 1364 35 7.5 134.2 1.0X
46+
read timestamps from files 20300 20473 247 0.5 2030.0 0.1X
47+
infer timestamps from files 40705 40744 54 0.2 4070.5 0.0X
48+
read date text from files 1146 1151 6 8.7 114.6 1.2X
49+
read date from files 12278 12408 117 0.8 1227.8 0.1X
50+
infer date from files 12734 12872 220 0.8 1273.4 0.1X
51+
timestamp strings 1467 1482 15 6.8 146.7 0.9X
52+
parse timestamps from Dataset[String] 21708 22234 477 0.5 2170.8 0.1X
53+
infer timestamps from Dataset[String] 42357 43253 922 0.2 4235.7 0.0X
54+
date strings 1512 1532 18 6.6 151.2 0.9X
55+
parse dates from Dataset[String] 13436 13470 33 0.7 1343.6 0.1X
56+
from_csv(timestamp) 20390 20486 95 0.5 2039.0 0.1X
57+
from_csv(date) 12592 12693 139 0.8 1259.2 0.1X
5858

59-
OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
60-
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
59+
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
60+
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
6161
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
6262
------------------------------------------------------------------------------------------------------------------------
63-
w/o filters 11889 11945 52 0.0 118893.1 1.0X
64-
pushdown disabled 11790 11860 115 0.0 117902.3 1.0X
65-
w/ filters 1240 1278 33 0.1 12400.8 9.6X
63+
w/o filters 12535 12606 67 0.0 125348.8 1.0X
64+
pushdown disabled 12611 12672 91 0.0 126112.9 1.0X
65+
w/ filters 1093 1099 11 0.1 10928.3 11.5X
6666

6767

0 commit comments

Comments
 (0)