Skip to content

Commit 6a57616

Browse files
JiJiTangdbtsai
authored andcommitted
[SPARK-31364][SQL][TESTS] Benchmark Parquet Nested Field Predicate Pushdown
### What changes were proposed in this pull request? This PR aims to add a benchmark suite for nested predicate pushdown with parquet file: Performance comparison: Nested predicate pushdown disabled vs enabled, with the following queries scenarios: 1. When predicate pushed down, parquet reader are able to filter out all the row groups without loading them. 2. When predicate pushed down, parquet reader only loads one of the row groups. 3. When predicate pushed down, parquet reader can't filter out any row group in order to see if we introduce too much overhead or not when enabling nested predicate push down. ### Why are the changes needed? No benchmark exists today for nested fields predicate pushdown performance evaluation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Benchmark runs and reporting result. Closes #28319 from JiJiTang/SPARK-31364. Authored-by: Jian Tang <jian_tang@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>
1 parent 249b214 commit 6a57616

3 files changed

Lines changed: 155 additions & 0 deletions

File tree

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
OpenJDK 64-Bit Server VM 11.0.2+9 on Mac OS X 10.14.6
2+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
3+
Can skip all row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
4+
------------------------------------------------------------------------------------------------------------------------
5+
Without nested predicate Pushdown 34214 35752 NaN 3.1 326.3 1.0X
6+
With nested predicate Pushdown 86 102 11 1216.2 0.8 396.8X
7+
8+
OpenJDK 64-Bit Server VM 11.0.2+9 on Mac OS X 10.14.6
9+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
10+
Can skip some row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
11+
------------------------------------------------------------------------------------------------------------------------
12+
Without nested predicate Pushdown 34211 35162 843 3.1 326.3 1.0X
13+
With nested predicate Pushdown 3470 3514 36 30.2 33.1 9.9X
14+
15+
OpenJDK 64-Bit Server VM 11.0.2+9 on Mac OS X 10.14.6
16+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
17+
Can skip no row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
18+
------------------------------------------------------------------------------------------------------------------------
19+
Without nested predicate Pushdown 37533 37919 329 2.8 357.9 1.0X
20+
With nested predicate Pushdown 37876 39132 536 2.8 361.2 1.0X
21+
Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
OpenJDK 64-Bit Server VM 1.8.0_252-b09 on Mac OS X 10.14.6
2+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
3+
Can skip all row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
4+
------------------------------------------------------------------------------------------------------------------------
5+
Without nested predicate Pushdown 30687 31552 NaN 3.4 292.7 1.0X
6+
With nested predicate Pushdown 105 150 61 999.3 1.0 292.5X
7+
8+
OpenJDK 64-Bit Server VM 1.8.0_252-b09 on Mac OS X 10.14.6
9+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
10+
Can skip some row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
11+
------------------------------------------------------------------------------------------------------------------------
12+
Without nested predicate Pushdown 30505 31828 NaN 3.4 290.9 1.0X
13+
With nested predicate Pushdown 3156 3215 77 33.2 30.1 9.7X
14+
15+
OpenJDK 64-Bit Server VM 1.8.0_252-b09 on Mac OS X 10.14.6
16+
Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
17+
Can skip no row groups: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
18+
------------------------------------------------------------------------------------------------------------------------
19+
Without nested predicate Pushdown 34475 35302 NaN 3.0 328.8 1.0X
20+
With nested predicate Pushdown 34003 34596 567 3.1 324.3 1.0X
21+
Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
package org.apache.spark.sql.execution.benchmark
19+
20+
import org.apache.spark.SparkConf
21+
import org.apache.spark.benchmark.Benchmark
22+
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
23+
import org.apache.spark.sql.internal.SQLConf
24+
25+
/**
26+
* Synthetic benchmark for nested fields predicate push down performance for Parquet datasource.
27+
* To run this benchmark:
28+
* {{{
29+
* 1. without sbt:
30+
* bin/spark-submit --class <this class> --jars <spark core test jar> <sql core test jar>
31+
* 2. build/sbt "sql/test:runMain <this class>"
32+
* 3. generate result:
33+
* SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
34+
* Results will be written to "benchmarks/ParquetNestedPredicatePushDownBenchmark-results.txt".
35+
* }}}
36+
*/
37+
object ParquetNestedPredicatePushDownBenchmark extends SqlBasedBenchmark {
38+
39+
private val N = 100 * 1024 * 1024
40+
private val NUMBER_OF_ITER = 10
41+
42+
private val df: DataFrame = spark
43+
.range(1, N, 1, 4)
44+
.toDF("id")
45+
.selectExpr("id", "STRUCT(id x, STRUCT(CAST(id AS STRING) z) y) nested")
46+
.sort("id")
47+
48+
private def addCase(
49+
benchmark: Benchmark,
50+
inputPath: String,
51+
enableNestedPD: Boolean,
52+
name: String,
53+
withFilter: DataFrame => DataFrame): Unit = {
54+
val loadDF = spark.read.parquet(inputPath)
55+
benchmark.addCase(name) { _ =>
56+
withSQLConf((SQLConf.NESTED_PREDICATE_PUSHDOWN_ENABLED.key, enableNestedPD.toString)) {
57+
withFilter(loadDF).noop()
58+
}
59+
}
60+
}
61+
62+
private def createAndRunBenchmark(name: String, withFilter: DataFrame => DataFrame): Unit = {
63+
withTempPath { tempDir =>
64+
val outputPath = tempDir.getCanonicalPath
65+
df.write.mode(SaveMode.Overwrite).parquet(outputPath)
66+
val benchmark = new Benchmark(name, N, NUMBER_OF_ITER, output = output)
67+
addCase(
68+
benchmark,
69+
outputPath,
70+
enableNestedPD = false,
71+
"Without nested predicate Pushdown",
72+
withFilter)
73+
addCase(
74+
benchmark,
75+
outputPath,
76+
enableNestedPD = true,
77+
"With nested predicate Pushdown",
78+
withFilter)
79+
benchmark.run()
80+
}
81+
}
82+
83+
/**
84+
* Benchmark for sorted data with a filter which allows to filter out all the row groups
85+
* when nested fields predicate push down enabled
86+
*/
87+
def runLoadNoRowGroupWhenPredicatePushedDown(): Unit = {
88+
createAndRunBenchmark("Can skip all row groups", _.filter("nested.x < 0"))
89+
}
90+
91+
/**
92+
* Benchmark with a filter which allows to load only some row groups
93+
* when nested fields predicate push down enabled
94+
*/
95+
def runLoadSomeRowGroupWhenPredicatePushedDown(): Unit = {
96+
createAndRunBenchmark("Can skip some row groups", _.filter("nested.x = 100"))
97+
}
98+
99+
/**
100+
* Benchmark with a filter which still requires to
101+
* load all the row groups on sorted data to see if we introduce too much
102+
* overhead or not if enable nested predicate push down.
103+
*/
104+
def runLoadAllRowGroupsWhenPredicatePushedDown(): Unit = {
105+
createAndRunBenchmark("Can skip no row groups", _.filter(s"nested.x >= 0 and nested.x <= $N"))
106+
}
107+
108+
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
109+
runLoadNoRowGroupWhenPredicatePushedDown()
110+
runLoadSomeRowGroupWhenPredicatePushedDown()
111+
runLoadAllRowGroupsWhenPredicatePushedDown()
112+
}
113+
}

0 commit comments

Comments
 (0)