Conversation


@MaxGekk MaxGekk commented Jul 17, 2020

What changes were proposed in this pull request?

In the PR, I propose to support pushed-down filters in the Avro datasource, for both V1 and V2.

  1. Added a new SQL config, spark.sql.avro.filterPushdown.enabled, to control filter pushdown to the Avro datasource. It is on by default.
  2. Renamed CSVFilters to OrderedFilters.
  3. OrderedFilters is used in AvroFileFormat (DSv1) and in AvroPartitionReaderFactory (DSv2).
  4. Modified AvroDeserializer so that the deserialize method returns None when the pushed-down filters reject the row.
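
The idea behind items 3 and 4 can be sketched in plain Python (this is an illustration only, not Spark's actual Scala implementation; all names here are hypothetical): predicates are applied in order while a record is converted, and the deserializer returns None as soon as any predicate rejects the row, so the remaining fields are never converted.

```python
# Illustrative sketch of ordered filter evaluation during deserialization.
# Not Spark code: names and structure are invented for this example.

def make_deserializer(predicates):
    """predicates: list of (field_name, unary_predicate) pairs,
    assumed pre-ordered so cheap/selective predicates come first."""
    def deserialize(record):
        for field, pred in predicates:
            if not pred(record.get(field)):
                return None  # row rejected: skip converting remaining fields
        return dict(record)  # full conversion of the surviving row
    return deserialize

deser = make_deserializer([("value", lambda v: v is not None and v > 0)])
print(deser({"value": 5, "name": "a"}))   # row kept
print(deser({"value": -1, "name": "b"}))  # None: filtered out early
```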

Why are the changes needed?

The changes improve performance on synthetic benchmarks by up to 2x on JDK 11:

```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                        9614           9669          54          0.1        9614.1       1.0X
pushdown disabled                                 10077          10141          66          0.1       10077.2       1.0X
w/ filters                                         4681           4713          29          0.2        4681.5       2.1X
```
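
For readers unfamiliar with the benchmark harness output: the Relative column appears to be the baseline's best time divided by each variant's best time. A quick sanity check of the figures above:

```python
# Relative = baseline best time / variant best time (times in ms, from the table)
baseline = 9614           # "w/o filters"
disabled = 10077          # "pushdown disabled"
with_filters = 4681       # "w/ filters"

print(round(baseline / with_filters, 1))  # 2.1 -> the reported 2.1X speedup
print(round(baseline / disabled, 1))      # 1.0 -> pushdown disabled is a wash
```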

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added UT to AvroCatalystDataConversionSuite and AvroSuite
  • Re-running AvroReadBenchmark using Amazon EC2:
Region: us-west-2 (Oregon)
Instance: r3.xlarge (spot instance)
AMI: ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1)
Java: OpenJDK 8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa && sudo apt install openjdk-11-jdk`

and ./dev/run-benchmarks:

```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
  ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```


SparkQA commented Jul 17, 2020

Test build #126053 has finished for PR 29145 at commit c010a4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126063 has finished for PR 29145 at commit ab17bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126072 has finished for PR 29145 at commit 0ea0ef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126074 has finished for PR 29145 at commit 5c4c177.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 18, 2020

Test build #126114 has finished for PR 29145 at commit 72022bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 19, 2020

Test build #126115 has finished for PR 29145 at commit 072eab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-32346][SQL] Support filters pushdown in Avro datasource [SPARK-32346][SQL] Support filters pushdown in Avro datasource Jul 19, 2020

SparkQA commented Jul 19, 2020

Test build #126128 has finished for PR 29145 at commit 668e497.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


MaxGekk commented Jul 19, 2020

@gengliangwang @dongjoon-hyun @HyukjinKwon @cloud-fan Please, take a look at this PR.

@dongjoon-hyun

Thank you for pinging me, @MaxGekk .


MaxGekk commented Jul 20, 2020

@dongjoon-hyun I am looking forward to your review comments.


SparkQA commented Jul 21, 2020

Test build #126199 has finished for PR 29145 at commit 74f8f1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 21, 2020

Test build #126255 has finished for PR 29145 at commit 438e634.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader

@HyukjinKwon

retest this please


SparkQA commented Jul 22, 2020

Test build #126303 has finished for PR 29145 at commit 438e634.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader


MaxGekk commented Jul 22, 2020

jenkins, retest this, please


SparkQA commented Jul 22, 2020

Test build #126310 has finished for PR 29145 at commit 438e634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader


SparkQA commented Jul 22, 2020

Test build #126339 has finished for PR 29145 at commit f6b2a9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Jul 24, 2020
…checking out

### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`
2. Checking the config outside of `JsonFilters` makes it consistent with `OrderedFilters` (see #29145).
3. `JsonFilters` can be used by other datasources in the future and should not depend on the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```
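
The two refactoring points from the commit above can be sketched in plain Python (hypothetical names, not Spark's Scala code): the filter class validates its `index` argument with an assert, while the parser, not the filter class, decides whether pushdown is enabled at all.

```python
# Minimal sketch of the JsonFilters refactoring idea. All names are
# invented for illustration; this is not Spark's implementation.

class JsonFiltersSketch:
    def __init__(self, predicates, num_fields):
        # predicates: list of (field_index, predicate(row) -> bool) pairs
        self.predicates = predicates
        self.num_fields = num_fields

    def skip_row(self, row, index):
        # Change 1: catch incorrect usage early with an assert on `index`.
        assert 0 <= index < self.num_fields, f"index {index} out of range"
        return any(idx == index and not p(row) for idx, p in self.predicates)

class ParserSketch:
    def __init__(self, filters, pushdown_enabled):
        # Change 2: the config check lives in the parser, not in the filters,
        # so the filter class stays datasource-agnostic.
        self.filters = filters if pushdown_enabled else None

    def parse(self, row):
        if self.filters is not None:
            for i in range(self.filters.num_fields):
                if self.filters.skip_row(row, i):
                    return None  # row rejected by a pushed-down filter
        return row

filters = JsonFiltersSketch([(0, lambda row: row[0] > 0)], num_fields=2)
parser = ParserSketch(filters, pushdown_enabled=True)
print(parser.parse([5, "x"]))   # row survives the filter
print(parser.parse([-1, "x"]))  # None: skipped by the pushed filter
```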

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

MaxGekk commented Jul 27, 2020

@cloud-fan Please review this PR.


MaxGekk commented Jul 29, 2020

@gengliangwang Are you ok with this PR?


@gengliangwang gengliangwang left a comment


LGTM except for one comment

@gengliangwang

Thanks, merging to master

