Conversation


@MaxGekk MaxGekk commented Jul 17, 2020

What changes were proposed in this pull request?

In the PR, I propose to support pushed-down filters in the Avro datasource, for both V1 and V2.

  1. Added a new SQL config, spark.sql.avro.filterPushdown.enabled, to control filter pushdown to the Avro datasource. It is on by default.
  2. Renamed CSVFilters to OrderedFilters.
  3. OrderedFilters is used in AvroFileFormat (DSv1) and in AvroPartitionReaderFactory (DSv2).
  4. Modified AvroDeserializer so that the deserialize method returns None when the pushed-down filters reject the row.
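
The idea behind items 3 and 4 can be sketched in plain Python (this is an illustration only, not Spark's actual Scala implementation; all names here are hypothetical): predicates are applied in order while a record is converted, and the deserializer returns None as soon as any predicate rejects the row, so the remaining fields are never converted.

```python
# Illustrative sketch of ordered filter evaluation during deserialization.
# Not Spark code: names and structure are invented for this example.

def make_deserializer(predicates):
    """predicates: list of (field_name, unary_predicate) pairs,
    assumed pre-ordered so cheap/selective predicates come first."""
    def deserialize(record):
        for field, pred in predicates:
            if not pred(record.get(field)):
                return None  # row rejected: skip converting remaining fields
        return dict(record)  # full conversion of the surviving row
    return deserialize

deser = make_deserializer([("value", lambda v: v is not None and v > 0)])
print(deser({"value": 5, "name": "a"}))   # row kept
print(deser({"value": -1, "name": "b"}))  # None: filtered out early
```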

Why are the changes needed?

The changes improve performance on synthetic benchmarks by up to 2x on JDK 11:

```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                        9614           9669          54          0.1        9614.1       1.0X
pushdown disabled                                 10077          10141          66          0.1       10077.2       1.0X
w/ filters                                         4681           4713          29          0.2        4681.5       2.1X
```
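
For readers unfamiliar with the benchmark harness output: the Relative column appears to be the baseline's best time divided by each variant's best time. A quick sanity check of the figures above:

```python
# Relative = baseline best time / variant best time (times in ms, from the table)
baseline = 9614           # "w/o filters"
disabled = 10077          # "pushdown disabled"
with_filters = 4681       # "w/ filters"

print(round(baseline / with_filters, 1))  # 2.1 -> the reported 2.1X speedup
print(round(baseline / disabled, 1))      # 1.0 -> pushdown disabled is a wash
```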

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added UT to AvroCatalystDataConversionSuite and AvroSuite
  • Re-running AvroReadBenchmark using Amazon EC2:
Region: us-west-2 (Oregon)
Instance: r3.xlarge (spot instance)
AMI: ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1)
Java: OpenJDK 8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa && sudo apt install openjdk-11-jdk`

and ./dev/run-benchmarks:

```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
  ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```


SparkQA commented Jul 17, 2020

Test build #126053 has finished for PR 29145 at commit c010a4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126063 has finished for PR 29145 at commit ab17bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126072 has finished for PR 29145 at commit 0ea0ef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 17, 2020

Test build #126074 has finished for PR 29145 at commit 5c4c177.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 18, 2020

Test build #126114 has finished for PR 29145 at commit 72022bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 19, 2020

Test build #126115 has finished for PR 29145 at commit 072eab0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-32346][SQL] Support filters pushdown in Avro datasource [SPARK-32346][SQL] Support filters pushdown in Avro datasource Jul 19, 2020

SparkQA commented Jul 19, 2020

Test build #126128 has finished for PR 29145 at commit 668e497.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


MaxGekk commented Jul 19, 2020

@gengliangwang @dongjoon-hyun @HyukjinKwon @cloud-fan Please, take a look at this PR.

@dongjoon-hyun

Thank you for pinging me, @MaxGekk .


MaxGekk commented Jul 20, 2020

@dongjoon-hyun I am looking forward to your review comments.


SparkQA commented Jul 21, 2020

Test build #126199 has finished for PR 29145 at commit 74f8f1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 21, 2020

Test build #126255 has finished for PR 29145 at commit 438e634.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader

@HyukjinKwon

retest this please


SparkQA commented Jul 22, 2020

Test build #126303 has finished for PR 29145 at commit 438e634.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader


MaxGekk commented Jul 22, 2020

jenkins, retest this, please


SparkQA commented Jul 22, 2020

Test build #126310 has finished for PR 29145 at commit 438e634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait RowReader


SparkQA commented Jul 22, 2020

Test build #126339 has finished for PR 29145 at commit f6b2a9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon pushed a commit that referenced this pull request Jul 24, 2020
…checking out

### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`
2. Checking the config outside of `JsonFilters` makes it consistent with `OrderedFilters` (see #29145).
3. `JsonFilters` can be used by other datasources in the future and should not depend on the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```
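
The two refactoring points from the commit above can be sketched in plain Python (hypothetical names, not Spark's Scala code): the filter class validates its `index` argument with an assert, while the parser, not the filter class, decides whether pushdown is enabled at all.

```python
# Minimal sketch of the JsonFilters refactoring idea. All names are
# invented for illustration; this is not Spark's implementation.

class JsonFiltersSketch:
    def __init__(self, predicates, num_fields):
        # predicates: list of (field_index, predicate(row) -> bool) pairs
        self.predicates = predicates
        self.num_fields = num_fields

    def skip_row(self, row, index):
        # Change 1: catch incorrect usage early with an assert on `index`.
        assert 0 <= index < self.num_fields, f"index {index} out of range"
        return any(idx == index and not p(row) for idx, p in self.predicates)

class ParserSketch:
    def __init__(self, filters, pushdown_enabled):
        # Change 2: the config check lives in the parser, not in the filters,
        # so the filter class stays datasource-agnostic.
        self.filters = filters if pushdown_enabled else None

    def parse(self, row):
        if self.filters is not None:
            for i in range(self.filters.num_fields):
                if self.filters.skip_row(row, i):
                    return None  # row rejected by a pushed-down filter
        return row

filters = JsonFiltersSketch([(0, lambda row: row[0] > 0)], num_fields=2)
parser = ParserSketch(filters, pushdown_enabled=True)
print(parser.parse([5, "x"]))   # row survives the filter
print(parser.parse([-1, "x"]))  # None: skipped by the pushed filter
```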

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

MaxGekk commented Jul 27, 2020

@cloud-fan Please review this PR.


MaxGekk commented Jul 29, 2020

@gengliangwang Are you ok with this PR?


@gengliangwang gengliangwang left a comment


LGTM except for one comment

@gengliangwang

Thanks, merging to master

