[HUDI-6226] Support parquet native bloom filters #8716
Conversation
|
@danny0405 could you please review this? Unfortunately my IDE still does not work with Scala, sorry for the poor formatting on the tests. |
FSUtils.registerFileSystem(file, parquetConfig.getHadoopConf()));
ParquetWriter.Builder parquetWriterbuilder = new ParquetWriter.Builder(HoodieWrapperFileSystem.convertToHoodiePath(file, parquetConfig.getHadoopConf())) {
  @Override
  protected ParquetWriter.Builder self() {
Can you show me the code for how Spark adapts to the Parquet SSBF support?
Sorry, I have not fully understood what you want. What is Parquet SSBF? Do you mean how Spark itself handles parquet blooms?
Yes, can you show me that code snippet?
Please see my notes on the lineage in Spark: it basically uses another parquet class to write, which gets the hadoop conf directly (i.e. ParquetOutputFormat), while Hudi uses ParquetWriter directly (a sketch of the relevant conf keys follows the list):
- ParquetOutputWriter in Spark asks for parquet's ParquetOutputFormat, which gets the bloom configs
- ParquetUtils in Spark has a prepareWrite function, which propagates to ParquetOutputWriter
- ParquetWrite in Spark has a prepareWrite function, which propagates to ParquetUtils.prepareWrite
- ParquetTable in Spark uses ParquetWrite
- ParquetDatasourceV2 in Spark uses ParquetTable in getTable (then for read and write)
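For illustration, a minimal sketch of the kind of settings ParquetOutputFormat picks up from the Hadoop conf; the keys are the standard parquet-hadoop bloom filter properties, and the column name "uuid" is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

public class BloomConfExample {
  public static void main(String[] args) {
    // Hypothetical column "uuid": per-column keys use the "<property>#<column>" form
    // that parquet-hadoop's ParquetOutputFormat parses when building its writer.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("parquet.bloom.filter.enabled#uuid", "true");          // write a bloom filter for column "uuid"
    hadoopConf.set("parquet.bloom.filter.expected.ndv#uuid", "1000000");  // expected distinct values, used to size the filter
  }
}
```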
Thanks. Another question: can Delta gain benefits from these bloom filters, and how much regression is there on the writer path when enabling the BloomFilter for parquet?
"Delta" do you mean MOR logs files configured in parquet format ? If you mean delta-lake then yes, as well as iceberg they likely rely on spark writer so benefit from bloom.
By regression do you mean performance regression ? Each column configured with bloom will introduce overhead at write time, but faster subsequent reads with predicates on the column. I haven't benchmarked that but I can say blooms will faster reads significantly by skipping lot of parquet scan that hudi stats index won't cover. ie uuids, strings, high cardinality dictionaries
BTW in this PR up to the user to enable blooms so no regression is expected
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieParquetBloom.scala
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java
|
OK understood. Thx
On May 24, 2023 6:23:42 AM UTC, Danny Chan (@danny0405) commented on this pull request:
> @@ -67,6 +89,22 @@ public HoodieBaseParquetWriter(Path file,
this.recordCountForNextSizeCheck = DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK;
}
+ protected void handleParquetBloomFilters(ParquetWriter.Builder parquetWriterbuilder, Configuration hadoopConf) {
+ parquetWriterbuilder.withBloomFilterEnabled(ParquetOutputFormat.getBloomFilterEnabled(hadoopConf));
+ // inspired from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L458-L464
+ hadoopConf.forEach(conf -> {
+ String key = conf.getKey();
+ if (key.startsWith(BLOOM_FILTER_ENABLED)) {
+ String column = key.substring(BLOOM_FILTER_ENABLED.length() + 1, key.length());
+ parquetWriterbuilder.withBloomFilterEnabled(column, Boolean.valueOf(conf.getValue()));
The writer should be shared by all the engines, not just Spark, so it is necessary to add a basic UT for it.
|
|
For reference, the Delta bloom filter documentation: https://docs.databricks.com/optimizations/bloom-filters.html
|
|
Thanks for sharing. I think the Databricks BloomFilter index mainly serves query optimization purposes, right? Do they also use it to accelerate data skipping during data ingestion, aka the UPSERTS? |
|
No, AFAIK both iceberg and delta use blooms the same way the current PR does: read-time parquet push-down.
Just to mention, spark also provides a [join based on bloom](apache/spark#35789) since 3.3, but in this case it is an on-the-fly built bloom filter, not the parquet bloom filter.
|
|
@parisni Thanks for the clarification, can you supplement the tests and rebase with the latest master code, then re-trigger the tests again? |
|
@danny0405 done for UT + rebase, let me know |
Still seeing many compile failures, can you take a look again ~ |
@danny0405 This is a non-trivial compilation problem. The bloom stuff only works with parquet >= 1.12, so any engine shipping a lower version won't compile. So far I have been working with Spark 3.2, which handles blooms, and did not encounter this. Example:
|
Force-pushed from 85df05e to 0c2ed7d
I checked: Spark 3.0 and Spark 2.4 still use parquet 1.10.1, newer Spark 3.x releases use parquet 1.12.2, and Flink uses parquet 1.12.2; let's try the first approach first. |
Weird, since all Flink-based builds were failing the same way as Spark 2.4. |
|
@danny0405 @yihua can you please guide me on how to implement that parquet writer for each engine? I am comfortable with the reflection way, which is fairly simple, but dealing with every engine is non-trivial since currently the parquet builder is a common class. |
One way is to make the parquet writer abstract and add an interface like |
I have no idea how to inject the proper impls in each client, since so far each client uses the common parquet writer. This looks like a huge refactoring. |
|
We can restart the task when Hudi has dropped spark 2.4, after which the parquet version can be upgraded. |
|
What's the timeline for dropping 2.4?
Also, we could use the reflection way and drop it when parquet gets updated (see the sketch after this list). There are several advantages to this:
1. the feature is important to fill the gap soon with other ACID tables
2. we remove here the use of deprecated parquet constructors
3. the patch is tiny
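For what it's worth, a minimal sketch of what the reflection approach could look like (an assumption of the shape, not the actual code in this PR): ParquetWriter.Builder#withBloomFilterEnabled only exists in parquet >= 1.12, so it is looked up and invoked reflectively and skipped on older parquet.

```java
import java.lang.reflect.Method;

import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetBloomReflectionUtil {

  // Invoke ParquetWriter.Builder#withBloomFilterEnabled(String, boolean) reflectively so the
  // code still compiles and runs against parquet < 1.12 (e.g. 1.10.1 shipped with Spark 2.4),
  // where the method does not exist; in that case the bloom filter is simply not written.
  public static void enableBloomFilterIfAvailable(ParquetWriter.Builder<?, ?> builder,
                                                  String column, boolean enabled) {
    try {
      Method method = builder.getClass()
          .getMethod("withBloomFilterEnabled", String.class, boolean.class);
      method.invoke(builder, column, enabled);
    } catch (NoSuchMethodException e) {
      // Older parquet: native bloom filters are not supported, skip silently.
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Could not enable parquet bloom filter for column " + column, e);
    }
  }
}
```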
|
Maybe the 1.0 release; the only reason we have a very old parquet version is Spark 2.4. |
|
ok let me fix the build with reflection and then decide |
|
@danny0405 build and tests are passing now. I propose to bring this to 0.14 and, when Spark 2.4 is dropped, clean this up. I would be glad to write the documentation on how to set up parquet bloom (the same way as parquet modular encryption). |
|
I saw the unit test for the new writer was dropped, was it because of the compatibility issue? |
Yes. In order to check for the bloom filter's existence we relied on a newer version of parquet tools, which breaks the build. By the way, the other test already makes sure of the bloom filter's existence. I might be able to restore the test now and work around it again with reflection, or I could restore it when Spark 2.4 support gets dropped. |
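For reference, a rough sketch of how one might assert the bloom filter's existence with the parquet-hadoop 1.12+ API directly, instead of parquet tools (this is my assumption of an alternative, not the dropped test; the column name passed in is arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.BloomFilterReader;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class BloomExistenceCheck {

  // Returns true if a native bloom filter was written for the given column
  // in at least one row group of the parquet file.
  public static boolean hasBloomFilter(Path file, Configuration conf, String columnName) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        BloomFilterReader bloomReader = reader.getBloomFilterDataReader(block);
        for (ColumnChunkMetaData column : block.getColumns()) {
          if (column.getPath().toDotString().equals(columnName)
              && bloomReader.readBloomFilter(column) != null) {
            return true;
          }
        }
      }
    }
    return false;
  }
}
```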
|
hey @parisni: good job on the patch. Curious to know if you have any perf numbers on this, on both the write and read side. What's the perf overhead we are seeing on the write side, and how much improvement are we seeing on the read side with the bloom filter? Also, would you provide a short write-up (what this support is all about, how users can leverage it and what the benefit is) that we can use on our release page? |
|
@nsivabalan
There are existing Spark benchmarks here: basically 20% slower for writes and up to 4x faster for reads. https://github.com/apache/spark/blob/18d0a276c501a102af3e7ed251831983b9148a4f/sql/core/benchmarks/BloomFilterBenchmark-jdk11-results.txt
As for documentation, please consider this PR: #9056
|
Change Logs
Provides support for parquet native bloom filters to improve read performance. Fixes #7117.
Impact
All hudi operations (bulk_insert, insert, upsert, ...) will be able to write parquet bloom filters if configured within the application hadoop conf. The hudi reader already leverages the blooms if the predicates match the columns configured at write time (a configuration sketch follows below).
These bloom filters only work for COW tables, since they need parquet base files; they are unlikely to work with MOR log files even when those are configured to write parquet format instead of avro.
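As an illustration only (column name, table path and data are hypothetical; the Hudi write options shown are just the usual Spark datasource ones), enabling a bloom filter on a high-cardinality column before a Hudi write could look like:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiBloomWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-parquet-bloom-example")
        .master("local[1]")
        .getOrCreate();

    // Standard parquet-hadoop keys set on the application hadoop conf;
    // "uuid" is a hypothetical high-cardinality column of the table.
    spark.sparkContext().hadoopConfiguration()
        .set("parquet.bloom.filter.enabled#uuid", "true");
    spark.sparkContext().hadoopConfiguration()
        .set("parquet.bloom.filter.expected.ndv#uuid", "1000000");

    Dataset<Row> df = spark.sql("select uuid() as uuid, 1 as value");
    df.write().format("hudi")
        .option("hoodie.table.name", "bloom_example")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "value")
        .mode("overwrite")
        .save("/tmp/hudi_bloom_table");
  }
}
```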
Risk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
The docs will get a new guide, in the same manner as the current encryption page.
Contributor's checklist