
Conversation

@parisni
Contributor

@parisni parisni commented May 15, 2023

Change Logs

Provides support for parquet native bloom filters to improve read performance. Fixes #7117

Impact

All Hudi operations (bulk_insert, insert, upsert, ...) will be able to write parquet bloom filters if they are configured in the application's Hadoop conf. The Hudi reader already leverages the bloom filters when query predicates match the columns configured at write time.
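For illustration, a minimal sketch of what such Hadoop configuration might look like, using the column-level bloom filter property keys from parquet-mr 1.12 (the column name `user_id` and the NDV value are hypothetical examples, not from this PR):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: enable a parquet native bloom filter for one column via the
// application's Hadoop conf. Property keys come from parquet-mr 1.12;
// the "#column" suffix scopes a setting to a single column.
Configuration conf = new Configuration();
// Enable a bloom filter for the (hypothetical) column "user_id".
conf.set("parquet.bloom.filter.enabled#user_id", "true");
// Hint the expected number of distinct values so the filter is sized appropriately.
conf.set("parquet.bloom.filter.expected.ndv#user_id", "5000000");
```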

Parquet bloom filters only work for COW tables, since they need parquet base files; they are unlikely to work with MOR log files, even when those are configured to be written in parquet format instead of Avro.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

A new guide will be added to the documentation, in the same manner as the current encryption page.

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@parisni parisni changed the title [WIP] Support parquet native bloom filters [HUDI-6226] Support parquet native bloom filters May 16, 2023
@parisni
Contributor Author

parisni commented May 16, 2023

@danny0405 could you please review this? Unfortunately my IDE still does not work with Scala, sorry for the poor formatting of the tests.

FSUtils.registerFileSystem(file, parquetConfig.getHadoopConf()));
ParquetWriter.Builder parquetWriterbuilder =
    new ParquetWriter.Builder(HoodieWrapperFileSystem.convertToHoodiePath(file, parquetConfig.getHadoopConf())) {
      @Override
      protected ParquetWriter.Builder self() {
Contributor

Can you show me the code for how the Spark adapter handles the Parquet SSBF support?

Contributor Author

Sorry, I haven't fully understood what you want. What is Parquet SSBF? Do you mean how Spark itself handles parquet blooms?

Contributor

Yes, can you show me that code snippet?

Contributor Author

@parisni parisni May 17, 2023

Please see my notes for the lineage on Spark: it basically uses another parquet class to write (i.e. ParquetOutputFormat), which gets the Hadoop conf directly, while Hudi uses ParquetWriter directly.

Contributor

Thanks. Another question: can Delta also gain benefits from these bloom filters, and how much regression is there on the writer path when enabling the BloomFilter for parquet?

Contributor Author

@parisni parisni May 18, 2023

By "Delta" do you mean MOR log files configured in parquet format? If you mean Delta Lake, then yes; like Iceberg, it likely relies on the Spark writer and so benefits from blooms.

By regression do you mean performance regression? Each column configured with a bloom filter will introduce overhead at write time, in exchange for faster subsequent reads with predicates on that column. I haven't benchmarked it, but I can say blooms should speed up reads significantly by skipping a lot of parquet scans that the Hudi stats index won't cover, i.e. UUIDs, strings, high-cardinality dictionaries.

BTW, in this PR it is up to the user to enable blooms, so no regression is expected by default.

@danny0405 danny0405 self-assigned this May 18, 2023
@danny0405 danny0405 added writer-core priority:medium Moderate impact; usability gaps labels May 18, 2023
@parisni
Contributor Author

parisni commented May 24, 2023 via email

@parisni
Contributor Author

parisni commented May 25, 2023 via email

@danny0405
Contributor

danny0405 commented May 26, 2023

Thanks for sharing. I think the Databricks BloomFilter index mainly serves query optimization purposes, right? Do they also use it to accelerate data skipping during data ingestion, aka UPSERTs?

@parisni
Contributor Author

parisni commented May 26, 2023 via email

@danny0405
Contributor

@parisni Thanks for the clarification, can you supplement the tests and rebase with the latest master code, then re-trigger the tests again?

@parisni
Contributor Author

parisni commented Jun 5, 2023

@danny0405 done for UT + rebase, let me know

@danny0405
Contributor

@danny0405 done for UT + rebase, let me know

Still seeing many compile failures, can you take a look again?

@parisni
Contributor Author

parisni commented Jun 6, 2023

Still seeing many compile failures, can you take a look again?

@danny0405 This is a non-trivial compilation problem. The bloom filter support only works with parquet >= 1.12, so any engine shipping a lower version won't compile. So far I have been working with Spark 3.2, which handles blooms, and did not encounter this. Example:

[ERROR]   symbol:   method withBloomFilterEnabled(boolean)
  1. One way to fix this is to create an implementation of the writer for each engine version, depending on its parquet support.
  2. A second way (ugly but cheap) would be to use reflection to call the method if it exists. I gave it a try; I will have to move the existing tests into a module which supports parquet 1.12. But before that, WDYT of this approach?
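The reflection approach (option 2) could look roughly like the following self-contained sketch, which uses only the JDK. It is demonstrated here on `StringBuilder` rather than the real `ParquetWriter.Builder`, and the helper name `invokeIfExists` is hypothetical, not code from this PR:

```java
import java.lang.reflect.Method;

public class ReflectiveCall {
    // Invoke a method by name if the target's class declares it; otherwise do
    // nothing. This is the idea behind compiling against parquet < 1.12 while
    // still calling withBloomFilterEnabled(...) when parquet >= 1.12 is present.
    static boolean invokeIfExists(Object target, String name, Class<?> argType, Object arg) {
        try {
            Method m = target.getClass().getMethod(name, argType);
            m.invoke(target, arg);
            return true;
        } catch (NoSuchMethodException e) {
            return false; // method absent, e.g. an older parquet on the classpath
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Demonstrated on StringBuilder: append(String) exists, frobnicate does not.
        StringBuilder sb = new StringBuilder();
        System.out.println(invokeIfExists(sb, "append", String.class, "bloom")); // true
        System.out.println(invokeIfExists(sb, "frobnicate", String.class, "x")); // false
        System.out.println(sb); // bloom
    }
}
```

The trade-off matches the discussion below: no per-engine class hierarchy is needed, but method lookups are string-based and therefore invisible to the compiler.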

@parisni parisni force-pushed the parquet-bloom-suport branch from 85df05e to 0c2ed7d Compare June 6, 2023 21:08
@danny0405
Contributor

  1. One way to fix this is to create an implementation of the writer for each engine version, depending on its parquet support

I checked that Spark 3.0 and Spark 2.4 still use parquet 1.10.1, later Spark releases use parquet 1.12.2, and Flink uses parquet 1.12.2; let's try the first approach first.

@parisni
Contributor Author

parisni commented Jun 7, 2023

Flink uses parquet 1.12.2

Weird, since all Flink-based builds were failing the same way as Spark 2.4.

@parisni
Contributor Author

parisni commented Jun 10, 2023

@danny0405 @yihua can you please guide me on how to implement that parquet writer for each engine? I am comfortable with the reflection way, which is fairly simple, but dealing with every engine is non-trivial since the parquet builder is currently a common class.

@danny0405
Contributor

can you please guide me on how to implement that parquet writer for each engine ?

One way is to make the parquet writer abstract and add an interface like supportsBloomFilter for the different engines. We can then remove the different impls when migrating all engines to parquet 1.12.x.
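Sketched concretely, the suggested shape might look like the following self-contained example. All class names here are hypothetical illustrations, not the actual Hudi classes:

```java
// Hypothetical sketch of the suggested design: an abstract base writer with a
// per-engine capability flag, removable once all engines ship parquet 1.12.x.
abstract class BaseParquetWriterSketch {
    // Engines bundled with parquet >= 1.12 override this to return true.
    protected boolean supportsBloomFilter() {
        return false;
    }

    String describe() {
        return supportsBloomFilter()
                ? "builder.withBloomFilterEnabled(...) would be called here"
                : "bloom filter configuration skipped (parquet < 1.12)";
    }
}

// Engine on a modern parquet version, e.g. a Spark 3.2 bundle.
class ModernEngineWriter extends BaseParquetWriterSketch {
    @Override
    protected boolean supportsBloomFilter() {
        return true;
    }
}

// Engine pinned to an older parquet, e.g. a Spark 2.4 bundle.
class LegacyEngineWriter extends BaseParquetWriterSketch {
}

public class WriterSketchDemo {
    public static void main(String[] args) {
        System.out.println(new ModernEngineWriter().describe());
        System.out.println(new LegacyEngineWriter().describe());
    }
}
```

The design cost is one subclass per engine bundle, which is what the rest of the thread weighs against the reflection shortcut.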

@parisni
Contributor Author

parisni commented Jun 20, 2023

One way is to make the parquet writer abstract and add an interface like supportsBloomFilter for the different engines. We can then remove the different impls when migrating all engines to parquet 1.12.x.

I have no idea how to inject the proper impls into each client, since so far each client uses the common parquet writer. This looks like a huge refactoring.

@danny0405
Contributor

We can restart the task when Hudi has dropped spark 2.4, after which the parquet version can be upgraded.

@parisni
Contributor Author

parisni commented Jun 21, 2023 via email

@danny0405
Contributor

Also, we could use the reflection way and drop it when parquet gets updated.

Maybe in the 1.0 release; the only reason we have a very old parquet version is Spark 2.4.

@parisni
Contributor Author

parisni commented Jun 21, 2023

OK, let me fix the build with reflection and then decide.

@parisni parisni requested a review from danny0405 June 22, 2023 21:04
@hudi-bot
Collaborator

CI report:


@parisni
Contributor Author

parisni commented Jun 23, 2023

@danny0405 build and tests are passing now. I propose to bring this into 0.14 and clean it up when Spark 2.4 is dropped. I would be glad to write documentation on how to set up parquet blooms (the same way as parquet modular encryption).

@danny0405
Contributor

I saw the unit test for the new writer was dropped; was that because of the compatibility issue?

@parisni
Contributor Author

parisni commented Jun 25, 2023

I saw the unit test for the new writer was dropped; was that because of the compatibility issue?

Yes. In order to check for the bloom filter's existence we relied on a newer version of parquet-tools, which breaks the build.

By the way, the other test already verifies the bloom filter's existence.

I might be able to restore the test now and trick it again with reflection, or I could restore it when Spark 2.4 support gets dropped.

@danny0405 danny0405 merged commit e038901 into apache:master Jun 26, 2023
@nsivabalan
Contributor

Hey @parisni: good job on the patch. Curious to know if you have any perf numbers on this, on both the write and read side: what perf overhead are we seeing on the write side, and how much improvement are we seeing on the read side with the bloom filter?

Also, would you provide a short write-up (what this support is all about, how users can leverage it, and what the benefit is) that we can use on our release page?

@parisni
Contributor Author

parisni commented Jul 14, 2023 via email


Labels

priority:medium Moderate impact; usability gaps

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[SUPPORT] parquet bloom filters not supported by hudi

4 participants