
Conversation

Contributor

@alexeykudinkin alexeykudinkin commented Jan 11, 2023

Change Logs

Recently, attempts to use the Metadata Table in the Bloom Index resulted in failures due to exhaustion of the S3 connection pool, no matter how (reasonably) large we set the pool size (we tested up to 3k connections).

This PR focuses on optimizing the Bloom Index lookup sequence in the case when it leverages the Bloom Filter partition of the Metadata Table (MT). The change is based on the following observations:

  • Increasing the size of the batch of requests to the MT amortizes the cost of processing them (the bigger the batch, the lower the per-request cost).

  • Having too few partitions in the Bloom Index path, however, starts to hurt parallelism when we actually probe individual files to check whether they contain the target keys. The solution is to split these two steps into separate stages with drastically different parallelism levels: constrain parallelism when reading from the MT (tens of tasks) and keep it at the current level for probing individual files (hundreds of tasks).

  • The current way of partitioning records (relying on Spark's default partitioner) meant that, with high likelihood, every Spark executor would open (and process) every file-group of the MT's Bloom Filter partition. To alleviate that, the same hashing algorithm used by the MT should be used to partition records into Spark's individual partitions, so that every task opens no more than one file-group of the MT's Bloom Filter partition.

To achieve that, the following changes to the Bloom Index sequence (when leveraging the MT) are implemented:

  1. Bloom Filter probing and actual file probing are split into 2 separate operations (so that the parallelism of each can be controlled individually)
  2. Requests to the MT are replaced with batch API calls
  3. A custom partitioner, AffineBloomIndexFileGroupPartitioner, is introduced that repartitions the dataset of file names with their corresponding record keys in a way that is affine with the MT Bloom Filters' partitioning (allowing us to open no more than a single file-group per Spark task); a sketch of the idea follows this list
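
For illustration only, here is a minimal sketch of the partitioner idea from item 3. This is not the actual AffineBloomIndexFileGroupPartitioner from the PR: the class name, the plain hashCode-based file-group mapping, and the slot-spreading scheme are assumptions made for the example; the real implementation reuses the MT's own hashing.

import org.apache.spark.Partitioner;

// Sketch only: routes every file name to a Spark partition that maps onto exactly one
// file-group of the MT's bloom_filters partition. The hashCode-based mapping below is a
// stand-in for the MT's real key-to-file-group hashing.
public class AffineFileGroupPartitionerSketch extends Partitioner {
  private final int fileGroupCount;     // file-groups in the MT bloom_filters partition
  private final int targetParallelism;  // assumed to be a multiple of fileGroupCount

  public AffineFileGroupPartitionerSketch(int fileGroupCount, int targetParallelism) {
    this.fileGroupCount = fileGroupCount;
    this.targetParallelism = targetParallelism;
  }

  @Override
  public int numPartitions() {
    return targetParallelism;
  }

  @Override
  public int getPartition(Object key) {
    String fileName = (String) key;
    int hash = fileName.hashCode() & Integer.MAX_VALUE;
    // Which MT file-group holds this file's bloom filter (stand-in hashing).
    int fileGroupIndex = hash % fileGroupCount;
    // Spread files belonging to the same file-group over a fixed number of slots, so
    // parallelism stays at targetParallelism while each task touches one file-group.
    int slotsPerFileGroup = targetParallelism / fileGroupCount;
    int slot = (hash / fileGroupCount) % slotsPerFileGroup;
    return fileGroupIndex * slotsPerFileGroup + slot;
  }
}

Because every Spark partition maps onto exactly one file-group, a task fetching bloom filters from the MT never has to open more than one file-group's files.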

Additionally, this PR addresses some low-hanging performance optimizations that considerably improve the Bloom Index lookup sequence, such as mapping file-comparison pairs into a PairRDD (where the key is the file name and the value is the record key) instead of an RDD, so that we can:

  1. Do in-partition sorting by file name (to make sure we check all records within a file at once) within each Spark partition instead of a global sort (reducing shuffling as well)
  2. Avoid re-shuffling (otherwise incurred by re-mapping from an RDD to a PairRDD later); see the sketch after this list
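
A minimal, self-contained sketch of how these two points combine, assuming the hypothetical AffineFileGroupPartitionerSketch shown earlier (the file names and keys are made up): a (file name -> record key) PairRDD is repartitioned once and sorted by file name within each partition, so no global sort or second shuffle is needed.

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AffineRepartitionExample {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "affine-repartition-sketch");
    // One (fileName, recordKey) pair per candidate file/key comparison.
    JavaPairRDD<String, String> fileToRecordKey = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("file-001.parquet", "key-a"),
        new Tuple2<>("file-002.parquet", "key-b"),
        new Tuple2<>("file-001.parquet", "key-c")));
    // Single shuffle: repartition affinely with the MT file-groups and sort by file
    // name within each partition (keys for the same file end up adjacent).
    JavaPairRDD<String, String> partitionedAndSorted =
        fileToRecordKey.repartitionAndSortWithinPartitions(
            new AffineFileGroupPartitionerSketch(4, 8));
    partitionedAndSorted.collect().forEach(pair -> System.out.println(pair));
    jsc.stop();
  }
}

repartitionAndSortWithinPartitions pushes the per-partition sort into the single shuffle, which is what keeps the grouping-by-file cost low compared to a global sort.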

Impact

The impact of this PR is a more than 10x improvement in the record-tagging sequence when the Bloom Index leverages Bloom Filters persisted in the MT.

Risk level (write none, low, medium or high below)

Low

Documentation Update

We need to update the documentation to elaborate that, when using the MT Bloom Filter partition, users still need to increase the number of S3 connections to roughly ~30 per executor core in order to avert hitting the connection-pool exhaustion problem.
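
As a hedged illustration for the docs (fs.s3a.connection.maximum is the standard Hadoop S3A pool-size property; the executor core count and the job wiring are assumptions, and deployments using a different S3 client would tune its equivalent):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3ConnectionPoolSizingExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("hudi-bloom-index-job")
        .setIfMissing("spark.master", "local[2]"); // master normally comes from spark-submit
    JavaSparkContext jsc = new JavaSparkContext(conf);
    int coresPerExecutor = 8;                      // assumption: your executor sizing
    int s3MaxConnections = coresPerExecutor * 30;  // ~30 connections per executor core
    // Applies to the S3A client used when reading the MT bloom_filters partition.
    jsc.hadoopConfiguration().set("fs.s3a.connection.maximum", String.valueOf(s3MaxConnections));
    // ... run the Hudi write / tagging job ...
    jsc.stop();
  }
}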

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@danny0405 danny0405 added metadata priority:high Significant impact; potential bugs engine:spark Spark integration labels Jan 11, 2023
@alexeykudinkin alexeykudinkin added priority:blocker Production down; release blocker and removed priority:high Significant impact; potential bugs labels Jan 11, 2023
@alexeykudinkin alexeykudinkin requested a review from yihua January 11, 2023 06:59
@alexeykudinkin alexeykudinkin assigned yihua and unassigned nsivabalan Jan 11, 2023
@alexeykudinkin alexeykudinkin changed the title [DNM] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table [HUDI-5534] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table Jan 11, 2023
@alexeykudinkin alexeykudinkin changed the title [HUDI-5534] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table [HUDI-5534][Stacked on 6782] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table Jan 11, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/blm-idx-dag-opt branch 3 times, most recently from 2ac1a78 to 6988dfc Compare January 18, 2023 22:02
@alexeykudinkin alexeykudinkin changed the title [HUDI-5534][Stacked on 6782] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table [HUDI-5534][Stacked on 6815] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table Jan 19, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/blm-idx-dag-opt branch 3 times, most recently from 621d20b to 48c0f69 Compare January 20, 2023 06:16
@apache apache deleted a comment from hudi-bot Jan 20, 2023
@alexeykudinkin
Contributor Author

@hudi-bot run azure

@alexeykudinkin alexeykudinkin force-pushed the ak/blm-idx-dag-opt branch 6 times, most recently from cb0d7dc to 4437e63 Compare January 24, 2023 21:38
@alexeykudinkin alexeykudinkin changed the title [HUDI-5534][Stacked on 6815] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table [HUDI-5534] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table Jan 24, 2023

@yihua yihua left a comment


Halfway through the review; I left a few comments.

int configuredBloomIndexParallelism = config.getBloomIndexParallelism();

// NOTE: Target parallelism could be overridden by the config
int targetParallelism =
Contributor


note to myself: if the configured bloom index parallelism is smaller than the input parallelism, before this PR, we take the input parallelism; now, we take the configured bloom index parallelism.

Contributor Author


Correct. Previously we were taking max(input, configured), and there was essentially no way for the user to override it.
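
To illustrate the exchange above, a hedged sketch of the behavior change (method and variable names are hypothetical, not the actual Hudi code):

public class BloomIndexParallelismSketch {
  // Pre-PR behavior: the configured value could only ever raise the parallelism.
  static int targetParallelismBefore(int inputParallelism, int configuredParallelism) {
    return Math.max(inputParallelism, configuredParallelism);
  }

  // Post-PR behavior: an explicitly configured value wins, even if it is smaller;
  // otherwise fall back to the parallelism of the input.
  static int targetParallelismAfter(int inputParallelism, int configuredParallelism) {
    return configuredParallelism > 0 ? configuredParallelism : inputParallelism;
  }

  public static void main(String[] args) {
    System.out.println(targetParallelismBefore(500, 100)); // 500: config had no effect
    System.out.println(targetParallelismAfter(500, 100));  // 100: config now caps the stage
  }
}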

Alexey Kudinkin added 19 commits January 25, 2023 20:20
…HTTP request for every file probed, killing performance);

Added config to control the parallelism factor when reading BFs from MT;
Tidying up: added comments;
…ation` explicitly;

Dialed down buffer size from 10Mb to 10Kb (this buffer is practically not used)
…sm to read MT

as the one used for Bloom Index lookup overall
@apache apache deleted a comment from hudi-bot Jan 26, 2023
@hudi-bot
Collaborator

CI report:



@yihua yihua left a comment


LGTM. Thanks for fixing and improving the Bloom Index when reading bloom filters from the metadata table!

Comment on lines +134 to +141
    int bloomFilterPartitionFileGroupCount =
        config.getMetadataConfig().getBloomFilterIndexFileGroupCount();
    int adjustedTargetParallelism =
        targetParallelism % bloomFilterPartitionFileGroupCount == 0
            ? targetParallelism
            // NOTE: We add 1 to make sure parallelism a) value always stays positive and b)
            //       {@code targetParallelism <= adjustedTargetParallelism}
            : (targetParallelism / bloomFilterPartitionFileGroupCount + 1) * bloomFilterPartitionFileGroupCount;
Contributor


Make sure we benchmark the write workload with this new logic before landing the PR.

Contributor


One caveat is that if the targetParallelism is large, it may still overload the S3 bucket if the number of Spark executors is large (each executor reading metadata table's bloom_filters partition).

Contributor Author


Correct, this still has that risk. It could be controlled via the Bloom Index parallelism, though.
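
A quick worked example of the rounding shown in the excerpt above (the input numbers are hypothetical):

public class AdjustedParallelismExample {
  public static void main(String[] args) {
    int targetParallelism = 100;                 // hypothetical requested parallelism
    int bloomFilterPartitionFileGroupCount = 8;  // hypothetical bloom_filters file-group count
    int adjustedTargetParallelism =
        targetParallelism % bloomFilterPartitionFileGroupCount == 0
            ? targetParallelism
            : (targetParallelism / bloomFilterPartitionFileGroupCount + 1) * bloomFilterPartitionFileGroupCount;
    // 100 is not a multiple of 8, so it is rounded up to 104 (13 * 8); every
    // file-group then gets the same whole number of Spark partitions mapped onto it.
    System.out.println(adjustedTargetParallelism); // prints 104
  }
}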


  // TODO(HUDI-5619) remove when addressed
  private final Map<String, Map<String, String>> cachedLatestBaseFileNames =
      new HashMap<>(16);
Contributor


Any reason for using 16 instead of a bigger number?

Contributor Author


We want to skip the first few expansions, but we don't want to allocate too much memory that might never be used.

  @Override
  public HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException {
-   GenericRecord record = HoodieAvroUtils.rewriteRecord((GenericRecord) data, targetSchema);
+   GenericRecord record = HoodieAvroUtils.rewriteRecordWithNewSchema(data, targetSchema);
Contributor


nit: I see the other irrelevant changes were reverted, but this one is still here. If you think this is a performance improvement, we can keep it in this PR.

@alexeykudinkin alexeykudinkin merged commit e270924 into apache:master Jan 27, 2023
yihua pushed a commit that referenced this pull request Jan 30, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023

Labels

engine:spark Spark integration priority:blocker Production down; release blocker
