Commit 5078e72
[HUDI-5534] Optimizing Bloom Index lookup when using Bloom Filters from Metadata Table (apache#7642)
Most recently, while trying to use Metadata Table in Bloom Index it was resulting in failures due to exhaustion of S3 connection pool no matter how (reasonably big) we're setting the pool size (we've tested up to 3k connections).
This PR focuses on optimizing the Bloom Index lookup sequence in case when it's leveraging Bloom Filter partition in Metadata Table. The premise of this change is based on the following observations:
Increasing the size of the batch of the requests to MT allows to amortize the cost of processing it (bigger the batch, lesser the cost).
Having too few partitions in the Bloom Index path however, starts to hurt parallelism when we actually probe individual files whether they actually contain target keys or not. Solution to this is to split these 2 in different stages w/ drastically different parallelism levels: constrain parallelism when reading from MT (10s of tasks) and keep at the current level for probing individual files (100s of tasks)
Current way of partitioning records (relying on Spark's default partitioner) was entailing that every Spark executor with high likelihood will be opening up (and processing) every file-group of the MT Bloom Filter partition. To alleviate that same hashing algorithm used by MT should be used to partition records into Spark's individual partitions, so that we can limit every task to open no more than 1 file-group in Bloom Filter's partition of MT
To achieve that following changes in Bloom Index sequence (leveraging MT) are implemented
Bloom Filter probing and actual File Probing are split into 2 separate operations (so that parallelism of each of them could be controlled individually)
Requests to MT are replaced to invoke batch APIs
Custom partitioner is introduced AffineBloomIndexFileGroupPartitioner repartitioning dataset of filenames with corresponding record keys in a way that is affine w/ MT Bloom Filters' partitioning (allowing us to open no more than a single file-group per Spark's task)
Additionally, this PR addresses some of the low-hanging performance optimizations that could considerably improve performance of the Bloom Index lookup sequence like mapping file-comparison pairs to PairRDD (where key is file-name, and value is record-key) instead of RDD so that we could:
Do in-partition sorting by filename (to make sure we check all records w/in the file all at once) w/in a single Spark partition instead of global one (reducing shuffling as well)
Avoid re-shuffling (by re-mapping from RDD to PairRDD later)1 parent 2e6d29b commit 5078e72
35 files changed
Lines changed: 1059 additions & 484 deletions
File tree
- hudi-client
- hudi-client-common/src/main/java/org/apache/hudi/index
- bloom
- hudi-flink-client/src/test/java/org/apache/hudi/index/bloom
- hudi-spark-client/src
- main
- java/org/apache/hudi
- data
- index/bloom
- scala/org/apache/spark/sql/hudi
- test/java/org/apache/hudi/index/bloom
- hudi-common/src
- main/java/org/apache/hudi
- common
- data
- function
- model
- table
- util/collection
- metadata
- util
- test/java/org/apache/hudi/common/util/collection
- hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/adapter
Lines changed: 2 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
62 | 62 | | |
63 | 63 | | |
64 | 64 | | |
65 | | - | |
66 | | - | |
67 | | - | |
| 65 | + | |
| 66 | + | |
68 | 67 | | |
69 | 68 | | |
70 | 69 | | |
| |||
Lines changed: 2 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | | - | |
23 | 22 | | |
24 | 23 | | |
| 24 | + | |
25 | 25 | | |
26 | 26 | | |
27 | | - | |
28 | 27 | | |
29 | 28 | | |
30 | 29 | | |
| |||
51 | 50 | | |
52 | 51 | | |
53 | 52 | | |
54 | | - | |
| 53 | + | |
55 | 54 | | |
56 | 55 | | |
57 | 56 | | |
Lines changed: 40 additions & 37 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
| 30 | + | |
30 | 31 | | |
31 | 32 | | |
32 | 33 | | |
| |||
35 | 36 | | |
36 | 37 | | |
37 | 38 | | |
38 | | - | |
39 | 39 | | |
40 | 40 | | |
41 | 41 | | |
| |||
45 | 45 | | |
46 | 46 | | |
47 | 47 | | |
| 48 | + | |
48 | 49 | | |
49 | 50 | | |
50 | | - | |
51 | 51 | | |
52 | 52 | | |
53 | 53 | | |
| |||
127 | 127 | | |
128 | 128 | | |
129 | 129 | | |
130 | | - | |
| 130 | + | |
131 | 131 | | |
132 | 132 | | |
133 | 133 | | |
| |||
210 | 210 | | |
211 | 211 | | |
212 | 212 | | |
213 | | - | |
214 | | - | |
215 | | - | |
216 | | - | |
217 | | - | |
218 | | - | |
219 | | - | |
220 | | - | |
221 | | - | |
222 | | - | |
223 | | - | |
224 | | - | |
225 | | - | |
226 | | - | |
227 | | - | |
228 | | - | |
229 | | - | |
230 | | - | |
231 | | - | |
232 | | - | |
233 | | - | |
234 | | - | |
235 | | - | |
236 | | - | |
237 | | - | |
238 | | - | |
239 | | - | |
240 | | - | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
| 229 | + | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
241 | 242 | | |
242 | 243 | | |
243 | 244 | | |
| |||
278 | 279 | | |
279 | 280 | | |
280 | 281 | | |
281 | | - | |
| 282 | + | |
282 | 283 | | |
283 | 284 | | |
284 | 285 | | |
| |||
289 | 290 | | |
290 | 291 | | |
291 | 292 | | |
292 | | - | |
293 | | - | |
294 | | - | |
295 | | - | |
296 | | - | |
| 293 | + | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
297 | 300 | | |
298 | 301 | | |
299 | 302 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | | - | |
| 22 | + | |
| 23 | + | |
23 | 24 | | |
24 | 25 | | |
25 | 26 | | |
| |||
28 | 29 | | |
29 | 30 | | |
30 | 31 | | |
31 | | - | |
| 32 | + | |
32 | 33 | | |
33 | 34 | | |
34 | 35 | | |
| 36 | + | |
35 | 37 | | |
36 | 38 | | |
37 | | - | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
38 | 44 | | |
39 | | - | |
40 | | - | |
| 45 | + | |
| 46 | + | |
41 | 47 | | |
42 | 48 | | |
43 | 49 | | |
44 | 50 | | |
45 | 51 | | |
46 | | - | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
47 | 59 | | |
48 | 60 | | |
| 61 | + | |
| 62 | + | |
49 | 63 | | |
50 | 64 | | |
51 | 65 | | |
52 | | - | |
53 | | - | |
| 66 | + | |
| 67 | + | |
54 | 68 | | |
55 | 69 | | |
56 | | - | |
| 70 | + | |
57 | 71 | | |
58 | 72 | | |
59 | 73 | | |
60 | | - | |
| 74 | + | |
61 | 75 | | |
62 | 76 | | |
63 | 77 | | |
64 | | - | |
65 | | - | |
66 | | - | |
67 | | - | |
68 | 78 | | |
69 | 79 | | |
| 80 | + | |
70 | 81 | | |
71 | 82 | | |
72 | 83 | | |
73 | 84 | | |
74 | | - | |
75 | | - | |
76 | | - | |
77 | | - | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
78 | 93 | | |
79 | 94 | | |
80 | 95 | | |
| |||
100 | 115 | | |
101 | 116 | | |
102 | 117 | | |
103 | | - | |
| 118 | + | |
104 | 119 | | |
| 120 | + | |
105 | 121 | | |
106 | 122 | | |
107 | | - | |
108 | | - | |
109 | 123 | | |
110 | | - | |
111 | | - | |
| 124 | + | |
112 | 125 | | |
113 | 126 | | |
114 | 127 | | |
Lines changed: 8 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
| 28 | + | |
28 | 29 | | |
29 | 30 | | |
30 | 31 | | |
| |||
41 | 42 | | |
42 | 43 | | |
43 | 44 | | |
44 | | - | |
| 45 | + | |
45 | 46 | | |
46 | 47 | | |
47 | 48 | | |
| |||
74 | 75 | | |
75 | 76 | | |
76 | 77 | | |
77 | | - | |
| 78 | + | |
78 | 79 | | |
79 | 80 | | |
80 | 81 | | |
| |||
87 | 88 | | |
88 | 89 | | |
89 | 90 | | |
90 | | - | |
91 | | - | |
92 | | - | |
93 | | - | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
94 | 96 | | |
95 | 97 | | |
96 | 98 | | |
| |||
Lines changed: 13 additions & 13 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | | - | |
23 | 22 | | |
24 | 23 | | |
| 24 | + | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
30 | 31 | | |
31 | 32 | | |
32 | 33 | | |
33 | | - | |
| 34 | + | |
34 | 35 | | |
35 | | - | |
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
| |||
56 | 56 | | |
57 | 57 | | |
58 | 58 | | |
59 | | - | |
| 59 | + | |
60 | 60 | | |
61 | | - | |
| 61 | + | |
62 | 62 | | |
63 | 63 | | |
64 | 64 | | |
65 | | - | |
66 | | - | |
67 | | - | |
68 | | - | |
69 | | - | |
70 | | - | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
71 | 73 | | |
72 | | - | |
73 | | - | |
74 | 74 | | |
75 | 75 | | |
76 | 76 | | |
| |||
Lines changed: 6 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
| |||
186 | 187 | | |
187 | 188 | | |
188 | 189 | | |
189 | | - | |
| 190 | + | |
| 191 | + | |
190 | 192 | | |
191 | 193 | | |
192 | 194 | | |
193 | | - | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
194 | 198 | | |
195 | 199 | | |
196 | 200 | | |
| |||
0 commit comments