Commit 77d046e
[SPARK-21782][CORE] Repartition creates skews when numPartitions is a power of 2
## Problem
When an RDD (particularly one with few items per partition) is repartitioned to a `numPartitions` that is a power of 2, the resulting partitions are very uneven in size. This happens because the PRNG is initialized with a fixed seed and is used only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
## What changes were proposed in this pull request?
Instead of directly using `0, 1, 2,...` seeds to initialize `Random`, hash them with `scala.util.hashing.byteswap32()`.
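The effect of the change can be explored with a small standalone sketch. It is not the patched Spark code: the object and method names (`SeedSkewSketch`, `plainStart`, `hashedStart`) are hypothetical, and it assumes that each input partition makes a single `nextInt(numPartitions)` draw to pick its starting output partition, which is how the problem statement above reads. It compares how many distinct starting positions result from raw `0, 1, 2,...` seeds versus seeds passed through `byteswap32`.

```scala
import scala.util.Random
import scala.util.hashing

// Illustrative sketch only; names are hypothetical, not Spark's.
object SeedSkewSketch {
  // One draw from a PRNG seeded directly with the partition index
  // (the behaviour described in the Problem section).
  def plainStart(index: Int, numPartitions: Int): Int =
    new Random(index).nextInt(numPartitions)

  // Same single draw, but the index is scrambled with byteswap32
  // before it is used as the seed (the proposed change).
  def hashedStart(index: Int, numPartitions: Int): Int =
    new Random(hashing.byteswap32(index)).nextInt(numPartitions)

  def main(args: Array[String]): Unit = {
    val numPartitions = 16          // a power of 2, as in the bug report
    val inputPartitions = 0 until 64

    val plain = inputPartitions.map(plainStart(_, numPartitions))
    val hashed = inputPartitions.map(hashedStart(_, numPartitions))

    // Fewer distinct starting positions means more input partitions
    // begin writing to the same output partition.
    println(s"distinct starts, plain seeds:  ${plain.distinct.size}")
    println(s"distinct starts, hashed seeds: ${hashed.distinct.size}")
  }
}
```

If the raw seeds cluster on a few starting positions while the hashed seeds spread out, that matches the skew described above; the exact counts depend on the chosen `numPartitions` and number of input partitions.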
## How was this patch tested?
`build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`
Author: Sergey Serebryakov <sserebryakov@tesla.com>
Closes #18990 from megaserg/repartition-skew.
1 parent 28a6cca commit 77d046e
2 files changed
Lines changed: 6 additions & 3 deletions
File tree
- core/src
- main/scala/org/apache/spark/rdd
- test/scala/org/apache/spark/rdd