Commit 7e2c870
committed
[SPARK-31485][CORE] Avoid application hang if only partial barrier tasks launched
Use `dagScheduler.taskSetFailed` to abort a barrier stage instead of throwing exception within `resourceOffers`.
Any non fatal exception thrown within Spark RPC framework can be swallowed:
https://github.com/apache/spark/blob/100fc58da54e026cda87832a10e2d06eaeccdf87/core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala#L202-L211
The method `TaskSchedulerImpl.resourceOffers` is also within the scope of Spark RPC framework. Thus, throw exception inside `resourceOffers` won't fail the application.
As a result, if a barrier stage fail the require check at `require(addressesWithDescs.size == taskSet.numTasks, ...)`, the barrier stage will fail the check again and again util all tasks from `TaskSetManager` dequeued. But since the barrier stage isn't really executed, the application will hang.
The issue can be reproduced by the following test:
```scala
initLocalClusterSparkContext(2)
val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
val dep = new OneToOneDependency[Int](rdd0)
val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"),Seq("executor_h_0")))
rdd.barrier().mapPartitions { iter =>
BarrierTaskContext.get().barrier()
iter
}.collect()
```
Yes, application hang previously but fail-fast after this fix.
Added a regression test.
Closes apache#28257 from Ngone51/fix_barrier_abort.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>1 parent a602ab6 commit 7e2c870
3 files changed
Lines changed: 43 additions & 13 deletions
File tree
- core/src
- main/scala/org/apache/spark/scheduler
- test/scala/org/apache/spark/scheduler
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
63 | 63 | | |
64 | 64 | | |
65 | 65 | | |
66 | | - | |
| 66 | + | |
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
| |||
Lines changed: 24 additions & 10 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
57 | 57 | | |
58 | 58 | | |
59 | 59 | | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
60 | 65 | | |
61 | 66 | | |
62 | 67 | | |
| |||
356 | 361 | | |
357 | 362 | | |
358 | 363 | | |
359 | | - | |
360 | | - | |
361 | | - | |
| 364 | + | |
362 | 365 | | |
363 | 366 | | |
364 | 367 | | |
| |||
516 | 519 | | |
517 | 520 | | |
518 | 521 | | |
519 | | - | |
520 | | - | |
521 | | - | |
522 | | - | |
523 | | - | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
| 530 | + | |
| 531 | + | |
| 532 | + | |
| 533 | + | |
524 | 534 | | |
525 | 535 | | |
526 | 536 | | |
| |||
582 | 592 | | |
583 | 593 | | |
584 | 594 | | |
585 | | - | |
586 | | - | |
| 595 | + | |
| 596 | + | |
| 597 | + | |
| 598 | + | |
| 599 | + | |
| 600 | + | |
587 | 601 | | |
588 | 602 | | |
589 | 603 | | |
| |||
Lines changed: 18 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
29 | | - | |
| 29 | + | |
30 | 30 | | |
31 | 31 | | |
32 | 32 | | |
33 | | - | |
| 33 | + | |
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
| |||
276 | 276 | | |
277 | 277 | | |
278 | 278 | | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
| 286 | + | |
| 287 | + | |
| 288 | + | |
| 289 | + | |
| 290 | + | |
| 291 | + | |
| 292 | + | |
| 293 | + | |
| 294 | + | |
279 | 295 | | |
0 commit comments