
Conversation

@ivoson ivoson commented Sep 15, 2025

What changes were proposed in this pull request?

This PR proposes to retry all tasks of the consumer stages when checksum mismatches are detected on their producer stages. If we can't roll back and retry all tasks of a consumer stage, we have to abort the stage (and thus the job).

How we detected and handled nondeterminism before:

  • Stages are labeled as indeterminate at planning time, prior to query execution.
  • When a task completes and FetchFailed is detected, we abort all unrollbackable succeeding stages of the map stage and resubmit the failed stages.
  • In submitMissingTasks(), if a stage itself is isIndeterminate, we call unregisterAllMapAndMergeOutput() and retry all tasks of the stage.

How we detect and handle nondeterminism now:

  • During query execution, we keep track of the checksums produced by each map task.
  • When a task completes and a checksum mismatch is detected, we abort the unrollbackable succeeding stages of the stage with the mismatch. The resubmission of failed stages still happens in the same places as before.
  • In submitMissingTasks(), if the parent of a stage has checksum mismatches, we call unregisterAllMapAndMergeOutput() and retry all tasks of the stage.

Note that (1) if a stage isReliablyCheckpointed, its consumer stages don't need a whole-stage retry, and (2) when mismatches are detected for a stage in a chain (e.g., the first stage in stage_i -> stage_i+1 -> stage_i+2 -> ...), the direct consumer (e.g., stage_i+1) of that stage gets a whole-stage retry, and an indirect consumer (e.g., stage_i+2) gets a whole-stage retry when its own parent detects checksum mismatches.
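A minimal sketch of the retry decision described above, using simplified stand-in types (Producer, Consumer, hasChecksumMismatch are illustrative names, not the actual DAGScheduler API):

```scala
// Simplified stand-ins for Spark's internal scheduler types; illustrative only.
case class Producer(
    isReliablyCheckpointed: Boolean, // checkpointed output is stable across retries
    hasChecksumMismatch: Boolean)    // set when a retried map task produced different output

case class Consumer(
    isIndeterminate: Boolean, // legacy: labeled at planning time
    parent: Producer)

// Roughly the decision made in submitMissingTasks() for a consumer stage:
// should every task be rerun, rather than just the missing ones?
def needsFullStageRetry(stage: Consumer, checksumMismatchFullRetryEnabled: Boolean): Boolean =
  if (checksumMismatchFullRetryEnabled) {
    // New: runtime signal. A reliably checkpointed parent never forces a retry,
    // since its output does not change across attempts.
    !stage.parent.isReliablyCheckpointed && stage.parent.hasChecksumMismatch
  } else {
    // Legacy: planning-time labeling.
    stage.isIndeterminate
  }

// Before the full retry, all previously registered map (and merge) output of the
// mismatched shuffle is discarded via unregisterAllMapAndMergeOutput().
```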

Why are the changes needed?

Handle nondeterminism caused by the retry of shuffle map tasks.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs added.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Sep 15, 2025
@ivoson ivoson changed the title [WIP][SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task [SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task Sep 17, 2025
@ivoson ivoson marked this pull request as ready for review September 17, 2025 03:39
ivoson commented Sep 22, 2025

cc @cloud-fan @mridulm @attilapiros can you please review this PR? This deals with non-deterministic stage retries based on checksum mismatch detection (#50230).

ConfigBuilder("spark.scheduler.checksumMismatchFullRetry.enabled")
  .doc("Whether to retry all tasks of a consumer stage when we detect checksum mismatches " +
    "with its producer stages. The checksum computation is controlled by another config " +
    "called SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.")
Contributor:

Does it make sense to use SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED without SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED? Or vice versa?

What about removing SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED (since the version where it is introduced is also 4.1.0, we can do that) and computing the checksum when SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is true? So we'd have only one config for the feature.

Contributor:

This is a good point. Why do we separate the checksum computation and the stage retry into two flags? Do we have logging for checksum mismatches without retry?

ivoson (Contributor, Author):

Yes, a log is written when a checksum mismatch happens. If SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is false, there will be no full retry of succeeding stages, only logs for the checksum mismatch.

And if we keep the two configs, when we want to enable SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED we need to make sure SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED is also true.

A single config would be easier to use. If that makes sense to you all, I'll remove SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.

Contributor:

I think it makes sense to have the log-only mode, so that Spark users can do impact analysis before turning on the retry.

Contributor:

We can improve it a bit more: we will compute the checksum when either SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED or SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is enabled.

ivoson (Contributor, Author):

Thanks, updated.

@github-actions github-actions bot added the SQL label Sep 24, 2025
@dongjoon-hyun (Member) left a comment:

The new configuration should be under the spark.scheduler.checksum namespace, because spark.scheduler.checksum.enabled=false will disable this, @ivoson .

Specifically, I'd like to propose the following new name. WDYT?

- spark.scheduler.checksumMismatchFullRetry.enabled
+ spark.scheduler.checksum.enableFullRetryOnMismatch

ivoson commented Sep 24, 2025

The new configuration should be under the spark.scheduler.checksum namespace, because spark.scheduler.checksum.enabled=false will disable this, @ivoson .

Specifically, I'd like to propose the following new name. WDYT?

- spark.scheduler.checksumMismatchFullRetry.enabled
+ spark.scheduler.checksum.enableFullRetryOnMismatch

Hey @dongjoon-hyun, there might be some misunderstanding here: we don't depend on spark.scheduler.checksum.enabled, and that config actually does not exist.

Currently there are two related configs for the feature:
spark.sql.shuffle.orderIndependentChecksum.enabled: whether to compute an order-independent checksum for shuffle output;
spark.scheduler.checksumMismatchFullRetry.enabled: whether to retry all tasks of a succeeding stage when a shuffle checksum mismatch is detected.

Please let me know if you have any suggestions regarding the above configs. Thanks.

dongjoon-hyun commented Sep 25, 2025

Currently there are two related configs for the feature:
spark.sql.shuffle.orderIndependentChecksum.enabled: whether to compute an order-independent checksum for shuffle output;
spark.scheduler.checksumMismatchFullRetry.enabled: whether to retry all tasks of a succeeding stage when a shuffle checksum mismatch is detected.

Thank you for correcting me. In that case, spark.sql.shuffle.orderIndependentChecksum.* seems to be the parent namespace for this feature. If spark.sql.shuffle.orderIndependentChecksum.enabled=false disables this PR's configuration, it should be under the same namespace. The revised config name might be the following. WDYT, @ivoson?

- spark.scheduler.checksumMismatchFullRetry.enabled
+ spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch

The basic idea is the dependency among the configurations. Please let me know your hierarchy for the new set of configurations for this feature.

ivoson commented Sep 26, 2025

Currently there are two related configs for the feature:
spark.sql.shuffle.orderIndependentChecksum.enabled: whether to compute an order-independent checksum for shuffle output;
spark.scheduler.checksumMismatchFullRetry.enabled: whether to retry all tasks of a succeeding stage when a shuffle checksum mismatch is detected.

Thank you for correcting me. In that case, spark.sql.shuffle.orderIndependentChecksum.* seems to be the parent namespace for this feature. If spark.sql.shuffle.orderIndependentChecksum.enabled=false disables this PR's configuration, it should be under the same namespace. The revised config name might be the following. WDYT, @ivoson?

- spark.scheduler.checksumMismatchFullRetry.enabled
+ spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch

The basic idea is the dependency among the configurations. Please let me know your hierarchy for the new set of configurations for this feature.

Thanks @dongjoon-hyun for the suggestion. Updated. For the new configs:

spark.sql.shuffle.orderIndependentChecksum.enabled -> when it's true, we compute the shuffle checksum and only log detected checksum mismatches if spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch is false;
spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch -> when it's true, we compute the shuffle checksum and fully retry consumer stages once a mismatch happens.
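As a usage illustration, a sketch of the two modes described above (the app name and master below are placeholder values, not from this PR):

```scala
import org.apache.spark.sql.SparkSession

// Log-only mode: compute order-independent checksums and log mismatches,
// while keeping the legacy retry behavior.
val spark = SparkSession.builder()
  .appName("checksum-config-demo")
  .master("local[*]")
  .config("spark.sql.shuffle.orderIndependentChecksum.enabled", "true")
  .config("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch", "false")
  .getOrCreate()

// Full-retry mode would instead set:
//   .config("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch", "true")
// which also triggers checksum computation on its own, per the semantics above.
```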

@dongjoon-hyun (Member) left a comment:

Thank you so much for updating, @ivoson. +1, LGTM (if CI passes).

Could you rebase onto the master branch? It was broken yesterday and is now fixed via 3d19a65.

log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
return
case sms: ShuffleMapStage if !sms.isAvailable =>
if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
Contributor:

nit: we can make the code a bit clearer:

val needsFullStageRetry = if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
  // the comment
  stage.isParentIndeterminate
} else {
  // the legacy code
}
if (needsFullStageRetry) {
  mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
  sms.shuffleDep.newShuffleMergeState()
}

ivoson (Contributor, Author):

Thanks, done.

Seq(("true", "false"), ("false", "true"), ("true", "true")).foreach {
case (orderIndependentChecksumEnabled: String, checksumMismatchFullRetryEnabled: String) =>
withSQLConf(
"spark.sql.shuffle.orderIndependentChecksum.enabled" -> orderIndependentChecksumEnabled,
Contributor:

nit: let's not hardcode it; we can reference them by SQLConf key name

ivoson (Contributor, Author):

Updated.
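For reference, a sketch of what the updated test presumably looks like; the SQLConf constant names below are assumptions based on identifiers quoted elsewhere in this thread, and withSQLConf is Spark's SQL test helper (provided by suites mixing in SQLTestUtils):

```scala
import org.apache.spark.sql.internal.SQLConf

// Reference the config keys via SQLConf constants instead of hardcoded strings.
Seq(("true", "false"), ("false", "true"), ("true", "true")).foreach {
  case (orderIndependentChecksumEnabled, checksumMismatchFullRetryEnabled) =>
    withSQLConf(
      SQLConf.SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.key ->
        orderIndependentChecksumEnabled,
      SQLConf.SHUFFLE_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED.key ->
        checksumMismatchFullRetryEnabled) {
      // ... test body ...
    }
}
```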

    checksumMismatchFullRetryEnabled) {
  assert(SQLConf.get.shuffleOrderIndependentChecksumEnabled ===
    orderIndependentChecksumEnabled.toBoolean)
  assert(SQLConf.get.shuffleChecksumMismatchFullRetryEnabled ===
@cloud-fan (Contributor) commented Sep 28, 2025:

We already have SQL conf test suites to verify the basic functionality; no need to test it here.

ivoson (Contributor, Author):

Updated.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 922adad Sep 29, 2025
@ivoson ivoson deleted the SPARK-53575 branch September 29, 2025 14:35
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
[SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task

Closes apache#52336 from ivoson/SPARK-53575.

Authored-by: Tengfei Huang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
return
case sms: ShuffleMapStage if !sms.isAvailable =>
val needFullStageRetry = if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
@mridulm (Contributor) commented Dec 6, 2025:

Catching up on PRs I missed out on reviewing.

This negatively interacts with push-based shuffle when it is enabled.
The condition should be sms.shuffleDep.checksumMismatchFullRetryEnabled && !pushBasedShuffleEnabled.

+CC @ivoson

ivoson (Contributor, Author):

Hi @mridulm, can you please explain more about the issue with push-based shuffle? Thanks.

@mridulm (Contributor) commented Dec 10, 2025:

With push-based shuffle enabled, a mapper's output is also pushed to mergers to create a reducer-oriented view (all mappers write to a single merger for a given reducer).
If a subset of mapper tasks is now re-executed, the merged output is affected, since it was already finalized when the previous attempt completed, causing a disconnect between the mapper output from the new attempt and the merged output from the previous attempt.

Essentially, for indeterminate stages, the entire reducer-oriented view is unusable and needs to be recomputed.
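A sketch of the proposed condition (not merged code; the stand-in types and the pushBasedShuffleEnabled parameter below are illustrative, the real check would live in DAGScheduler):

```scala
// Simplified stand-ins, not the real scheduler types.
case class ShuffleDep(checksumMismatchFullRetryEnabled: Boolean)
case class ShuffleMapStage(
    shuffleDep: ShuffleDep,
    isParentIndeterminate: Boolean, // checksum-mismatch signal from the parent
    isIndeterminate: Boolean)       // legacy planning-time label

def needFullStageRetry(sms: ShuffleMapStage, pushBasedShuffleEnabled: Boolean): Boolean =
  if (sms.shuffleDep.checksumMismatchFullRetryEnabled && !pushBasedShuffleEnabled) {
    sms.isParentIndeterminate // take the checksum-driven path only without push merge
  } else {
    sms.isIndeterminate // otherwise keep the legacy condition
  }
```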

ivoson (Contributor, Author):

Hi @mridulm, to recompute the indeterminate stages we clean up all the shuffle outputs and the shuffle merge state for push-based shuffle. Would that resolve your concern regarding push-based shuffle?

mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
sms.shuffleDep.newShuffleMergeState()

Comment on lines +1572 to +1573
val stagesToRollback = collectSucceedingStages(sms)
abortStageWithInvalidRollBack(stagesToRollback)
@mridulm (Contributor):

nit: We could have delegated this to abortUnrollbackableStages

ivoson (Contributor, Author):

Thanks @mridulm.

I am working on another PR to cover more scenarios; can you please take a look as well? Thanks.

I will do some code refactoring in that PR: #53274

val isChecksumMismatched = mapOutputTracker.registerMapOutput(
  shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
if (isChecksumMismatched) {
  shuffleStage.isChecksumMismatched = isChecksumMismatched
@mridulm (Contributor) commented Dec 6, 2025:

This is never reset back to false when the stage attempt is retried and succeeds; what am I missing?
This would mean the app will always fail, right?

Not sure what I am missing here.
+CC @ivoson , @cloud-fan , @attilapiros

ivoson (Contributor, Author):

Hi @mridulm, this is not set back to false. We expect all succeeding stages to do a full retry once a checksum mismatch happens for the stage, since we don't know which version of the shuffle output the successful tasks consumed.

This won't fail the app; the impact is that the succeeding stages will do a full retry.

The code logic has changed a little in PR #53274.

Please take a look once you get a chance. Thanks.

Contributor:

On retry, when we throw away the entire mapper output and recompute it, at that point can we set it back to false?

ivoson (Contributor, Author):

Currently it's not set back to false; we only recompute when a new shuffle checksum mismatch is detected. Maybe we can remove the flag to avoid the confusion here.
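To make the flag semantics discussed here concrete, a simplified model (a stand-in class, not the real ShuffleMapStage; the reset method is a hypothetical illustration of the reviewer's question, not current behavior):

```scala
// Simplified model of the isChecksumMismatched lifecycle.
class ShuffleMapStageModel(val shuffleId: Int) {
  // Set the first time registerMapOutput reports a mismatch for this stage.
  // Per the discussion above, it is currently never cleared, so every later
  // submission of a consumer stage does a full retry.
  var isChecksumMismatched: Boolean = false

  // Called from task-completion handling when a mismatch is detected.
  def onChecksumMismatch(): Unit = {
    isChecksumMismatched = true
  }

  // Hypothetical reset point raised in this thread: once all map output is
  // discarded and the whole stage recomputed, the mismatched output is gone.
  def onAllMapOutputRecomputed(): Unit = {
    isChecksumMismatched = false
  }
}
```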
