[SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task #52336
Conversation
cc @cloud-fan @mridulm @attilapiros, can you please review this PR? This deals with non-deterministic stage retry based on checksum mismatch detection (#50230).
ConfigBuilder("spark.scheduler.checksumMismatchFullRetry.enabled")
  .doc("Whether to retry all tasks of a consumer stage when we detect checksum mismatches " +
    "with its producer stages. The checksum computation is controlled by another config " +
    "called SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.")
Does it make sense to use SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED without SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED, or vice versa?
What about removing SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED (since the version where it is introduced is also 4.1.0, we can do that) and computing the checksum when SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is true? So having only one config for the feature?
This is a good point. Why do we separate the checksum computation and stage retry with two flags? Do we have logging for checksum mismatches without retry?
Yes, a log is written when a checksum mismatch happens. If SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is false, there will be no full retry for succeeding stages, only logs for the checksum mismatch.
And when we want to enable SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED, we need to make sure SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED is also true if we keep these two configs.
One config would be easier to use. If it makes sense to you all, I'll remove SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.
I think it makes sense to have the log-only mode, so that Spark users can do impact analysis before turning on the retry.
We can improve it a bit more: we will compute the checksum when either SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED or SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED is enabled.
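To make the agreed behaviour concrete, here is a minimal standalone Scala sketch (the object, case class, and method names are illustrative only, not the actual DAGScheduler/ShuffleDependency code): the checksum is computed when either flag is on, while the full consumer-stage retry is driven only by the retry flag.

object ChecksumGating {
  // Illustrative holder for the two flags discussed above.
  case class ChecksumConf(
      orderIndependentChecksumEnabled: Boolean,  // SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED
      checksumMismatchFullRetryEnabled: Boolean) // SCHEDULER_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED

  // Compute the order-independent checksum if either flag is enabled,
  // so the log-only mode keeps working without the retry flag.
  def shouldComputeChecksum(c: ChecksumConf): Boolean =
    c.orderIndependentChecksumEnabled || c.checksumMismatchFullRetryEnabled

  // Only the retry flag triggers the full retry of consumer stages.
  def shouldFullyRetryConsumersOnMismatch(c: ChecksumConf): Boolean =
    c.checksumMismatchFullRetryEnabled
}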
Thanks, updated.
dongjoon-hyun
left a comment
The new configuration should be under the spark.scheduler.checksum namespace, because spark.scheduler.checksum.enabled=false will disable this, @ivoson.
Specifically, I'd like to propose the following new name. WDYT?
- spark.scheduler.checksumMismatchFullRetry.enabled
+ spark.scheduler.checksum.enableFullRetryOnMismatch
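For illustration, the proposed rename would slot into the ConfigBuilder definition quoted earlier roughly as below; the 4.1.0 version comes from the discussion above, while the default value is an assumption and the key that was finally merged may differ.

// Sketch only: the renamed key in the existing ConfigBuilder pattern.
ConfigBuilder("spark.scheduler.checksum.enableFullRetryOnMismatch")
  .doc("Whether to retry all tasks of a consumer stage when we detect checksum mismatches " +
    "with its producer stages.")
  .version("4.1.0")          // 4.1.0 per the discussion above
  .booleanConf
  .createWithDefault(false)  // default is an assumption, not taken from the PR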
Hey @dongjoon-hyun, there might be some misunderstanding here; we don't depend on … Currently there are two related configs for the feature: … Please let me know if you have any suggestions regarding the above configs. Thanks.
Thank you for correcting me. In that case, the basic idea is the dependency among the configurations. Please let me know your hierarchy for the new set of configurations for this feature.
Thanks @dongjoon-hyun for the suggestion. Updated. For the new configs: …
dongjoon-hyun
left a comment
log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
return
case sms: ShuffleMapStage if !sms.isAvailable =>
  if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
nit: we can make the code a bit clearer
val needsFullStageRetry = if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
  // the comment
  stage.isParentIndeterminate
} else {
  // the legacy code
}
if (needsFullStageRetry) {
  mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
  sms.shuffleDep.newShuffleMergeState()
}
Thanks, done.
Seq(("true", "false"), ("false", "true"), ("true", "true")).foreach {
  case (orderIndependentChecksumEnabled: String, checksumMismatchFullRetryEnabled: String) =>
    withSQLConf(
      "spark.sql.shuffle.orderIndependentChecksum.enabled" -> orderIndependentChecksumEnabled,
nit: let's not hardcode it; we can reference them by the SQLConf key name.
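As an illustration of the suggestion, the test fragment above would become something like the following; the SQLConf constant names here are assumed, since the final names are not visible in this excerpt.

withSQLConf(
  // Assumed constant names; reference the keys instead of hardcoded strings.
  SQLConf.SHUFFLE_ORDER_INDEPENDENT_CHECKSUM_ENABLED.key -> orderIndependentChecksumEnabled,
  SQLConf.SHUFFLE_CHECKSUM_MISMATCH_FULL_RETRY_ENABLED.key -> checksumMismatchFullRetryEnabled) {
  // test body unchanged
}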
Updated.
checksumMismatchFullRetryEnabled) {
  assert(SQLConf.get.shuffleOrderIndependentChecksumEnabled ===
    orderIndependentChecksumEnabled.toBoolean)
  assert(SQLConf.get.shuffleChecksumMismatchFullRetryEnabled ===
We already have SQL conf test suites to verify the basic functionality; no need to test it here.
Updated.
thanks, merging to master!
[SPARK-53575][CORE] Retry entire consumer stages when checksum mismatch detected for a retried shuffle map task

Closes apache#52336 from ivoson/SPARK-53575.

Authored-by: Tengfei Huang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
log"(${MDC(STAGE_ID, stage.id)}) were aborted so this stage is not needed anymore.")
return
case sms: ShuffleMapStage if !sms.isAvailable =>
  val needFullStageRetry = if (sms.shuffleDep.checksumMismatchFullRetryEnabled) {
Catching up on PRs I missed out on reviewing.
This interacts negatively with push-based shuffle when it is enabled.
The condition should be sms.shuffleDep.checksumMismatchFullRetryEnabled && !pushBasedShuffleEnabled
+CC @ivoson
Hi @mridulm, can you please explain more about the issue with push-based shuffle? Thanks.
With push-based shuffle enabled, a mapper's output would also be pushed to mergers to create a reducer-oriented view (all mappers write to a single merger for a given reducer).
If a subset of mapper tasks are now re-executed, the merged output would be impacted, as it was already finalized when the previous attempt completed: this causes a disconnect between the mapper output from the new attempt and the merged output from the previous attempt.
Essentially, for indeterminate stages, the entire reducer-oriented view is unusable and needs to be recomputed.
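A toy illustration of that disconnect, in plain Scala rather than Spark code: the merged, finalized view is built from attempt 0's outputs, so a reducer that mixes it with a re-executed mapper's new output no longer sees a single consistent dataset.

object PushMergeDisconnect extends App {
  // Attempt 0: two mappers, their outputs pushed and merged into a finalized
  // reducer-oriented view.
  val attempt0MapOutputs = Map(0 -> Seq("a", "b"), 1 -> Seq("c", "d"))
  val finalizedMergedView = attempt0MapOutputs.values.flatten.toSeq

  // Mapper 1 is re-executed and, being nondeterministic, produces different rows.
  val rerunMapper1 = Seq("d", "e")

  // A reducer reading mapper 0's data plus mapper 1's new output now disagrees
  // with the finalized merged view from attempt 0.
  val mixedView = attempt0MapOutputs(0) ++ rerunMapper1
  println(finalizedMergedView == mixedView) // false
}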
Hi @mridulm, to recompute the indeterminate stages, we'll clean up all the shuffle outputs and the shuffle merge state for push-based shuffle. Would that resolve your concern regarding push-based shuffle?
spark/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala, lines 1584 to 1585 in d0fbb15:

mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
sms.shuffleDep.newShuffleMergeState()
val stagesToRollback = collectSucceedingStages(sms)
abortStageWithInvalidRollBack(stagesToRollback)
nit: We could have delegated this to abortUnrollbackableStages
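For illustration, the delegation might look like the sketch below, assuming abortUnrollbackableStages simply wraps the two calls quoted in the diff (the helper's signature is assumed, not taken from the PR).

// Hypothetical wrapper for the two calls above.
private def abortUnrollbackableStages(sms: ShuffleMapStage): Unit = {
  val stagesToRollback = collectSucceedingStages(sms)
  abortStageWithInvalidRollBack(stagesToRollback)
}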
val isChecksumMismatched = mapOutputTracker.registerMapOutput(
  shuffleStage.shuffleDep.shuffleId, smt.partitionId, status)
if (isChecksumMismatched) {
  shuffleStage.isChecksumMismatched = isChecksumMismatched
This is never reset back to false when the stage attempt is retried and succeeds; what am I missing?
This would mean the app will always fail, right?
Not sure what I am missing here.
+CC @ivoson , @cloud-fan , @attilapiros
Hi @mridulm, this is not set back to false. We would expect all the succeeding stages to do a full retry once a checksum mismatch happens for the stage, since we don't know which version of the shuffle output the successful tasks consumed.
This won't fail the app; the impact is that the succeeding stages would do a full retry.
The code logic has changed a little bit in PR #53274.
Please take a look once you get a chance. Thanks.
On retry, when we throw away the entire mapper output and recompute it, at that point can we go back to setting it to false?
Currently, it's not set back to false. We'll only recompute once any new shuffle checksum mismatch is detected. Maybe we can remove the flag to avoid the confusion here.
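To summarize the thread, here is a toy model in plain Scala (not DAGScheduler code) of the retry decision being discussed: once a producer stage has reported a checksum mismatch, its consumer recomputes every partition instead of only the missing ones, which is why the flag never being reset leads to extra full retries rather than application failure.

object ChecksumRetryToy extends App {
  // Illustrative stand-in for a consumer stage.
  final case class ToyStage(
      numPartitions: Int,
      missingPartitions: Set[Int],
      parentChecksumMismatched: Boolean)

  // If the parent reported a mismatch and full retry is enabled,
  // recompute everything; otherwise only the missing partitions.
  def partitionsToRecompute(stage: ToyStage, fullRetryEnabled: Boolean): Set[Int] =
    if (fullRetryEnabled && stage.parentChecksumMismatched) (0 until stage.numPartitions).toSet
    else stage.missingPartitions

  val consumer = ToyStage(numPartitions = 4, missingPartitions = Set(2), parentChecksumMismatched = true)
  println(partitionsToRecompute(consumer, fullRetryEnabled = true))  // all four partitions
  println(partitionsToRecompute(consumer, fullRetryEnabled = false)) // Set(2)
}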
What changes were proposed in this pull request?
This PR proposes to retry all tasks of the consumer stages when checksum mismatches are detected on their producer stages. In the case that we can't roll back and retry all tasks of a consumer stage, we will have to abort the stage (and thus the job).
How do we detect and handle nondeterminism before:
- Stages are labeled as indeterminate at planning time, prior to query execution.
- When a task completes and `FetchFailed` is detected, we abort all unrollbackable succeeding stages of the map stage and resubmit the failed stages.
- In `submitMissingTasks()`, if a stage itself is indeterminate, we call `unregisterAllMapAndMergeOutput()` and retry all tasks for the stage.

How do we detect and handle nondeterminism now:
- During query execution, we keep track of the checksums produced by each map task.
- When a task completes and a checksum mismatch is detected, we abort the unrollbackable succeeding stages of the stage with checksum mismatches. The failed-stage resubmission still happens in the same places as before.
- In `submitMissingTasks()`, if the parent of a stage has checksum mismatches, we call `unregisterAllMapAndMergeOutput()` and retry all tasks for the stage.

Note that (1) if a stage `isReliablyCheckpointed`, the consumer stages don't need a whole-stage retry, and (2) when mismatches are detected for a stage in a chain (e.g., the first stage in stage_i -> stage_i+1 -> stage_i+2 -> ...), the direct consumer (e.g., stage_i+1) of that stage will have a whole-stage retry, and an indirect consumer (e.g., stage_i+2) will have a whole-stage retry when its parent detects checksum mismatches.

Why are the changes needed?
Handle nondeterminism issues caused by the retry of shuffle map tasks.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UTs added.
Was this patch authored or co-authored using generative AI tooling?
No