[RayService] Rollback Support for Incremental Upgrades#4109
[RayService] Rollback Support for Incremental Upgrades#4109andrewsykim merged 14 commits intoray-project:masterfrom
Conversation
|
cc: @Future-Outlier the changes for full rollback support are just in this commit: f5d0243 |
Signed-off-by: Ryan O'Leary <[email protected]>
f5d0243 to
5938b5b
Compare
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
|
cc: @Future-Outlier this is rebased on the latest changes now and should be good to review |
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
ray-operator/test/e2eincrementalupgrade/rayservice_incremental_upgrade_test.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
|
cc @JiangJiaWei1103 to take a look too |
Signed-off-by: Ryan O'Leary <[email protected]>
…ays evaluate false Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
andrewsykim
left a comment
There was a problem hiding this comment.
I had some nits about naming and terminology in the logs, but otherwise I think the logic seems fine. @Future-Outlier @rueian any concerns?
| if isRollbackInProgress { | ||
| // Gradually shift traffic from the pending to the active cluster. | ||
| // Rollback traffic migration occurs regardless of pending cluster readiness. | ||
| logger.Info("Rollback in progress. Shifting traffic back to active cluster.", "stepSize", *options.StepSizePercent) |
There was a problem hiding this comment.
I think the active / pending terminology will get confusing going forward, especially as we add rollback support. Do we use these terms in the APIs or only in the logs? If only in the logs we should consider changing the terminology in a future PR. Maybe next cluster and previous cluster might be easier to understand.
There was a problem hiding this comment.
active and pending are used in the API in ActiveServiceStatus and PendingServiceStatus, so I used similar terminology when referring to the cluster that's being scaled for each service. Currently we only refer to the active/pending clusters in the comments/logs though so that should be fine to change to next cluster and previous cluster. Can follow-up in the next PR if needed
|
cc @spencer-p for review as well since you've been thinking about Gateway integrations in KubeRay as well |
Signed-off-by: Ryan O'Leary <[email protected]>
Future-Outlier
left a comment
There was a problem hiding this comment.
- I will come back to review this after kuberay 1.6 release
- due to my personal bandwidth, I might don't have bandwidth review this now
- if @andrewsykim and @rueian approve, I am ok to merge this
- we will need to merge @JiangJiaWei1103 's e2e test for incremental upgrades before this get merged
|
We committed in the KubeRay community sync that the E2E tests must be merged before we can merge the incremental upgrade PR. I may not have communicated that clearly enough earlier. If Ryan and Jia-Wei’s E2E tests are not merged, I will revert this PR before the 1.6 release. |
|
@codex review |
|
@cursor review |
| if meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.RollbackInProgress)) { | ||
| // Rollback the upgrade. The active RayCluster should be scaled back to 100% target_capacity, | ||
| // while the pending RayCluster is scaled to 0%. This is the inverse of the regular upgrade path. | ||
| activeTrafficRoutedPercent := ptr.Deref(activeRayServiceStatus.TrafficRoutedPercent, 0) |
There was a problem hiding this comment.
Inconsistent default for active TrafficRoutedPercent causes potential deadlock
Medium Severity
During rollback, reconcileServeTargetCapacity defaults active TrafficRoutedPercent to 0 via ptr.Deref(..., 0), while calculateTrafficRoutedPercent defaults the same field to 100 via ptr.Deref(..., 100). If TrafficRoutedPercent is ever nil during rollback, the capacity reconciler sees activeTargetCapacity(100) != activeTrafficRoutedPercent(0) and defers, while the traffic calculator sees activeClusterWeight(100) == activeClusterTargetCapacity(100) and stops migration. Neither function makes progress, creating a deadlock where the rollback can never complete.
Additional Locations (1)
There was a problem hiding this comment.
I think the correct solution is to default it to TargetCapacity if it's ever nil, rather than 100 or 0 since either can result in deadlock (100 (capacity) != 0 (traffic) -> defer or 100 (traffic) == 100 (capacity) -> defer if we default when it becomes nil.
There was a problem hiding this comment.
fixed in d02b7be and added unit tests for this case
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8b66d42a5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if isRollbackInProgress { | ||
| // Gradually shift traffic from the pending to the active cluster. | ||
| // Rollback traffic migration occurs regardless of pending cluster readiness. | ||
| logger.Info("Rollback in progress. Shifting traffic back to active cluster.", "stepSize", *options.StepSizePercent) |
There was a problem hiding this comment.
Gate rollback traffic migration on active readiness
The rollback branch starts shifting HTTPRoute weight back to the active cluster whenever isRollbackInProgress is true, but it never checks whether the active cluster is actually ready to serve. In the case where the active cluster is unhealthy during an in-progress upgrade and the user changes spec (triggering rollback), this logic can still move traffic onto a non-ready backend and cause request failures; rollback traffic shifts should be conditioned on active-cluster readiness the same way forward migration is conditioned on pending readiness.
Useful? React with 👍 / 👎.
This reverts commit b8b66d4.
| NoPendingCluster RayServiceConditionReason = "NoPendingCluster" | ||
| NoActiveCluster RayServiceConditionReason = "NoActiveCluster" | ||
| RayServiceValidationFailed RayServiceConditionReason = "ValidationFailed" | ||
| TargetClusterChanged RayServiceConditionReason = "TargetClusterChanged" |
There was a problem hiding this comment.
As a follow-up, we should also consider renaming this reason as discussed here.
I'd suggest renaming it to GoalClusterSpecChanged (or a clearer name). Here's the rationale:
Goalprefix: Consistent with the naming convention used across all log and error messages in thereconcileRollbackStatemethodSpecChangedsuffix: More accurate, since this reflects a spec change (RayService.Spec.RayClusterSpec) rather than a target cluster switch- Avoids ambiguity: The term "target cluster" is already used in a different context, where the head and serve services point to the ready cluster (please see here), so reusing it here could be misleading.
Happy to take this on in a follow-up PR!
There was a problem hiding this comment.
"TargetCluster" or "DesireCluster" is more commonly used term in Kubernetes, so I suggest changing the logs or other naming conventions if possible while we're still in alpha. For 3) the "target cluster" is only used as a reference for a Go variable and not actually exposed to users, so not sure it matters in this case?
There was a problem hiding this comment.
Thanks so much for the context and guidance. Really helpful to understand the broader Kubernetes conventions here!
Aligning with idiomatic "Target" / "Desired" terminology makes a lot of sense. And fair enough on point 3 since the variable isn't user-facing, the ambiguity concern is less of an issue.
With that in mind, here's what I'm thinking for the follow-up:
- Rename rollback-related logs and local variables in
reconcileRollbackStateto use "Target" (or "Desired") consistently - Update the condition reason to align with that same terminology (e.g.
TargetClusterSpecChanged)
Do you have a preference between Target and Desired? Happy to consolidate everything into a single follow-up change once we land on the right term. Thanks!
There was a problem hiding this comment.
+1 on changing the naming, DesiredClusterSpecChanged sounds good to me since we already use "desired" terminology in the API (
TargetCapacity which refers to the Serve capacity to scale within a cluster. I think we should use "target" to refer to fields that the controller is changing, while "desired" more clearly refers to user-declared intent - like the spec that they submit.
There was a problem hiding this comment.
Thanks Ryan, the naming makes a lot of sense!
* Disable RayMultiHostIndexing feature for TestReconcile_Multihost_Replicas (#4583) Signed-off-by: Future-Outlier <[email protected]> * Disable the field alignment lint check (#4560) * chore: disable govet fieldalignment lint Signed-off-by: jinbum9958 <[email protected]> * chore: remove redundant //nolint:govet comments Signed-off-by: jinbum9958 <[email protected]> * chore: restore RayCronJobSpec/RayCronJob comments Signed-off-by: jinbum9958 <[email protected]> * chore: restore RayCronJob comment Signed-off-by: jinbum9958 <[email protected]> * chore: regenerate artifacts for RayCronJob comment changes Signed-off-by: jinbum9958 <[email protected]> --------- Signed-off-by: jinbum9958 <[email protected]> * Mark RC releases as pre-release in GoReleaser (#4590) Signed-off-by: Future-Outlier <[email protected]> * Add example YAML for manually enabling Ray k8s auth (#4582) Signed-off-by: Andrew Sy Kim <[email protected]> * [RayService] Rollback Support for Incremental Upgrades (#4109) * Implement rollback support Signed-off-by: Ryan O'Leary <[email protected]> * Fix unit test file Signed-off-by: Ryan O'Leary <[email protected]> * Ensure upgrade in progress status is cleared Signed-off-by: Ryan O'Leary <[email protected]> * Clarify rollback scenarios and clear pending apps during rollback Signed-off-by: Ryan O'Leary <[email protected]> * Fix typo Signed-off-by: Ryan O'Leary <[email protected]> * Add guard for rollback to prepare new cluster Signed-off-by: Ryan O'Leary <[email protected]> * fix e2e test checking pending cluster name Signed-off-by: Ryan O'Leary <[email protected]> * Fix rollback calculation Signed-off-by: Ryan O'Leary <[email protected]> * Fix httproute name in test Signed-off-by: Ryan O'Leary <[email protected]> * Increase timeout for checking RayCluster deletion Signed-off-by: Ryan O'Leary <[email protected]> * Fix rollback check and change httproute equal check so it doesn't always evaluate false Signed-off-by: Ryan O'Leary <[email protected]> * Adjust logic to use timestamp of growing cluster Signed-off-by: Ryan O'Leary <[email protected]> * Fix naming of vars Signed-off-by: Ryan O'Leary <[email protected]> --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> * [RayService] Promote Incremental Upgrade Feature to Beta (#4599) Signed-off-by: Ryan O'Leary <[email protected]> * Revert "[RayService] Promote Incremental Upgrade Feature to Beta (#4599)" (#4602) This reverts commit 0c8aab9. * [Github Action] Skip krew-index update for pre-release tags (#4587) * [CI] Skip krew-index update for pre-release tags Co-Authored-By: Claude Opus 4.6 <[email protected]> Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [Helm] update ray-cluster chart to apiVersion: v2 (#4593) * Update kind configs and docs to v1.29.0 (#4595) * Remove ingress template support for k8s < 1.19 (#4591) * [Helm] Update ray-cluster default resource values (#4588) * update kuberay 1.6.0 helm chart and readme Signed-off-by: Future-Outlier <[email protected]> * [Helm] Fix apiserver README chart version and align ray-cluster resource defaults Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * fix ctl test Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> * fix lint Signed-off-by: Future-Outlier <[email protected]> * chore: Use RayCluster name as SA name for RBAC auth (#4611) Signed-off-by: JiangJiaWei1103 <[email protected]> * Add Google Artifact Registry image build/push guide (#4618) * [History Server] Fix API response format to match Ray Dashboard frontend schema (#4615) * [History Server] Fix API response format to match Ray Dashboard frontend schema Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * fix bug 1 Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * add comments to explain frontend fields naming issue Signed-off-by: Future-Outlier <[email protected]> * fix tests Signed-off-by: Future-Outlier <[email protected]> * Add comments Signed-off-by: Future-Outlier <[email protected]> * update Signed-off-by: Future-Outlier <[email protected]> * Add frontend reference for this PR Signed-off-by: Future-Outlier <[email protected]> * fix codex and cursor bug Signed-off-by: Future-Outlier <[email protected]> * upadte Signed-off-by: Future-Outlier <[email protected]> * add actor.LabelSelector Signed-off-by: Future-Outlier <[email protected]> * better query for additional endpoints, request URI (path + query) Signed-off-by: Future-Outlier <[email protected]> * Add comments Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> --------- Signed-off-by: Future-Outlier <[email protected]> Signed-off-by: jinbum9958 <[email protected]> Signed-off-by: Andrew Sy Kim <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: JiangJiaWei1103 <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]> Co-authored-by: Jinbum Kim <[email protected]> Co-authored-by: Andrew Sy Kim <[email protected]> Co-authored-by: Ryan O'Leary <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Matt Boersma <[email protected]> Co-authored-by: Jia-Wei Jiang <[email protected]> Co-authored-by: Chia-Yi Liang <[email protected]>


Why are these changes needed?
Implement rollback support for the IncrementalUpgrade RayService feature.
Related issue number
#3209
Checks