Skip to content

[RayService] Rollback Support for Incremental Upgrades#4109

Merged
andrewsykim merged 14 commits intoray-project:masterfrom
ryanaoleary:rollback-support
Mar 13, 2026
Merged

[RayService] Rollback Support for Incremental Upgrades#4109
andrewsykim merged 14 commits intoray-project:masterfrom
ryanaoleary:rollback-support

Conversation

@ryanaoleary
Copy link
Collaborator

Why are these changes needed?

Implement rollback support for the IncrementalUpgrade RayService feature.

Related issue number

#3209

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@ryanaoleary
Copy link
Collaborator Author

cc: @Future-Outlier the changes for full rollback support are just in this commit: f5d0243

Signed-off-by: Ryan O'Leary <[email protected]>
ryanaoleary and others added 2 commits March 5, 2026 04:03
@ryanaoleary ryanaoleary marked this pull request as ready for review March 5, 2026 12:08
@ryanaoleary
Copy link
Collaborator Author

cc: @Future-Outlier this is rebased on the latest changes now and should be good to review

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@Future-Outlier
Copy link
Member

cc @JiangJiaWei1103 to take a look too

Copy link
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some nits about naming and terminology in the logs, but otherwise I think the logic seems fine. @Future-Outlier @rueian any concerns?

if isRollbackInProgress {
// Gradually shift traffic from the pending to the active cluster.
// Rollback traffic migration occurs regardless of pending cluster readiness.
logger.Info("Rollback in progress. Shifting traffic back to active cluster.", "stepSize", *options.StepSizePercent)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the active / pending terminology will get confusing going forward, especially as we add rollback support. Do we use these terms in the APIs or only in the logs? If only in the logs we should consider changing the terminology in a future PR. Maybe next cluster and previous cluster might be easier to understand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

active and pending are used in the API in ActiveServiceStatus and PendingServiceStatus, so I used similar terminology when referring to the cluster that's being scaled for each service. Currently we only refer to the active/pending clusters in the comments/logs though so that should be fine to change to next cluster and previous cluster. Can follow-up in the next PR if needed

@andrewsykim
Copy link
Member

cc @spencer-p for review as well since you've been thinking about Gateway integrations in KubeRay as well

Signed-off-by: Ryan O'Leary <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I will come back to review this after kuberay 1.6 release
  2. due to my personal bandwidth, I might don't have bandwidth review this now
  3. if @andrewsykim and @rueian approve, I am ok to merge this
  4. we will need to merge @JiangJiaWei1103 's e2e test for incremental upgrades before this get merged

@andrewsykim andrewsykim merged commit b8b66d4 into ray-project:master Mar 13, 2026
31 checks passed
@github-project-automation github-project-automation bot moved this from To Review to Done in My Kuberay & Ray Mar 13, 2026
@Future-Outlier
Copy link
Member

We committed in the KubeRay community sync that the E2E tests must be merged before we can merge the incremental upgrade PR. I may not have communicated that clearly enough earlier. If Ryan and Jia-Wei’s E2E tests are not merged, I will revert this PR before the 1.6 release.

@Future-Outlier
Copy link
Member

@codex review

@Future-Outlier
Copy link
Member

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

if meta.IsStatusConditionTrue(rayServiceInstance.Status.Conditions, string(rayv1.RollbackInProgress)) {
// Rollback the upgrade. The active RayCluster should be scaled back to 100% target_capacity,
// while the pending RayCluster is scaled to 0%. This is the inverse of the regular upgrade path.
activeTrafficRoutedPercent := ptr.Deref(activeRayServiceStatus.TrafficRoutedPercent, 0)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent default for active TrafficRoutedPercent causes potential deadlock

Medium Severity

During rollback, reconcileServeTargetCapacity defaults active TrafficRoutedPercent to 0 via ptr.Deref(..., 0), while calculateTrafficRoutedPercent defaults the same field to 100 via ptr.Deref(..., 100). If TrafficRoutedPercent is ever nil during rollback, the capacity reconciler sees activeTargetCapacity(100) != activeTrafficRoutedPercent(0) and defers, while the traffic calculator sees activeClusterWeight(100) == activeClusterTargetCapacity(100) and stops migration. Neither function makes progress, creating a deadlock where the rollback can never complete.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ryanaoleary to take a look

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct solution is to default it to TargetCapacity if it's ever nil, rather than 100 or 0 since either can result in deadlock (100 (capacity) != 0 (traffic) -> defer or 100 (traffic) == 100 (capacity) -> defer if we default when it becomes nil.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in d02b7be and added unit tests for this case

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8b66d42a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +768 to +771
if isRollbackInProgress {
// Gradually shift traffic from the pending to the active cluster.
// Rollback traffic migration occurs regardless of pending cluster readiness.
logger.Info("Rollback in progress. Shifting traffic back to active cluster.", "stepSize", *options.StepSizePercent)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate rollback traffic migration on active readiness

The rollback branch starts shifting HTTPRoute weight back to the active cluster whenever isRollbackInProgress is true, but it never checks whether the active cluster is actually ready to serve. In the case where the active cluster is unhealthy during an in-progress upgrade and the user changes spec (triggering rollback), this logic can still move traffic onto a non-ready backend and cause request failures; rollback traffic shifts should be conditioned on active-cluster readiness the same way forward migration is conditioned on pending readiness.

Useful? React with 👍 / 👎.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ryanaoleary to take a look

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 81ad0cf

NoPendingCluster RayServiceConditionReason = "NoPendingCluster"
NoActiveCluster RayServiceConditionReason = "NoActiveCluster"
RayServiceValidationFailed RayServiceConditionReason = "ValidationFailed"
TargetClusterChanged RayServiceConditionReason = "TargetClusterChanged"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up, we should also consider renaming this reason as discussed here.

I'd suggest renaming it to GoalClusterSpecChanged (or a clearer name). Here's the rationale:

  1. Goal prefix: Consistent with the naming convention used across all log and error messages in the reconcileRollbackState method
  2. SpecChanged suffix: More accurate, since this reflects a spec change (RayService.Spec.RayClusterSpec) rather than a target cluster switch
  3. Avoids ambiguity: The term "target cluster" is already used in a different context, where the head and serve services point to the ready cluster (please see here), so reusing it here could be misleading.

Happy to take this on in a follow-up PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"TargetCluster" or "DesireCluster" is more commonly used term in Kubernetes, so I suggest changing the logs or other naming conventions if possible while we're still in alpha. For 3) the "target cluster" is only used as a reference for a Go variable and not actually exposed to users, so not sure it matters in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the context and guidance. Really helpful to understand the broader Kubernetes conventions here!

Aligning with idiomatic "Target" / "Desired" terminology makes a lot of sense. And fair enough on point 3 since the variable isn't user-facing, the ambiguity concern is less of an issue.

With that in mind, here's what I'm thinking for the follow-up:

  • Rename rollback-related logs and local variables in reconcileRollbackState to use "Target" (or "Desired") consistently
  • Update the condition reason to align with that same terminology (e.g. TargetClusterSpecChanged)

Do you have a preference between Target and Desired? Happy to consolidate everything into a single follow-up change once we land on the right term. Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing the naming, DesiredClusterSpecChanged sounds good to me since we already use "desired" terminology in the API (

DesiredCPU resource.Quantity `json:"desiredCPU,omitempty"`
). We also already have TargetCapacity which refers to the Serve capacity to scale within a cluster. I think we should use "target" to refer to fields that the controller is changing, while "desired" more clearly refers to user-declared intent - like the spec that they submit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ryan, the naming makes a lot of sense!

rueian added a commit that referenced this pull request Mar 19, 2026
* Disable RayMultiHostIndexing feature for TestReconcile_Multihost_Replicas (#4583)

Signed-off-by: Future-Outlier <[email protected]>

* Disable the field alignment lint check (#4560)

* chore: disable govet fieldalignment lint

Signed-off-by: jinbum9958 <[email protected]>

* chore: remove redundant //nolint:govet comments

Signed-off-by: jinbum9958 <[email protected]>

* chore: restore RayCronJobSpec/RayCronJob comments

Signed-off-by: jinbum9958 <[email protected]>

* chore: restore RayCronJob comment

Signed-off-by: jinbum9958 <[email protected]>

* chore: regenerate artifacts for RayCronJob comment changes

Signed-off-by: jinbum9958 <[email protected]>

---------

Signed-off-by: jinbum9958 <[email protected]>

* Mark RC releases as pre-release in GoReleaser (#4590)

Signed-off-by: Future-Outlier <[email protected]>

* Add example YAML for manually enabling Ray k8s auth (#4582)

Signed-off-by: Andrew Sy Kim <[email protected]>

* [RayService] Rollback Support for Incremental Upgrades (#4109)

* Implement rollback support

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix unit test file

Signed-off-by: Ryan O'Leary <[email protected]>

* Ensure upgrade in progress status is cleared

Signed-off-by: Ryan O'Leary <[email protected]>

* Clarify rollback scenarios and clear pending apps during rollback

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix typo

Signed-off-by: Ryan O'Leary <[email protected]>

* Add guard for rollback to prepare new cluster

Signed-off-by: Ryan O'Leary <[email protected]>

* fix e2e test checking pending cluster name

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix rollback calculation

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix httproute name in test

Signed-off-by: Ryan O'Leary <[email protected]>

* Increase timeout for checking RayCluster deletion

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix rollback check and change httproute equal check so it doesn't always evaluate false

Signed-off-by: Ryan O'Leary <[email protected]>

* Adjust logic to use timestamp of growing cluster

Signed-off-by: Ryan O'Leary <[email protected]>

* Fix naming of vars

Signed-off-by: Ryan O'Leary <[email protected]>

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>

* [RayService] Promote Incremental Upgrade Feature to Beta (#4599)

Signed-off-by: Ryan O'Leary <[email protected]>

* Revert "[RayService] Promote Incremental Upgrade Feature to Beta (#4599)" (#4602)

This reverts commit 0c8aab9.

* [Github Action] Skip krew-index update for pre-release tags (#4587)

* [CI] Skip krew-index update for pre-release tags

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>

* [Helm] update ray-cluster chart to apiVersion: v2 (#4593)

* Update kind configs and docs to v1.29.0 (#4595)

* Remove ingress template support for k8s < 1.19 (#4591)

* [Helm] Update ray-cluster default resource values (#4588)

* update kuberay 1.6.0 helm chart and readme

Signed-off-by: Future-Outlier <[email protected]>

* [Helm] Fix apiserver README chart version and align ray-cluster resource defaults

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* fix ctl test

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>

* fix lint

Signed-off-by: Future-Outlier <[email protected]>

* chore: Use RayCluster name as SA name for RBAC auth (#4611)

Signed-off-by: JiangJiaWei1103 <[email protected]>

* Add Google Artifact Registry image build/push guide (#4618)

* [History Server] Fix API response format to match Ray Dashboard frontend schema (#4615)

* [History Server] Fix API response format to match Ray Dashboard frontend schema

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* fix bug 1

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* add comments to explain frontend fields naming issue

Signed-off-by: Future-Outlier <[email protected]>

* fix tests

Signed-off-by: Future-Outlier <[email protected]>

* Add comments

Signed-off-by: Future-Outlier <[email protected]>

* update

Signed-off-by: Future-Outlier <[email protected]>

* Add frontend reference for this PR

Signed-off-by: Future-Outlier <[email protected]>

* fix codex and cursor bug

Signed-off-by: Future-Outlier <[email protected]>

* upadte

Signed-off-by: Future-Outlier <[email protected]>

* add actor.LabelSelector

Signed-off-by: Future-Outlier <[email protected]>

* better query for additional endpoints, request URI (path + query)

Signed-off-by: Future-Outlier <[email protected]>

* Add comments

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>

---------

Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: jinbum9958 <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
Co-authored-by: Jinbum Kim <[email protected]>
Co-authored-by: Andrew Sy Kim <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Matt Boersma <[email protected]>
Co-authored-by: Jia-Wei Jiang <[email protected]>
Co-authored-by: Chia-Yi Liang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants