
[Test] Add load tests and behavioral checks to incremental upgrade E2E#4541

Open
JiangJiaWei1103 wants to merge 51 commits into ray-project:master from JiangJiaWei1103:add-load-tests-incr-upgrade-e2e

Conversation

**JiangJiaWei1103** (Contributor) commented on Feb 26, 2026:

Why are these changes needed?

The existing RayService Incremental Upgrade E2E test is implemented as a single, monolithic functional test and doesn't include load testing. Hence, it doesn't verify system behavior under real traffic during upgrades. In addition, the current functional test doesn't provide comprehensive behavioral checks.

This PR introduces Locust-based load tests to validate incremental upgrade behavior under continuous traffic. It also adds more thorough behavioral checks to the functional test. Both test cases are executed across multiple upgrade strategies to improve test coverage.

Test Summary

RayCluster Setup

To run the Locust load tests, two RayClusters are required: one acting as the client (Locust) and the other managed by the RayService.

Locust Cluster (Client-Side)

The client cluster configuration follows the example defined here.

| Component | Replicas | CPU (req/limit) | Memory (req/limit) |
|---|---|---|---|
| Head | 1 | 300m / 500m | 1Gi / 2Gi |
| Worker | 0 (head-only) | - | - |

RayService Cluster (Server-Side)

Worker resource limits are intentionally not set to align with the original example proposed by @Future-Outlier here.

UPDATE:

Without resource limits, workers may consume more CPU and memory than their requested resources. This can lead to node-level resource contention and cause the incremental upgrade process to time out. To avoid this issue, resource limits are kept for worker pods, particularly in CI environments where compute resources are constrained (e.g., Buildkite runners with 8 vCPUs).

In addition, setting explicit CPU requests for the head pod can make it unschedulable in single-node test environments due to `Insufficient cpu` errors. To keep the test environment schedulable, the head is configured with `rayStartParams["num-cpus"]: 0` while retaining minimal resource requests.

| Component | Replicas (min/max) | CPU (req/limit) | Memory (req/limit) |
|---|---|---|---|
| Head | 1 | - (`rayStartParams["num-cpus"]: 0`) | - |
| Worker | 1 / 4 | 2 / 2 | 2Gi / 2Gi |
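For illustration, the head group setup described above looks roughly like this in the RayCluster spec. This is a sketch only; the exact values and surrounding fields in the PR's manifest may differ:

```yaml
headGroupSpec:
  rayStartParams:
    # Ray advertises 0 CPUs on the head, so no workloads are scheduled
    # there and no explicit CPU request is needed for scheduling headroom.
    num-cpus: "0"
  template:
    spec:
      containers:
        - name: ray-head
          resources:
            requests:
              memory: 1Gi   # minimal request only; values are illustrative
```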

Test Matrix

Both test cases are executed with multiple upgrade strategies to improve coverage.

| Upgrade Strategy | (maxSurgePercent, stepSizePercent, intervalSeconds) | Description |
|---|---|---|
| BlueGreen | (100, 100, 1) | Instant traffic cutover |
| AggressiveGradual | (50, 25, 2) | Larger traffic migration steps with shorter intervals |
| ConservativeGradual | (25, 5, 10) | Smaller traffic migration steps with longer intervals |
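For illustration, a row of the matrix maps onto the RayService upgrade strategy roughly as follows. This is a sketch: the field layout, including the `clusterUpgradeOptions` block and the `NewClusterWithIncrementalUpgrade` type name, is an assumption and should be checked against the incremental upgrade CRD:

```yaml
spec:
  upgradeStrategy:
    type: NewClusterWithIncrementalUpgrade
    clusterUpgradeOptions:
      maxSurgePercent: 25    # ConservativeGradual row above
      stepSizePercent: 5
      intervalSeconds: 10
```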

Behavioral Checks

To improve the robustness of the e2e tests, we introduce additional behavioral checks, including verification that TargetCapacity and TrafficRoutedPercent progress monotonically. See the Change Summary below for details.

Test Results

Throughput

The CI throughput (~500 RPS, occasionally ~450) is roughly half that of local runs (~1000+ RPS). This is likely due to the smaller compute capacity of the Buildkite hosted runners:

Buildkite large instances provide 8 vCPUs, where each vCPU typically maps to one logical thread (hyper-thread) on a physical core. This means the available compute might be roughly half that of our local test machine. Furthermore, CI providers may enforce cgroup limits or CPU throttling, which can cause additional performance degradation. For full instance specifications, refer to the Buildkite hosted Linux sizes documentation.

NOTE: The largest instance size supported by the Ray ecosystem CI is `large`. For the error that occurs when selecting `xlarge`, please see this CI failure.

Screenshots (2026-03-05): the local run sustains ~1000+ RPS, while the CI run sustains ~500 RPS.

Overall E2E

Screenshots (2026-03-16): results for TestRayServiceIncrementalUpgrade, TestRayServiceIncrementalUpgradeWithLocust, and TestRayServiceIncrementalUpgradeRollback.

Change Summary

TestRayServiceIncrementalUpgrade

Add comprehensive behavioral checks covering:

  • The current state value matches the expected value, or the RayService has already finished upgrading
  • Both old and new versions serve traffic during the upgrade, ensuring no requests are dropped
  • Traffic migration respects the configured interval seconds
  • Active TargetCapacity is monotonically decreasing while Pending TargetCapacity is monotonically increasing
  • Active TrafficRoutedPercent is monotonically decreasing while Pending TrafficRoutedPercent is monotonically increasing
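The two monotonicity checks above can be sketched as a small helper. This is a simplified illustration, not the PR's actual test code (which asserts against RayService status fields with Gomega); `monotonic` and the sample values are hypothetical:

```go
package main

import "fmt"

// monotonic reports whether successive samples never move against the
// expected direction: dir = +1 allows only non-decreasing values (the
// Pending side), dir = -1 only non-increasing values (the Active side).
// Plateaus are allowed, since the controller holds each step until
// intervalSeconds elapses.
func monotonic(samples []int32, dir int) bool {
	for i := 1; i < len(samples); i++ {
		if dir > 0 && samples[i] < samples[i-1] {
			return false
		}
		if dir < 0 && samples[i] > samples[i-1] {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative TargetCapacity samples taken while an upgrade progresses.
	active := []int32{100, 75, 75, 50, 25, 0}
	pending := []int32{0, 25, 25, 50, 75, 100}
	fmt.Println(monotonic(active, -1), monotonic(pending, +1)) // true true
}
```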

TestRayServiceIncrementalUpgradeWithLocust

  • Add a new E2E test case TestRayServiceIncrementalUpgradeWithLocust that runs Locust load tests
    • Run Locust in a background goroutine
    • Warm up Locust first (entering a steady state) before triggering the upgrade
    • Verify that no requests are dropped during the upgrade under load
  • Add RayCluster and ConfigMap manifests required to run the Locust load tests
  • Add a new ServeConfigV2 configuration with a lightweight Serve application that directly returns a response
    • This is designed for high-RPS load testing scenarios
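The warm-up step above can be sketched as follows. This is a simplified model of the steady-state detection (consecutive polls at or above an RPS threshold); `warmedUp`, the negative sentinel for failed stats polls, and all values are illustrative rather than the PR's implementation:

```go
package main

import "fmt"

// warmedUp reports whether the measured RPS stayed at or above threshold
// for `window` consecutive successful polls. A failed or unparsable poll
// (modeled here as a negative sample) is skipped without resetting the
// counter, since it does not prove the real RPS dropped below threshold.
func warmedUp(samples []float64, threshold float64, window int) bool {
	stable := 0
	for _, rps := range samples {
		if rps < 0 { // poll failed: neither counts toward nor resets the window
			continue
		}
		if rps >= threshold {
			stable++
			if stable >= window {
				return true
			}
		} else {
			stable = 0 // a genuinely low reading restarts the stable window
		}
	}
	return false
}

func main() {
	samples := []float64{120, 480, 510, -1, 505, 470}
	// true: three successful polls >= 450 (480, 510, 505), failed poll skipped
	fmt.Println(warmedUp(samples, 450, 3))
}
```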

Limitations and Future Improvements

  • The Serve application source code is currently hosted in my personal repo. It should be migrated to an official ray-project organization repository.
  • Ray and image versions are currently hardcoded in locust-cluster.incremental-upgrade.yaml, following the same pattern used in the existing HA E2E tests here.
  • Locust warm-up parameters are currently defined as named constants. We may need to further tune the ServeConfigV2 implementation and the RayCluster configuration to support higher RPS in CI:
    • Current throughput: ~500 RPS
    • Target throughput: 1000+ RPS

Related issue number

#3209

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
**JiangJiaWei1103** (Contributor, Author) commented:
Current Local E2E Test Results

(Screenshot, 2026-02-26: local E2E test results)

The next steps are:

  • Address flaky parts (see TODO in the e2e test file)
  • Add more test cases with diverse (stepSize, interval, maxSurge) combinations

```yaml
- kind create cluster --wait 900s --config ./ci/kind-config-buildkite-1-29.yml
- kubectl config set clusters.kind-kind.server https://docker:6443

# Install MetalLB for LoadBalancer IPs on Kind
```
**JiangJiaWei1103** (Contributor, Author) commented:

The setup order is rearranged to align with the official Ray docs example, so developers can follow along without wondering why the steps differ.

**JiangJiaWei1103** (Contributor, Author) commented:

Reverted at dc98432 due to duplicate installation of Istio GatewayClass. We might revisit this in the future.

Future-Outlier and others added 6 commits on March 16, 2026.
```go
	Should(Not(BeNil()))
LogWithTimestamp(test.T(), "Verifying both old and new versions served traffic during the upgrade")
g.Expect(oldVersionServed).To(BeTrue(), "The old version of the service should have served traffic during the upgrade.")
g.Expect(newVersionServed).To(BeTrue(), "The new version of the service should have served traffic during the upgrade.")
```
**Cursor Bugbot** commented:
BlueGreen scenario may fail both-versions-served assertion

Medium Severity

For the BlueGreen scenario (stepSize=100, interval=1, maxSurge=100), the upgrade may complete during the preceding validation steps (pending head pod readiness, HTTPRoute backend checks at lines 106–123), since the entire traffic shift happens in a single 1-second step. When the behavioral check loop starts, g.Eventually can succeed via the !IsRayServiceUpgrading(svc) escape clause, the curl returns only "8" (new version), and the loop breaks — leaving oldVersionServed as false. The assertion at line 205 then fails. This is a race condition that could cause flaky test failures.


**JiangJiaWei1103** (Contributor, Author) commented:

This can happen when the upgrade completes too quickly, even before the test enters the upgradeSteps loop. In practice, this scenario is rare.

For now, I suggest keeping the current behavior unchanged. We can revisit whether it is reasonable to assert that both clusters should serve traffic during the upgrade for the Blue/Green strategy, which is effectively a single-step upgrade rather than a gradual traffic migration.

```yaml
- mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
- echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
- KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
- KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=10m KUBERAY_TEST_TIMEOUT_LONG=20m go test -timeout 60m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
```
**JiangJiaWei1103** (Contributor, Author) commented:

We increase the timeout to deflake the e2e test.

Comment on lines +14 to +20
```yaml
resources:
  requests:
    cpu: 300m
    memory: 1G
  limits:
    cpu: 500m
    memory: 2G
```
**JiangJiaWei1103** (Contributor, Author) commented:

For the resource setup of the Locust RayCluster, we follow the practices here:

```yaml
resources:
  requests:
    cpu: 300m
    memory: 1G
  limits:
    cpu: 500m
    memory: 2G
```

```yaml
import_path: simple_serve.app
route_prefix: /test
runtime_env:
  working_dir: "https://github.com/jiangjiawei1103/incr-upgrade-locust/archive/a185bb29374388e801db4331ae73af3ad1e79a5f.zip"
```
**JiangJiaWei1103** (Contributor, Author) commented:

Thanks Ryan!! I'll change the URL once the PR is merged.

```go
			corev1ac.ContainerPort().WithName(utils.DashboardPortName).WithContainerPort(utils.DefaultDashboardPort),
			corev1ac.ContainerPort().WithName(utils.ClientPortName).WithContainerPort(utils.DefaultClientPort),
		).
		WithResources(corev1ac.ResourceRequirements().
```
**JiangJiaWei1103** (Contributor, Author) commented:

The resource setup is mainly constrained by Buildkite hardware limitations (8 vCPUs). For details, please refer to the PR description.

```diff
@@ -137,12 +168,12 @@ func IncrementalUpgradeRayServiceApplyConfiguration(
 		WithImage(GetRayImage()).
 		WithResources(corev1ac.ResourceRequirements().
 			WithRequests(corev1.ResourceList{
```
**JiangJiaWei1103** (Contributor, Author) commented:

ditto

```go
	serveConfigV2 serveConfigV2,
) *rayv1ac.RayServiceSpecApplyConfiguration {
	return rayv1ac.RayServiceSpec().
		WithUpgradeStrategy(rayv1ac.RayServiceUpgradeStrategy().
```
**ryanaoleary** (Collaborator) commented:

I recommend adding:

```go
WithRayClusterDeletionDelaySeconds(0).
```

here since the default deletion delay is 60 seconds, which adds unnecessary lag to the test since we check for cluster deletion. We could lower it to 0 or even just a value like 10 seconds to speed up these tests.

**JiangJiaWei1103** (Contributor, Author) commented:

Hi @ryanaoleary,

I noticed that only TestRayServiceIncrementalUpgradeRollback verifies whether the pending cluster is deleted after the rollback completes.

For TestRayServiceIncrementalUpgrade and TestRayServiceIncrementalUpgradeWithLocust, neither test checks whether the previous active clusters are cleaned up. Instead, they focus on verifying that traffic is correctly served by the new cluster (i.e., the newly promoted active cluster). Therefore, the RayClusterDeletionDelaySeconds setup wouldn't help speed up the e2e tests at this stage.

I suggest the following adjustments in this PR:

  1. Remove the rollback E2E tests for now, since this PR will be merged before the rollback logic itself.
  2. Reintroduce the rollback logic along with the corresponding rollback E2E tests. For each test, we should:
    • Run it under Locust load
    • Enable WithRayClusterDeletionDelaySeconds to accelerate verification that the pending cluster is cleaned up.

WDYT? If I misunderstood anything, please let me know. Thanks!

**JiangJiaWei1103** (Contributor, Author) commented:

The rollback e2e has been removed at baf2d6c. I'll merge master again once #4604 is merged. Thanks!

**JiangJiaWei1103** (Contributor, Author) commented:

We added the basic rollback e2e back at 44fdb7e and will enhance its coverage in follow-up PRs.

**ryanaoleary** (Collaborator) left a comment:

LGTM - just one small comment on the config that gets applied and the test path needs to be updated when ray-project/serve_config_examples#15 is merged.

This reverts commit baf2d6c, reversing changes made to 73e6637.
**Cursor Bugbot** left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


```go
	test.T().Logf("failed to parse RPS, retrying in 2 seconds: %s", err.Error())
	time.Sleep(2 * time.Second)
	continue
}
```

Warmup stableCount not reset on error retries

Medium Severity

In warmupLocust, when the stats query fails (stderr non-empty/stdout empty), the stats slice is too short, or float parsing fails, the loop continues without resetting stableCount. Since locustWarmupStableWindowSeconds represents consecutive seconds of stability, intermittent failures during the stable window are silently skipped, and the function can prematurely declare steady state based on non-consecutive stable checks.


**JiangJiaWei1103** (Contributor, Author) commented:

We reset stableCount only when the RPS value is successfully queried and parsed, and is below the rpsThreshold:

https://github.com/ray-project/kuberay/pull/4541/changes#diff-e421bf294e3026a8e3ee0aad96d7d11b2e1714e705fb859d1889bceac8cd2ba5R426-R430

The three cases you mentioned are commonly observed formatting/parsing issues, which don't reliably indicate that the actual RPS is below the threshold. Therefore, we don't reset stableCount in those scenarios.

**ryanaoleary** (Collaborator) commented:

Since #4604 is closed, I think we should actually keep the changes from 44fdb7e.

**win5923** (Member) left a comment:

Sorry for the late review; I spent some time getting familiar with the incremental upgrade PR.
Overall LGTM. Although I wasn't able to reach 400 RPS in my local tests, the CI results look stable, which should be sufficient.

JiangJiaWei1103 and others added 2 commits on March 19, 2026.