[Test] Add load tests and behavioral checks to incremental upgrade E2E #4541
JiangJiaWei1103 wants to merge 51 commits into ray-project:master
Conversation
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
```yaml
- kind create cluster --wait 900s --config ./ci/kind-config-buildkite-1-29.yml
- kubectl config set clusters.kind-kind.server https://docker:6443

# Install MetalLB for LoadBalancer IPs on Kind
```
The setup order is rearranged to align with the official Ray docs example, so developers can follow along without wondering why the steps differ.
Reverted at dc98432 due to duplicate installation of Istio GatewayClass. We might revisit this in the future.
This reverts commit 067ba1f.
…roject#4109)" This reverts commit b8b66d4.
Signed-off-by: Future-Outlier <eric901201@gmail.com>
This reverts commit 7e821f3.
```go
Should(Not(BeNil()))

LogWithTimestamp(test.T(), "Verifying both old and new versions served traffic during the upgrade")
g.Expect(oldVersionServed).To(BeTrue(), "The old version of the service should have served traffic during the upgrade.")
g.Expect(newVersionServed).To(BeTrue(), "The new version of the service should have served traffic during the upgrade.")
```
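The monotonic-progression checks mentioned in the PR description (`TargetCapacity` decreasing on the active cluster while `PendingTargetCapacity` increases) reduce to a simple order check over sampled values. A minimal, self-contained Go sketch of that idea — the helper name and sample values are ours, not the test's actual code:

```go
package main

import "fmt"

// monotonic reports whether sampled percentages only move in the expected
// direction: dir > 0 means non-decreasing, dir < 0 means non-increasing.
func monotonic(samples []int32, dir int) bool {
	for i := 1; i < len(samples); i++ {
		if dir > 0 && samples[i] < samples[i-1] {
			return false
		}
		if dir < 0 && samples[i] > samples[i-1] {
			return false
		}
	}
	return true
}

func main() {
	target := []int32{100, 80, 60, 40, 20, 0}  // active cluster TargetCapacity samples
	pending := []int32{0, 20, 40, 60, 80, 100} // pending cluster TargetCapacity samples
	fmt.Println(monotonic(target, -1), monotonic(pending, +1)) // true true
}
```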
BlueGreen scenario may fail both-versions-served assertion
Medium Severity
For the BlueGreen scenario (stepSize=100, interval=1, maxSurge=100), the upgrade may complete during the preceding validation steps (pending head pod readiness, HTTPRoute backend checks at lines 106–123), since the entire traffic shift happens in a single 1-second step. When the behavioral check loop starts, g.Eventually can succeed via the !IsRayServiceUpgrading(svc) escape clause, the curl returns only "8" (new version), and the loop breaks — leaving oldVersionServed as false. The assertion at line 205 then fails. This is a race condition that could cause flaky test failures.
This can happen when the upgrade completes too quickly, even before the test enters the upgradeSteps loop. In practice, this scenario is rare.
For now, I suggest keeping the current behavior unchanged. We can revisit whether it is reasonable to assert that both clusters should serve traffic during the upgrade for the Blue/Green strategy, which is effectively a single-step upgrade rather than a gradual traffic migration.
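The both-versions-served tracking discussed above can be modeled with a small pure helper. This is a simulation only: the real test derives responses by curling the service inside an `Eventually` poll, and while the source mentions the new version returning "8", the old-version value "6" and the helper name here are illustrative assumptions:

```go
package main

import "fmt"

// recordVersions scans the responses observed during an upgrade poll loop and
// reports which of the two versions answered at least once. In the real test
// each response comes from curling the RayService; here they are supplied
// directly for illustration.
func recordVersions(responses []string, oldResp, newResp string) (oldServed, newServed bool) {
	for _, r := range responses {
		switch r {
		case oldResp:
			oldServed = true
		case newResp:
			newServed = true
		}
	}
	return
}

func main() {
	// Gradual upgrade: both versions answer at some point during the shift.
	o, n := recordVersions([]string{"6", "6", "8", "6", "8"}, "6", "8")
	fmt.Println(o, n) // true true

	// BlueGreen race: the upgrade finished before polling started, so only
	// the new version is ever observed and the oldServed assertion would fail.
	o, n = recordVersions([]string{"8", "8"}, "6", "8")
	fmt.Println(o, n) // false true
}
```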
```diff
  - mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
  - echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
- - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
+ - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=10m KUBERAY_TEST_TIMEOUT_LONG=20m go test -timeout 60m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
```
We increase the timeout to deflake the e2e test.
```yaml
resources:
  requests:
    cpu: 300m
    memory: 1G
  limits:
    cpu: 500m
    memory: 2G
```
For the resource setup of the Locust RayCluster, we follow the practices here:
```yaml
import_path: simple_serve.app
route_prefix: /test
runtime_env:
  working_dir: "https://github.com/jiangjiawei1103/incr-upgrade-locust/archive/a185bb29374388e801db4331ae73af3ad1e79a5f.zip"
```
Thanks Ryan!! I'll change the URL once the PR is merged.
```go
	corev1ac.ContainerPort().WithName(utils.DashboardPortName).WithContainerPort(utils.DefaultDashboardPort),
	corev1ac.ContainerPort().WithName(utils.ClientPortName).WithContainerPort(utils.DefaultClientPort),
).
WithResources(corev1ac.ResourceRequirements().
```
The resource setup is mainly constrained by the Buildkite hardware limitation (8 vCPUs). For details, please refer to the PR description.
```go
@@ -137,12 +168,12 @@ func IncrementalUpgradeRayServiceApplyConfiguration(
	WithImage(GetRayImage()).
	WithResources(corev1ac.ResourceRequirements().
		WithRequests(corev1.ResourceList{
	serveConfigV2 serveConfigV2,
) *rayv1ac.RayServiceSpecApplyConfiguration {
	return rayv1ac.RayServiceSpec().
		WithUpgradeStrategy(rayv1ac.RayServiceUpgradeStrategy().
```
I recommend adding `WithRayClusterDeletionDelaySeconds(0)` here, since the default deletion delay is 60 seconds, which adds unnecessary lag to the test because we check for cluster deletion. We could lower it to 0, or even just a value like 10 seconds, to speed up these tests.
Hi @ryanaoleary,
I noticed that only TestRayServiceIncrementalUpgradeRollback verifies whether the pending cluster is deleted after the rollback completes.
For TestRayServiceIncrementalUpgrade and TestRayServiceIncrementalUpgradeWithLocust, neither test checks whether the previous active clusters are cleaned up. Instead, they focus on verifying that traffic is correctly served by the new cluster (i.e., the newly promoted active cluster). Therefore, the RayClusterDeletionDelaySeconds setup wouldn't help speed up the e2e tests at this stage.
I suggest the following adjustments in this PR:
- Remove the rollback E2E tests for now, since this PR will be merged before the rollback logic itself.
- Reintroduce the rollback logic along with the corresponding rollback E2E tests. For each test, we should:
- Run it under Locust load
- Enable `WithRayClusterDeletionDelaySeconds` to accelerate verification that the pending cluster is cleaned up.
WDYT? If I misunderstood anything, please let me know. Thanks!
We added the basic rollback e2e back in 44fdb7e, and we'll enhance rollback e2e coverage in follow-up PRs.
ryanaoleary
left a comment
LGTM - just one small comment on the config that gets applied; also, the test path needs to be updated when ray-project/serve_config_examples#15 is merged.
```go
	test.T().Logf("failed to parse RPS, retrying in 2 seconds: %s", err.Error())
	time.Sleep(2 * time.Second)
	continue
}
```
Warmup stableCount not reset on error retries
Medium Severity
In warmupLocust, when the stats query fails (stderr non-empty/stdout empty), the stats slice is too short, or float parsing fails, the loop continues without resetting stableCount. Since locustWarmupStableWindowSeconds represents consecutive seconds of stability, intermittent failures during the stable window are silently skipped, and the function can prematurely declare steady state based on non-consecutive stable checks.
We reset stableCount only when the RPS value is successfully queried and parsed, and is below the rpsThreshold:
The three cases you mentioned are commonly observed formatting/parsing issues, which don't reliably indicate that the actual RPS is below the threshold. Therefore, we don't reset stableCount in those scenarios.
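The reset policy described above — increment `stableCount` only on a successfully parsed sample, reset it only when the parsed RPS falls below the threshold, and skip failed queries without touching the streak — can be modeled with a small pure function. This is a simplified sketch; the type and function names are illustrative, not the test's actual code:

```go
package main

import "fmt"

// sample is one 1-second Locust stats observation.
type sample struct {
	rps float64 // parsed requests-per-second value
	ok  bool    // false when the stats query or float parsing failed
}

// steadyAfter returns the tick index at which the RPS has stayed at or above
// threshold for `window` consecutive successful samples, or -1 if steady
// state is never reached. Failed samples are skipped without resetting the
// streak, mirroring the behavior discussed above.
func steadyAfter(samples []sample, window int, threshold float64) int {
	stableCount := 0
	for i, s := range samples {
		if !s.ok {
			continue // parse/query failure: neither count nor reset
		}
		if s.rps >= threshold {
			stableCount++
			if stableCount >= window {
				return i
			}
		} else {
			stableCount = 0 // RPS below threshold: restart the stable window
		}
	}
	return -1
}

func main() {
	s := []sample{{100, true}, {450, true}, {0, false}, {460, true}, {470, true}}
	fmt.Println(steadyAfter(s, 3, 400)) // 4
}
```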
win5923
left a comment
Sorry for the late review, I spent some time getting familiar with the incremental upgrade PR.
Overall LGTM. Although I wasn’t able to reach 400 RPS in my local tests, the CI results look stable, which should be sufficient.
ray-operator/test/e2eincrementalupgrade/rayservice_incremental_upgrade_test.go
Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Jia-Wei Jiang <36886416+JiangJiaWei1103@users.noreply.github.com>



Why are these changes needed?
The existing RayService Incremental Upgrade E2E test is implemented as a single, monolithic functional test and doesn't include load testing. Hence, it doesn't verify system behavior under real traffic during upgrades. In addition, the current functional test doesn't provide comprehensive behavioral checks.
This PR introduces Locust-based load tests to validate incremental upgrade behavior under continuous traffic. It also adds more thorough behavioral checks to the functional test. Both test cases are executed across multiple upgrade strategies to improve test coverage.
Test Summary
RayCluster Setup
To run the Locust load tests, two RayClusters are required: one acting as the client (Locust) and the other managed by the RayService.
Locust Cluster (Client-Side)
The client cluster configuration follows the example defined here.
RayService Cluster (Server-Side)
Worker resource limits are intentionally not set to align with the original example proposed by @Future-Outlier here.

UPDATE: Without resource limits, workers may consume more CPU and memory than their requested resources. This can lead to node-level resource contention and cause the incremental upgrade process to time out. To avoid this issue, resource limits are kept for worker pods, particularly in CI environments where compute resources are constrained (e.g., Buildkite runners with 8 vCPUs).

In addition, setting explicit CPU requests for the head pod can make the pod unschedulable in single-node test environments due to `Insufficient cpu`. To ensure the test environment remains schedulable, the head is configured with `rayStartParams["num-cpus"]: 0` while keeping minimal resource requests.

Test Matrix
Both test cases are executed with multiple upgrade strategies to improve coverage:

- `BlueGreen`
- `AggressiveGradual`
- `ConservativeGradual`

Behavioral Checks
To improve the robustness of the e2e tests, we introduce additional behavioral checks, including verification that `TargetCapacity` and `TrafficRoutedPercent` progress monotonically. See the Change Summary below for details.

Test Results
Throughput
The CI throughput (~500 RPS, and sometimes ~450) is approximately half of the local test (~1000+ RPS). This is likely due to the smaller compute capacity of the Buildkite hosted runners:
Buildkite large instances provide 8 vCPUs, where each vCPU typically maps to one logical thread (hyper-thread) on a physical core. This means the available compute might be roughly half that of our local test machine. Furthermore, CI providers may enforce cgroup limits or CPU throttling, which can cause additional performance degradation. For full instance specifications, refer to the Buildkite hosted Linux sizes documentation.
NOTE: The largest instance supported by the Ray ecosystem CI is `large`. For the error when selecting `xlarge`, please see this CI failure.

Overall E2E
- `TestRayServiceIncrementalUpgrade`
- `TestRayServiceIncrementalUpgradeWithLocust`
- `TestRayServiceIncrementalUpgradeRollback`

Change Summary
`TestRayServiceIncrementalUpgrade`

Add comprehensive behavioral checks covering:

- `TargetCapacity` is monotonically decreasing while `PendingTargetCapacity` is monotonically increasing
- `TrafficRoutedPercent` is monotonically decreasing while `PendingTrafficRoutedPercent` is monotonically increasing

`TestRayServiceIncrementalUpgradeWithLocust`

- Add `TestRayServiceIncrementalUpgradeWithLocust` that runs Locust load tests
- Add `RayCluster` and `ConfigMap` manifests required to run the Locust load tests
- Add a `ServeConfigV2` configuration with a lightweight Serve application that directly returns a response

Limitations and Future Improvements
- The Serve application source code is currently hosted in my personal repo. It should be migrated to an official `ray-project` organization repository.
- … `ConfigMap`, but `working_dir` currently does not support local paths.
- … `locust-cluster.incremental-upgrade.yaml`, following the same pattern used in the existing HA E2E tests here.
- … the `ServeConfigV2` implementation and the `RayCluster` configuration to support higher RPS in CI.

Related issue number
#3209
Checks