44c4aef
docs: Add step-by-step process of the basic e2e
JiangJiaWei1103 Feb 26, 2026
01bd86e
chore: Align test infra setup order with Ray docs
JiangJiaWei1103 Feb 26, 2026
40f1d5c
test: Add locust load test for incr upgrade e2e
JiangJiaWei1103 Feb 26, 2026
bb4a26f
docs: Improve maintainability of locust yaml
JiangJiaWei1103 Feb 27, 2026
afc9dcc
refactor: Remove redundant helper
JiangJiaWei1103 Feb 27, 2026
89f5478
fix: Deflake hardcoded sleep for Locust ramp up
JiangJiaWei1103 Feb 27, 2026
004a757
refactor: Remove legacy and extract a helper to get rps index
JiangJiaWei1103 Feb 27, 2026
f00ab71
test: Ensure remaining traffic routed to the new cluster
JiangJiaWei1103 Feb 28, 2026
ef802a6
test: Support high-rps serve application with expected rps over 900
JiangJiaWei1103 Mar 1, 2026
87ca63d
Test CI
JiangJiaWei1103 Mar 1, 2026
1b8ab65
Test CI
JiangJiaWei1103 Mar 1, 2026
504ecf5
Test CI
JiangJiaWei1103 Mar 1, 2026
e64a6e7
test: Recover basic incr upgrade test
JiangJiaWei1103 Mar 2, 2026
25f34ee
docs: Improve maintainability
JiangJiaWei1103 Mar 3, 2026
6b23507
fix: Deflake CI istio gc installation
JiangJiaWei1103 Mar 3, 2026
0aa45a1
fix: Deflake istio gc installation
JiangJiaWei1103 Mar 3, 2026
dc98432
revert: Use orig install order
JiangJiaWei1103 Mar 3, 2026
7e4d888
test: Support diverse incr upgrade parameter combinations
JiangJiaWei1103 Mar 3, 2026
8f6df3d
Test CI
JiangJiaWei1103 Mar 3, 2026
f71836a
fix: Skip transient state check right before promotion
JiangJiaWei1103 Mar 4, 2026
b1eeab8
test: Retest standard gradual incr upgrade
JiangJiaWei1103 Mar 4, 2026
7c2e504
refactor: Make curl function clearer
JiangJiaWei1103 Mar 5, 2026
62fa293
fix: Avoid t.FailNow from non-test goroutines
JiangJiaWei1103 Mar 5, 2026
d152779
fix: Deflake by using commit hash
JiangJiaWei1103 Mar 5, 2026
b9db8ee
refactor: Remove redundant checks
JiangJiaWei1103 Mar 5, 2026
442db2d
refactor: Get rps col index without hardcoded int
JiangJiaWei1103 Mar 5, 2026
2e7042e
refactor: Extract locust warmup constants for tweaking
JiangJiaWei1103 Mar 5, 2026
6ae50b8
Remove redundant line
JiangJiaWei1103 Mar 5, 2026
9737c31
test: Split test responsibilities
JiangJiaWei1103 Mar 6, 2026
c57f594
refactor: Use eg instead of wg
JiangJiaWei1103 Mar 6, 2026
f183c38
refactor: Extract trigger incr upgrade helper and curl const
JiangJiaWei1103 Mar 6, 2026
bf164e9
fix: Deflake waiting for upgrade complete using a longer timeout
JiangJiaWei1103 Mar 6, 2026
9becace
Improve readability
JiangJiaWei1103 Mar 6, 2026
b485e1e
fix: Remove data race on err btw goroutines
JiangJiaWei1103 Mar 6, 2026
38bce3f
fix: Fix last migrate time check
JiangJiaWei1103 Mar 6, 2026
6acc27a
Merge branch 'master' into add-load-tests-incr-upgrade-e2e
ryanaoleary Mar 13, 2026
fe092d0
Fix merge and update rollback test for changes in this PR
ryanaoleary Mar 13, 2026
067ba1f
chore: Use a larger runner for higher RPS
JiangJiaWei1103 Mar 14, 2026
d0025dc
Revert "chore: Use a larger runner for higher RPS"
JiangJiaWei1103 Mar 14, 2026
5eef493
Revert "[RayService] Rollback Support for Incremental Upgrades (#4109)"
Future-Outlier Mar 16, 2026
675d3ef
codex issue
Future-Outlier Mar 16, 2026
4c38b7e
chore: Deflake by increasing timeout
JiangJiaWei1103 Mar 16, 2026
7e821f3
chore: Remove worker rsc limits to align with the orig example
JiangJiaWei1103 Mar 16, 2026
13011d6
Revert "chore: Remove worker rsc limits to align with the orig example"
JiangJiaWei1103 Mar 16, 2026
73e6637
docs: Better strategy naming
JiangJiaWei1103 Mar 16, 2026
baf2d6c
Revert: Remove rollback e2e
JiangJiaWei1103 Mar 16, 2026
44fdb7e
Revert "Revert: Remove rollback e2e"
JiangJiaWei1103 Mar 16, 2026
d1591b9
chore: Use Ray project serve config example link
JiangJiaWei1103 Mar 16, 2026
ffd38b6
chore: Deflake CI by lowering steady state RPS
JiangJiaWei1103 Mar 17, 2026
9f745c3
fix: Make log msg accurate
JiangJiaWei1103 Mar 19, 2026
fdf988f
fix: Make log msg more accurate
JiangJiaWei1103 Mar 19, 2026
2 changes: 1 addition & 1 deletion .buildkite/test-e2e.yml
@@ -94,7 +94,7 @@
- set -o pipefail
- mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
- echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
-      - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
+      - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=10m KUBERAY_TEST_TIMEOUT_LONG=20m go test -timeout 60m -v ./test/e2eincrementalupgrade 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-log.tar -T - && exit 1)
Review comment from the PR author (Contributor):

We increase the timeout to deflake the e2e test.

- echo "--- END:RayService Incremental Upgrade E2E (nightly operator) tests finished"

- label: 'Test Autoscaler E2E Part 1 (nightly operator)'
121 changes: 121 additions & 0 deletions ray-operator/test/e2eincrementalupgrade/constant.go
@@ -0,0 +1,121 @@
package e2eincrementalupgrade

import "k8s.io/utils/ptr"

// These parameters control capacity scaling and gradual traffic migration during the upgrade.
type incrementalUpgradeParams struct {
	Name     string
	StepSize int32
	Interval int32
	MaxSurge int32
}

// incrementalUpgradeCombinations defines diverse (stepSize, interval, maxSurge) combinations
// to exercise different upgrade behaviors. Each combination targets a distinct scenario.
var incrementalUpgradeCombinations = []incrementalUpgradeParams{
	{
		// Scenario: Instant cutover.
		// All capacity and traffic shift in one step, which behaves like a blue/green deployment.
		StepSize: 100,
		Interval: 1,
		MaxSurge: 100,
		Name:     "BlueGreen",
	},
	{
		// Scenario: Aggressive gradual upgrade.
		// Larger traffic migration steps with shorter intervals.
		StepSize: 25,
		Interval: 5,
		MaxSurge: 50,
		Name:     "AggressiveGradual",
	},
	{
		// Scenario: Conservative gradual upgrade.
		// Smaller traffic migration steps with longer intervals.
		StepSize: 5,
		Interval: 10,
		MaxSurge: 25,
		Name:     "ConservativeGradual",
	},
}

// ptrs returns (*stepSize, *interval, *maxSurge) for use with the RayService bootstrap helper.
func (p incrementalUpgradeParams) ptrs() (*int32, *int32, *int32) {
	return ptr.To(p.StepSize), ptr.To(p.Interval), ptr.To(p.MaxSurge)
}

// The following defines the Serve configurations for different types of incremental upgrade tests, including:
// - Functional test
// - High-RPS Locust load test
//
// NOTE: working_dir is coupled with the external GitHub repos, which might lead to CI flakiness considering the
// availability and stability of these repos and specific commit hashes.

type serveConfigV2 string
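
Both Serve configs in this file pin `working_dir` to a full commit hash rather than a branch name. As a rough sketch of that convention, the hypothetical helper `pinnedArchiveURL` below builds a GitHub archive URL and rejects anything that is not a full 40-character hex hash. The helper and its validation rule are illustrative assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullSHA matches a full, lowercase 40-character SHA-1 commit hash.
var fullSHA = regexp.MustCompile(`^[0-9a-f]{40}$`)

// pinnedArchiveURL is a hypothetical helper that builds a GitHub archive URL
// pinned to a specific commit, refusing branch names and abbreviated hashes.
func pinnedArchiveURL(repo, commit string) (string, error) {
	if !fullSHA.MatchString(commit) {
		return "", fmt.Errorf("commit %q is not a full 40-character SHA-1 hash", commit)
	}
	return fmt.Sprintf("https://github.com/%s/archive/%s.zip", repo, commit), nil
}

func main() {
	url, err := pinnedArchiveURL("ray-project/test_dag", "78b4a5da38796123d9f9ffff59bab2792a043e95")
	fmt.Println(url, err)

	_, err = pinnedArchiveURL("ray-project/test_dag", "master")
	fmt.Println(err)
}
```

Pinning to a hash keeps the downloaded code fixed, but the test still depends on GitHub being reachable at run time, which is the flakiness risk the NOTE above describes.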

// defaultIncrementalUpgradeServeConfigV2 configures a Serve app for functional tests.
const defaultIncrementalUpgradeServeConfigV2 serveConfigV2 = `applications:
  - name: fruit_app
    import_path: fruit.deployment_graph
    route_prefix: /fruit
    runtime_env:
      working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
    deployments:
      - name: MangoStand
        num_replicas: 1
        user_config:
          price: 3
        ray_actor_options:
          num_cpus: 0.1
      - name: OrangeStand
        num_replicas: 1
        user_config:
          price: 2
        ray_actor_options:
          num_cpus: 0.1
      - name: FruitMarket
        num_replicas: 1
        ray_actor_options:
          num_cpus: 0.1
  - name: math_app
    import_path: conditional_dag.serve_dag
    route_prefix: /calc
    runtime_env:
      working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
    deployments:
      - name: Adder
        num_replicas: 1
        user_config:
          increment: 3
        ray_actor_options:
          num_cpus: 0.1
      - name: Multiplier
        num_replicas: 1
        user_config:
          factor: 5
        ray_actor_options:
          num_cpus: 0.1
      - name: Router
        num_replicas: 1
        ray_actor_options:
          num_cpus: 0.1
`

// highRPSServeConfigV2 configures a minimal high-RPS Serve app (SimpleDeployment) for Locust load tests.
const highRPSServeConfigV2 serveConfigV2 = `applications:
  - name: simple_app
    import_path: locust_test.simple_serve:app
    route_prefix: /test
    runtime_env:
      working_dir: "https://github.com/ray-project/serve_config_examples/archive/530e247ca195530b71b92d7e708048a1bdc02583.zip"
    deployments:
      - name: SimpleDeployment
        autoscaling_config:
          min_replicas: 1
          max_replicas: 3
          target_ongoing_requests: 2
          max_ongoing_requests: 6
          upscale_delay_s: 0.5
        ray_actor_options:
          num_cpus: 2
`