[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults by raviguptaamd · Pull Request #143 · ROCm/MAD

raviguptaamd · 2026-04-08T09:00:40Z

Summary

Code review cleanup for MAD-private PR #209 (vLLM MoRI EP with PD disaggregation). Includes all original commits plus the following fixes.

Dockerfile

Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
Replace silent || true on vLLM build with a logged warning so failures are visible
Remove commented-out dead code (old base image ARG, CMAKE_PREFIX_PATH, git checkout)
Use consistent absolute path for versions.txt; add comment documenting base image dependency
Fix trailing whitespace
Combine version-tracking RUN layers into a single layer
Move py-spy and flask installs from runtime script into Dockerfile

Slurm script (`run_xPyD_models.slurm`)

Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale pre-MoRI image
Remove unreachable empty-string check after default assignment
Remove duplicate NUM_NODES echo

MoRI EP script (`vllm_disagg_mori_ep.sh`)

Major refactor: Extract 4 near-identical vllm serve blocks (~150 duplicated lines) into shared helper functions: setup_mori_env, build_kv_transfer_config, launch_vllm_worker, wait_for_proxy_and_cleanup, print_node_info
Extract 8 hardcoded port numbers into named variables at the top of the script
Fix typo "untill" -> "until" (3 occurrences)
Fix --gpu_memory_utilization to --gpu-memory-utilization for CLI arg consistency
Remove commented-out --compilation-config lines
Add graceful kill with kill ... 2>/dev/null || true

Server script (`vllm_disagg_server.sh`)

Add comments noting DeepSeek-R1 config shares architecture with V3

Test plan

Verify Docker image builds successfully with docker build -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile -t test:latest .
Run 1P1D DeepSeek-R1 MoRI EP benchmark to confirm refactored script produces identical behavior
Verify non-MoRI path (RUN_MORI=0) still works with DeepSeek-R1 via vllm_disagg_server.sh
Confirm DOCKER_IMAGE_NAME is now required (no silent default to wrong image)

…scripts Dockerfile: - Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|) - Replace silent `|| true` on vLLM build with logged warning - Remove commented-out dead code (old base image, CMAKE_PREFIX_PATH, git checkout) - Use consistent absolute path for versions.txt, add comment documenting base image dependency - Fix trailing whitespace on multiple lines - Combine version-tracking RUN layers into single layer - Move py-spy and flask installs from runtime into Dockerfile Slurm script (run_xPyD_models.slurm): - Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale image - Remove unreachable empty-string check - Remove duplicate NUM_NODES echo MoRI EP script (vllm_disagg_mori_ep.sh): - Refactor 4 near-identical vllm serve blocks into shared functions (setup_mori_env, build_kv_transfer_config, launch_vllm_worker, wait_for_proxy_and_cleanup, print_node_info), eliminating ~150 lines - Extract hardcoded ports into named variables at top of script - Fix typo "untill" -> "until" - Fix --gpu_memory_utilization to --gpu-memory-utilization for consistency - Remove commented-out compilation-config lines - Add graceful kill with `kill ... 2>/dev/null || true` Server script (vllm_disagg_server.sh): - Add comments noting DeepSeek-R1 shares architecture with V3 Made-with: Cursor

etcd is not needed for the MoRI EP disaggregation path. Remove: - etcd, etcd-server, etcd-client apt packages - etcd v3.6.0-rc.5 binary download and PATH entry - etcd-cpp-apiv3 source build Made-with: Cursor

This reverts commit 68b3935.

Remove etcd from both Dockerfile and vllm_disagg_server.sh: - Dockerfile: remove etcd/etcd-server/etcd-client apt packages, etcd binary download, and etcd-cpp-apiv3 source build - vllm_disagg_server.sh: remove ETCD Server Setup section (start_etcd, etcdctl health/status checks, barrier waits) and etcd_pid kill at exit Fix spelling typos: - "untill" -> "until" in vllm_disagg_server.sh and sglang_disagg_server.sh - "Dissagerated" -> "Disaggregated" in vllm_dissag/README.MD - "Disaggerated" -> "Disaggregated" in sglang_disagg/README.MD - "disggregation" -> "disaggregation" in README.md Note: the directory name scripts/vllm_dissag/ is also misspelled (should be vllm_disagg) but renaming it is a larger refactor best done in a separate PR. Made-with: Cursor

Made-with: Cursor

Port DeepEP server script and dispatch logic from MAD-private PR #208. Three run modes now available via RUN_MORI/RUN_DEEPEP flags with mutual exclusion. Adds host RDMA library bind-mounts, docker pull, and DeepEP env var passthrough to the slurm launcher. Updates README with all three modes, node topology, and DeepEP configuration reference. Made-with: Cursor

Made-with: Cursor

Expose BENCHMARK_ITR, BENCHMARK_CON, and BENCHMARK_COMBINATIONS as environment variables passed through slurm -> Docker -> benchmark script, matching sglang's pattern. Add a small warmup run before the main sweep to stabilize results. Made-with: Cursor

Refactors all three server scripts and the Slurm launcher to run the proxy/router on the same node as the Prefill master (rank 0), reducing the required node count from xP+yD+1 to xP+yD. Key changes: - NUM_NODES = xP + yD (was xP + yD + 1) - SBATCH defaults: -N 2 -n 2 (was -N 3 -n 3) - Rank 0 now runs both prefill vLLM serve and proxy/router - ROUTER_PORT defaults to 8001 (avoids collision with SERVER_PORT=2584) - IP array indexing shifted: prefill at [0..xP-1], decode at [xP..end] - DP start ranks adjusted: prefill = NODE_RANK*8, decode = (NODE_RANK-xP)*8 - Log file names updated: prefill_NODE0.log, decode_NODE{xP}.log Made-with: Cursor

…script Replace hard-fail checks with auto-detection via ibstat (matching the DeepEP script pattern) so the script works without external env vars. Made-with: Cursor

Kill stale listeners on the barrier port before starting the socket_barrier, matching the pattern already used in the DeepEP script. Prevents 'Address already in use' errors after failed runs. Made-with: Cursor

…ript The vLLM version in the Docker image does not support this flag, causing 'unrecognized arguments' errors on startup. Made-with: Cursor

…decode-url The current vllm-router API uses --prefill and --decode (without -url suffix and without /v1 path). Matches the DeepEP script's pattern. Made-with: Cursor

Port 8001 is occupied by a system service on the cluster nodes, causing 'Address already in use' errors when the co-located proxy tries to bind. Use 18001 which is available. Made-with: Cursor

Made-with: Cursor

raviguptaamd requested review from Rohan138, amathews-amd, coketaste, gargrahul and ppalaniappan-amd as code owners April 8, 2026 09:00

raviguptaamd force-pushed the fix/pr209-review-cleanup branch 2 times, most recently from b1eac03 to f85d120 Compare April 8, 2026 16:04

lcskrishna and others added 13 commits April 11, 2026 02:37

initial commit for mori scripts 1P1D

ec0db7b

add 2p2d settings

a052782

add unified dockerfile for deepep & mori

3b17d6a

update working dockerfile

bee5838

add pip install for mori proxy

ecbbe03

update mori with benchmarks and few minor changes

22cd00b

remove etcd from vllm_disagg Dockerfile

d0b3872

etcd is not needed for the MoRI EP disaggregation path. Remove: - etcd, etcd-server, etcd-client apt packages - etcd v3.6.0-rc.5 binary download and PATH entry - etcd-cpp-apiv3 source build Made-with: Cursor

Revert "remove etcd from vllm_disagg Dockerfile"

2cdfcfb

This reverts commit 68b3935.

upgrade vllm-router to latest version during image build

93a2ef5

Made-with: Cursor

increase Docker ulimit nofile to 512K

b1f726f

Made-with: Cursor

raviguptaamd force-pushed the fix/pr209-review-cleanup branch from 77790b3 to b1f726f Compare April 11, 2026 02:38

raviguptaamd added 8 commits April 11, 2026 03:07

auto-detect UCX_NET_DEVICES and NCCL_SOCKET_IFNAME in default server …

860d3ee

…script Replace hard-fail checks with auto-detection via ibstat (matching the DeepEP script pattern) so the script works without external env vars. Made-with: Cursor

add barrier port cleanup to default and MoRI server scripts

5b26d32

Kill stale listeners on the barrier port before starting the socket_barrier, matching the pattern already used in the DeepEP script. Prevents 'Address already in use' errors after failed runs. Made-with: Cursor

remove unsupported --disable-log-requests flag from default server sc…

4659033

…ript The vLLM version in the Docker image does not support this flag, causing 'unrecognized arguments' errors on startup. Made-with: Cursor

fix vllm-router flags: --prefill/--decode instead of --prefill-url/--…

d3d618d

…decode-url The current vllm-router API uses --prefill and --decode (without -url suffix and without /v1 path). Matches the DeepEP script's pattern. Made-with: Cursor

change default proxy port from 8001 to 18001

b8472da

Port 8001 is occupied by a system service on the cluster nodes, causing 'Address already in use' errors when the co-located proxy tries to bind. Use 18001 which is available. Made-with: Cursor

remove accidentally committed files (test report and sglang files)

d59002f

Made-with: Cursor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143
raviguptaamd wants to merge 21 commits intoROCm:developfrom
raviguptaamd:fix/pr209-review-cleanup

raviguptaamd commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

raviguptaamd commented Apr 8, 2026

Summary

Dockerfile

Slurm script (run_xPyD_models.slurm)

MoRI EP script (vllm_disagg_mori_ep.sh)

Server script (vllm_disagg_server.sh)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Slurm script (`run_xPyD_models.slurm`)

MoRI EP script (`vllm_disagg_mori_ep.sh`)

Server script (`vllm_disagg_server.sh`)