Skip to content

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143

Open
raviguptaamd wants to merge 21 commits intoROCm:developfrom
raviguptaamd:fix/pr209-review-cleanup
Open

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143
raviguptaamd wants to merge 21 commits intoROCm:developfrom
raviguptaamd:fix/pr209-review-cleanup

Conversation

@raviguptaamd
Copy link
Copy Markdown
Contributor

Summary

Code review cleanup for MAD-private PR #209 (vLLM MoRI EP with PD disaggregation). Includes all original commits plus the following fixes.

Dockerfile

  • Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
  • Replace silent || true on vLLM build with a logged warning so failures are visible
  • Remove commented-out dead code (old base image ARG, CMAKE_PREFIX_PATH, git checkout)
  • Use consistent absolute path for versions.txt; add comment documenting base image dependency
  • Fix trailing whitespace
  • Combine version-tracking RUN layers into a single layer
  • Move py-spy and flask installs from runtime script into Dockerfile

Slurm script (run_xPyD_models.slurm)

  • Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale pre-MoRI image
  • Remove unreachable empty-string check after default assignment
  • Remove duplicate NUM_NODES echo

MoRI EP script (vllm_disagg_mori_ep.sh)

  • Major refactor: Extract 4 near-identical vllm serve blocks (~150 duplicated lines) into shared helper functions: setup_mori_env, build_kv_transfer_config, launch_vllm_worker, wait_for_proxy_and_cleanup, print_node_info
  • Extract 8 hardcoded port numbers into named variables at the top of the script
  • Fix typo "untill" -> "until" (3 occurrences)
  • Fix --gpu_memory_utilization to --gpu-memory-utilization for CLI arg consistency
  • Remove commented-out --compilation-config lines
  • Add graceful kill with kill ... 2>/dev/null || true

Server script (vllm_disagg_server.sh)

  • Add comments noting DeepSeek-R1 config shares architecture with V3

Test plan

  • Verify Docker image builds successfully with docker build -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile -t test:latest .
  • Run 1P1D DeepSeek-R1 MoRI EP benchmark to confirm refactored script produces identical behavior
  • Verify non-MoRI path (RUN_MORI=0) still works with DeepSeek-R1 via vllm_disagg_server.sh
  • Confirm DOCKER_IMAGE_NAME is now required (no silent default to wrong image)

@raviguptaamd raviguptaamd force-pushed the fix/pr209-review-cleanup branch 2 times, most recently from b1eac03 to f85d120 Compare April 8, 2026 16:04
lcskrishna and others added 13 commits April 11, 2026 02:37
…scripts

Dockerfile:
- Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
- Replace silent `|| true` on vLLM build with logged warning
- Remove commented-out dead code (old base image, CMAKE_PREFIX_PATH, git checkout)
- Use consistent absolute path for versions.txt, add comment documenting base image dependency
- Fix trailing whitespace on multiple lines
- Combine version-tracking RUN layers into single layer
- Move py-spy and flask installs from runtime into Dockerfile

Slurm script (run_xPyD_models.slurm):
- Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale image
- Remove unreachable empty-string check
- Remove duplicate NUM_NODES echo

MoRI EP script (vllm_disagg_mori_ep.sh):
- Refactor 4 near-identical vllm serve blocks into shared functions
  (setup_mori_env, build_kv_transfer_config, launch_vllm_worker,
   wait_for_proxy_and_cleanup, print_node_info), eliminating ~150 lines
- Extract hardcoded ports into named variables at top of script
- Fix typo "untill" -> "until"
- Fix --gpu_memory_utilization to --gpu-memory-utilization for consistency
- Remove commented-out compilation-config lines
- Add graceful kill with `kill ... 2>/dev/null || true`

Server script (vllm_disagg_server.sh):
- Add comments noting DeepSeek-R1 shares architecture with V3

Made-with: Cursor
etcd is not needed for the MoRI EP disaggregation path. Remove:
- etcd, etcd-server, etcd-client apt packages
- etcd v3.6.0-rc.5 binary download and PATH entry
- etcd-cpp-apiv3 source build

Made-with: Cursor
Remove etcd from both Dockerfile and vllm_disagg_server.sh:
- Dockerfile: remove etcd/etcd-server/etcd-client apt packages, etcd binary
  download, and etcd-cpp-apiv3 source build
- vllm_disagg_server.sh: remove ETCD Server Setup section (start_etcd,
  etcdctl health/status checks, barrier waits) and etcd_pid kill at exit

Fix spelling typos:
- "untill" -> "until" in vllm_disagg_server.sh and sglang_disagg_server.sh
- "Dissagerated" -> "Disaggregated" in vllm_dissag/README.MD
- "Disaggerated" -> "Disaggregated" in sglang_disagg/README.MD
- "disggregation" -> "disaggregation" in README.md

Note: the directory name scripts/vllm_dissag/ is also misspelled (should be
vllm_disagg) but renaming it is a larger refactor best done in a separate PR.

Made-with: Cursor
Port DeepEP server script and dispatch logic from MAD-private PR #208.
Three run modes now available via RUN_MORI/RUN_DEEPEP flags with mutual
exclusion. Adds host RDMA library bind-mounts, docker pull, and
DeepEP env var passthrough to the slurm launcher. Updates README with
all three modes, node topology, and DeepEP configuration reference.

Made-with: Cursor
@raviguptaamd raviguptaamd force-pushed the fix/pr209-review-cleanup branch from 77790b3 to b1f726f Compare April 11, 2026 02:38
Expose BENCHMARK_ITR, BENCHMARK_CON, and BENCHMARK_COMBINATIONS as
environment variables passed through slurm -> Docker -> benchmark
script, matching sglang's pattern. Add a small warmup run before the
main sweep to stabilize results.

Made-with: Cursor
Refactors all three server scripts and the Slurm launcher to run the
proxy/router on the same node as the Prefill master (rank 0), reducing
the required node count from xP+yD+1 to xP+yD.

Key changes:
- NUM_NODES = xP + yD (was xP + yD + 1)
- SBATCH defaults: -N 2 -n 2 (was -N 3 -n 3)
- Rank 0 now runs both prefill vLLM serve and proxy/router
- ROUTER_PORT defaults to 8001 (avoids collision with SERVER_PORT=2584)
- IP array indexing shifted: prefill at [0..xP-1], decode at [xP..end]
- DP start ranks adjusted: prefill = NODE_RANK*8, decode = (NODE_RANK-xP)*8
- Log file names updated: prefill_NODE0.log, decode_NODE{xP}.log

Made-with: Cursor
…script

Replace hard-fail checks with auto-detection via ibstat (matching the
DeepEP script pattern) so the script works without external env vars.

Made-with: Cursor
Kill stale listeners on the barrier port before starting the
socket_barrier, matching the pattern already used in the DeepEP
script. Prevents 'Address already in use' errors after failed runs.

Made-with: Cursor
…ript

The vLLM version in the Docker image does not support this flag,
causing 'unrecognized arguments' errors on startup.

Made-with: Cursor
…decode-url

The current vllm-router API uses --prefill and --decode (without -url
suffix and without /v1 path). Matches the DeepEP script's pattern.

Made-with: Cursor
Port 8001 is occupied by a system service on the cluster nodes,
causing 'Address already in use' errors when the co-located proxy
tries to bind. Use 18001 which is available.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants