Skip to content

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143

Open
raviguptaamd wants to merge 12 commits intoROCm:developfrom
raviguptaamd:fix/pr209-review-cleanup
Open

[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults#143
raviguptaamd wants to merge 12 commits intoROCm:developfrom
raviguptaamd:fix/pr209-review-cleanup

Conversation

@raviguptaamd
Copy link
Copy Markdown
Contributor

Summary

Code review cleanup for MAD-private PR #209 (vLLM MoRI EP with PD disaggregation). Includes all original commits plus the following fixes.

Dockerfile

  • Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
  • Replace silent || true on vLLM build with a logged warning so failures are visible
  • Remove commented-out dead code (old base image ARG, CMAKE_PREFIX_PATH, git checkout)
  • Use consistent absolute path for versions.txt; add comment documenting base image dependency
  • Fix trailing whitespace
  • Combine version-tracking RUN layers into a single layer
  • Move py-spy and flask installs from runtime script into Dockerfile

Slurm script (run_xPyD_models.slurm)

  • Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale pre-MoRI image
  • Remove unreachable empty-string check after default assignment
  • Remove duplicate NUM_NODES echo

MoRI EP script (vllm_disagg_mori_ep.sh)

  • Major refactor: Extract 4 near-identical vllm serve blocks (~150 duplicated lines) into shared helper functions: setup_mori_env, build_kv_transfer_config, launch_vllm_worker, wait_for_proxy_and_cleanup, print_node_info
  • Extract 8 hardcoded port numbers into named variables at the top of the script
  • Fix typo "untill" -> "until" (3 occurrences)
  • Fix --gpu_memory_utilization to --gpu-memory-utilization for CLI arg consistency
  • Remove commented-out --compilation-config lines
  • Add graceful kill with kill ... 2>/dev/null || true

Server script (vllm_disagg_server.sh)

  • Add comments noting DeepSeek-R1 config shares architecture with V3

Test plan

  • Verify Docker image builds successfully with docker build -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile -t test:latest .
  • Run 1P1D DeepSeek-R1 MoRI EP benchmark to confirm refactored script produces identical behavior
  • Verify non-MoRI path (RUN_MORI=0) still works with DeepSeek-R1 via vllm_disagg_server.sh
  • Confirm DOCKER_IMAGE_NAME is now required (no silent default to wrong image)

lcskrishna and others added 11 commits April 8, 2026 16:04
…scripts

Dockerfile:
- Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
- Replace silent `|| true` on vLLM build with logged warning
- Remove commented-out dead code (old base image, CMAKE_PREFIX_PATH, git checkout)
- Use consistent absolute path for versions.txt, add comment documenting base image dependency
- Fix trailing whitespace on multiple lines
- Combine version-tracking RUN layers into single layer
- Move py-spy and flask installs from runtime into Dockerfile

Slurm script (run_xPyD_models.slurm):
- Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale image
- Remove unreachable empty-string check
- Remove duplicate NUM_NODES echo

MoRI EP script (vllm_disagg_mori_ep.sh):
- Refactor 4 near-identical vllm serve blocks into shared functions
  (setup_mori_env, build_kv_transfer_config, launch_vllm_worker,
   wait_for_proxy_and_cleanup, print_node_info), eliminating ~150 lines
- Extract hardcoded ports into named variables at top of script
- Fix typo "untill" -> "until"
- Fix --gpu_memory_utilization to --gpu-memory-utilization for consistency
- Remove commented-out compilation-config lines
- Add graceful kill with `kill ... 2>/dev/null || true`

Server script (vllm_disagg_server.sh):
- Add comments noting DeepSeek-R1 shares architecture with V3

Made-with: Cursor
etcd is not needed for the MoRI EP disaggregation path. Remove:
- etcd, etcd-server, etcd-client apt packages
- etcd v3.6.0-rc.5 binary download and PATH entry
- etcd-cpp-apiv3 source build

Made-with: Cursor
Remove etcd from both Dockerfile and vllm_disagg_server.sh:
- Dockerfile: remove etcd/etcd-server/etcd-client apt packages, etcd binary
  download, and etcd-cpp-apiv3 source build
- vllm_disagg_server.sh: remove ETCD Server Setup section (start_etcd,
  etcdctl health/status checks, barrier waits) and etcd_pid kill at exit

Fix spelling typos:
- "untill" -> "until" in vllm_disagg_server.sh and sglang_disagg_server.sh
- "Dissagerated" -> "Disaggregated" in vllm_dissag/README.MD
- "Disaggerated" -> "Disaggregated" in sglang_disagg/README.MD
- "disggregation" -> "disaggregation" in README.md

Note: the directory name scripts/vllm_dissag/ is also misspelled (should be
vllm_disagg) but renaming it is a larger refactor best done in a separate PR.

Made-with: Cursor
@raviguptaamd raviguptaamd force-pushed the fix/pr209-review-cleanup branch from b1eac03 to f85d120 Compare April 8, 2026 16:04
Port DeepEP server script and dispatch logic from MAD-private PR #208.
Three run modes now available via RUN_MORI/RUN_DEEPEP flags with mutual
exclusion. Adds host RDMA library bind-mounts, docker pull, and
DeepEP env var passthrough to the slurm launcher. Updates README with
all three modes, node topology, and DeepEP configuration reference.

Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants