[DistInf] Cleanup PR#209: fix Dockerfile, refactor MoRI EP script, fix slurm defaults #143
Open
raviguptaamd wants to merge 12 commits into ROCm:develop from
d62ca9c to 010da4c
…scripts

Dockerfile:
- Fix sed pattern to avoid corrupting https URLs (s/http/https -> s|http://|https://|)
- Replace silent `|| true` on vLLM build with logged warning
- Remove commented-out dead code (old base image, CMAKE_PREFIX_PATH, git checkout)
- Use consistent absolute path for versions.txt, add comment documenting base image dependency
- Fix trailing whitespace on multiple lines
- Combine version-tracking RUN layers into single layer
- Move py-spy and flask installs from runtime into Dockerfile

Slurm script (run_xPyD_models.slurm):
- Require explicit DOCKER_IMAGE_NAME instead of defaulting to stale image
- Remove unreachable empty-string check
- Remove duplicate NUM_NODES echo

MoRI EP script (vllm_disagg_mori_ep.sh):
- Refactor 4 near-identical vllm serve blocks into shared functions (setup_mori_env, build_kv_transfer_config, launch_vllm_worker, wait_for_proxy_and_cleanup, print_node_info), eliminating ~150 lines
- Extract hardcoded ports into named variables at top of script
- Fix typo "untill" -> "until"
- Fix --gpu_memory_utilization to --gpu-memory-utilization for consistency
- Remove commented-out compilation-config lines
- Add graceful kill with `kill ... 2>/dev/null || true`

Server script (vllm_disagg_server.sh):
- Add comments noting DeepSeek-R1 shares architecture with V3

Made-with: Cursor
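The sed fix in the Dockerfile list above can be illustrated with a small sketch. The sample lines are hypothetical stand-ins for whatever file the Dockerfile rewrites; only the two sed expressions come from the commit message.

```shell
# Hypothetical sample input: one plain-http line and one line that is already https.
sample='deb http://example.com/ubuntu jammy main
deb https://example.org/ubuntu jammy main'

# Old pattern: "http" also matches the first four characters of "https",
# so a line that is already https ends up as "httpss".
old_result=$(printf '%s\n' "$sample" | sed 's/http/https/')

# New pattern: matches only the full "http://" scheme, so https lines are untouched.
new_result=$(printf '%s\n' "$sample" | sed 's|http://|https://|')

printf 'old pattern:\n%s\n\nnew pattern:\n%s\n' "$old_result" "$new_result"
```

Using `|` as the sed delimiter avoids having to escape the slashes in `http://`.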
etcd is not needed for the MoRI EP disaggregation path. Remove:
- etcd, etcd-server, etcd-client apt packages
- etcd v3.6.0-rc.5 binary download and PATH entry
- etcd-cpp-apiv3 source build

Made-with: Cursor
This reverts commit 68b3935.
Remove etcd from both Dockerfile and vllm_disagg_server.sh:
- Dockerfile: remove etcd/etcd-server/etcd-client apt packages, etcd binary download, and etcd-cpp-apiv3 source build
- vllm_disagg_server.sh: remove ETCD Server Setup section (start_etcd, etcdctl health/status checks, barrier waits) and etcd_pid kill at exit

Fix spelling typos:
- "untill" -> "until" in vllm_disagg_server.sh and sglang_disagg_server.sh
- "Dissagerated" -> "Disaggregated" in vllm_dissag/README.MD
- "Disaggerated" -> "Disaggregated" in sglang_disagg/README.MD
- "disggregation" -> "disaggregation" in README.md

Note: the directory name scripts/vllm_dissag/ is also misspelled (should be vllm_disagg), but renaming it is a larger refactor best done in a separate PR.

Made-with: Cursor
Made-with: Cursor
b1eac03 to f85d120
Port DeepEP server script and dispatch logic from MAD-private PR #208. Three run modes are now available via the RUN_MORI/RUN_DEEPEP flags, which are mutually exclusive. Adds host RDMA library bind-mounts, docker pull, and DeepEP env var passthrough to the slurm launcher. Updates README with all three modes, node topology, and DeepEP configuration reference.

Made-with: Cursor
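A minimal sketch of how two mutually exclusive flags can select one of three run modes. The function name `select_run_mode`, the `1`/`0` flag values, and the mode labels are all assumptions for illustration; the actual dispatch logic in the launcher may differ.

```shell
# Hypothetical helper: takes the RUN_MORI and RUN_DEEPEP flag values as
# arguments (assumed to be "1" for enabled) and prints the selected mode.
select_run_mode() {
    mori_flag="${1:-0}"
    deepep_flag="${2:-0}"

    if [ "$mori_flag" = "1" ] && [ "$deepep_flag" = "1" ]; then
        # Both flags set: refuse to guess which mode was intended.
        echo "ERROR: RUN_MORI and RUN_DEEPEP are mutually exclusive" >&2
        return 1
    elif [ "$mori_flag" = "1" ]; then
        echo "mori"
    elif [ "$deepep_flag" = "1" ]; then
        echo "deepep"
    else
        # Neither flag set: fall back to the third (default) mode.
        echo "default"
    fi
}
```

A launcher could call it as `mode=$(select_run_mode "${RUN_MORI:-0}" "${RUN_DEEPEP:-0}") || exit 1` and then branch on `$mode`.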
Summary
Code review cleanup for MAD-private PR #209 (vLLM MoRI EP with PD disaggregation). Includes all original commits plus the following fixes.

Dockerfile
- Fix `sed` pattern to avoid corrupting `https` URLs (`s/http/https` -> `s|http://|https://|`)
- Replace silent `|| true` on vLLM build with a logged warning so failures are visible
- Use consistent absolute path for `versions.txt`; add comment documenting base image dependency
- Combine version-tracking `RUN` layers into a single layer
- Move `py-spy` and `flask` installs from runtime script into Dockerfile

Slurm script (`run_xPyD_models.slurm`)
- Require explicit `DOCKER_IMAGE_NAME` instead of defaulting to stale pre-MoRI image
- Remove duplicate `NUM_NODES` echo

MoRI EP script (`vllm_disagg_mori_ep.sh`)
- Refactor 4 near-identical `vllm serve` blocks (~150 duplicated lines) into shared helper functions: `setup_mori_env`, `build_kv_transfer_config`, `launch_vllm_worker`, `wait_for_proxy_and_cleanup`, `print_node_info`
- Fix `--gpu_memory_utilization` to `--gpu-memory-utilization` for CLI arg consistency
- Remove commented-out `--compilation-config` lines
- Add graceful kill with `kill ... 2>/dev/null || true`

Server script (`vllm_disagg_server.sh`)
- Add comments noting DeepSeek-R1 shares architecture with V3

Test plan
- `docker build -f docker/vllm_disagg_inference.ubuntu.amd.Dockerfile -t test:latest .`
- … `vllm_disagg_server.sh`
- `DOCKER_IMAGE_NAME` is now required (no silent default to wrong image)
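The "required, no silent default" behavior for `DOCKER_IMAGE_NAME` could look roughly like the sketch below. The function name `require_docker_image` and the messages are hypothetical; the actual check in `run_xPyD_models.slurm` is not shown in this PR description.

```shell
# Hypothetical guard: fail loudly when DOCKER_IMAGE_NAME is unset or empty
# instead of falling back to a default image.
require_docker_image() {
    if [ -z "${DOCKER_IMAGE_NAME:-}" ]; then
        echo "ERROR: DOCKER_IMAGE_NAME must be set explicitly" >&2
        return 1
    fi
    echo "Using image: ${DOCKER_IMAGE_NAME}"
}
```

A shorter alternative is the POSIX parameter expansion `: "${DOCKER_IMAGE_NAME:?must be set}"`, which aborts a non-interactive script with the given message when the variable is unset or empty.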