feat: MI300X disaggregated inference with Broadcom IBGDA (#982) #998
JordanNanos wants to merge 24 commits into main
Conversation
Port MI325X disagg+BNXT changes to MI300X cluster (3 working nodes: mi300x-amds_0/2/3). MI300X is gfx942, identical to MI325X, using the same Vultr/CPE Broadcom Thor 2 bnxt_re NICs.

Changes:
- Add mi300x-disagg runner group (amds_0/2/3) to runners.yaml
- Add dsr1-fp8-mi300x-sglang-disagg and -disagg-mtp benchmark configs
- Add dsr1-fp8-mi300x-sglang-mtp single-node MTP benchmark config
- Add chi-mi300x-* NIC detection (bnxt_re0-7) and QoS (TC=104, SL=3)
- Add dsr1_fp8_mi300x_sglang-disagg.sh multi-node benchmark script
- Add MTP support to single-node dsr1_fp8_mi300x.sh
- Port launch_mi300x-amds.sh to full multi-node launcher with MODEL_YAML_KEY
- Add docker/mi300x.Dockerfile + build script (registry API retag)
- Add scripts/manual-test-mi300x.sh
- Update perf-changelog.yaml

Image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
runners/launch_mi300x-amds.sh
Outdated
```shell
JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}")

LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.out"

sleep 10

while ! ls "$LOG_FILE" &>/dev/null; do
    if ! squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"; then
        echo "ERROR: Job $JOB_ID failed before creating log file"
        scontrol show job "$JOB_ID"
        exit 1
    fi
    sleep 5
done

set +x

(
    while squeue -u $USER --noheader --format='%i' | grep -q "$JOB_ID"; do
```
🔴 The multi-node path in runners/launch_mi300x-amds.sh captures JOB_ID from the benchmark script (line 90) without checking that the value is non-empty; if the inner script fails silently, JOB_ID is empty, LOG_FILE becomes slurm_job-.out (never created), and the wait loop spins forever. Compounding this, both grep -q "$JOB_ID" calls (inside the wait-for-log loop and the polling subshell) use unanchored substring matching, so an empty JOB_ID matches every squeue line, and a valid short ID like 1234 can spuriously match a sibling job 12345, causing the polling loop to keep running after the real job finishes. Fix: add if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi immediately after line 90, and change both grep calls to grep -qx "$JOB_ID" (exact line match).
Extended reasoning...
Bug 1 – Missing empty JOB_ID guard (primary)
At the point where JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}") executes in the multi-node path, the exit code of the inner script is silently ignored (no set -e is in effect at that point, and there is no explicit || guard). If dsr1_fp8_mi300x_sglang-disagg.sh calls submit.sh, which can exit 1 on failure, the outer script's command substitution captures an empty string into JOB_ID. At that point execution continues normally. The single-node else branch (added in the same PR) does include if [ -z "$JOB_ID" ]; then exit 1; fi after its own salloc capture, but the new multi-node if branch omits the equivalent check entirely.
How the infinite loop manifests
After the empty-JOB_ID assignment: LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out" — a file that never exists. The outer `while ! ls "$LOG_FILE"` loop therefore runs indefinitely. Inside the loop, `squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"` becomes `grep -q ""`, and an empty pattern matches every non-empty line of input, so as long as the user has any SLURM job running, the `! grep -q ""` condition is always false and the early-exit branch never fires. The loop spins at 5-second intervals until all other user jobs happen to drain from the queue—which could take hours or days in a shared CI cluster.
Bug 2 – Unanchored grep causes false positive job-ID matches
Both grep calls use grep -q "$JOB_ID" against squeue --format='%i' output, which emits one integer job ID per line. grep matches substrings, not full lines. SLURM job IDs are sequentially assigned integers; if the current job is ID 123456 and another user job 1234567 is also in the queue, grep -q "123456" matches the line 1234567. Consequence at line 97 (wait-for-log loop): the early-exit guard never fires even after job 123456 fails and leaves the queue, so the loop waits forever for a log file that was never written. Consequence at line 108 (polling subshell): the while squeue … | grep -q "$JOB_ID" loop keeps looping after the real job exits because the sibling job keeps matching; POLL_PID never exits, so tail -F --pid=$POLL_PID keeps tailing, and wait $POLL_PID blocks indefinitely.
Step-by-step proof of the empty-JOB_ID infinite loop
1. `submit.sh` inside `dsr1_fp8_mi300x_sglang-disagg.sh` exits non-zero (e.g., cluster unreachable).
2. The benchmark script echoes nothing to stdout and exits 1.
3. `JOB_ID=$(bash "benchmarks/multi_node/…sh")` — JOB_ID = "".
4. No guard checks JOB_ID; execution continues.
5. `LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out"` — a nonexistent path.
6. Outer loop: `while ! ls "slurm_job-.out"` — always true.
7. Inner check: `squeue -u "$USER" --noheader --format='%i' | grep -q ""` — an empty pattern matches every non-empty line; as long as the cluster has any user job, `! grep -q ""` is false, so `exit 1` is never reached.
8. Loop spins at 5 s intervals indefinitely. CI run hangs.
Fix
After the JOB_ID assignment: if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi. Replace both grep calls with grep -qx "$JOB_ID" (or grep -q "^${JOB_ID}$") to enforce exact-line matching and prevent substring collisions.
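The two fixes together, as a minimal sketch. The hard-coded `JOB_ID` and the `printf`-simulated `squeue` output are illustrative only; the real launcher captures the ID from the benchmark script and queries SLURM:

```shell
#!/usr/bin/env bash
# Guard: fail fast if the benchmark script produced no job ID.
JOB_ID="1234"   # normally: JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}")
if [[ -z "$JOB_ID" ]]; then
  echo "ERROR: benchmark script produced no job ID" >&2
  exit 1
fi

# Simulated `squeue --noheader --format='%i'` output: a sibling job 12345
# sits in the queue alongside our job 1234.
squeue_out=$'12345\n1234'

# Unanchored match: the sibling 12345 also matches pattern "1234".
printf '%s\n' "$squeue_out" | grep -q  "$JOB_ID" && echo "substring: matched"
# Exact-line match (-x): only the real job ID line matches.
printf '%s\n' "$squeue_out" | grep -qx "$JOB_ID" && echo "exact: matched"

# With -x, once job 1234 leaves the queue, the sibling no longer
# keeps the polling loop alive:
printf '12345\n' | grep -qx "$JOB_ID" || echo "exact: job gone, loop exits"
```

The same `grep -qx` swap applies to both call sites (the wait-for-log loop and the polling subshell).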
perf-changelog.yaml
Outdated
```yaml
- "Image (single-node): lmsysorg/sglang:v0.5.9-rocm700-mi30x"
- "Full pareto sweep: non-MTP and MTP configs across 2 curve points (no EP/DP), ISL 1k/1k and 8k/1k"
- "Runners: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3 (3 working nodes)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982
```
🟡 The pr-link field in the new perf-changelog.yaml entry points to issues/982 (the tracking issue) instead of the implementing PR URL. All other non-placeholder entries in the file use the /pull/<number> format; the correct link for this PR should be /pull/998.
Extended reasoning...
What the bug is: The newly added perf-changelog.yaml entry (lines 1-15) sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982. The field is named pr-link and is intended to reference the pull request that implements the change, not the issue that motivated it.
The specific code path: Line 13 of perf-changelog.yaml reads `pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982`. Every other properly-numbered entry in the file uses the `/pull/<number>` format (e.g., `/pull/919`, `/pull/857`, `/pull/95`). When a PR number is not yet known, the convention in this file is to use `/pull/XXX` as a placeholder (as in the entry for kimik2.5-int4-mi300x-vllm directly below).
Why existing code does not prevent it: There is no automated validation on pr-link values in the changelog; it is a free-text YAML field, so incorrect URLs are not caught at commit time.
Impact: Anyone reviewing the changelog will follow the pr-link to GitHub issue #982 (the tracking issue) rather than to this implementing PR (#998). This breaks the intended traceability — readers cannot navigate from the changelog entry to the code change itself.
How to fix it: Change line 13 from:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982to:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/998Step-by-step proof:
- The PR description states "Closes starter task: MVP port mi355 deepseek disagg recipe to mi300 #982", meaning starter task: MVP port mi355 deepseek disagg recipe to mi300 #982 is the issue being closed, not the PR number.
- This PR is feat: MI300X disaggregated inference with Broadcom IBGDA (#982) #998 (confirmed by
<pr number="998">in the metadata). - Scanning perf-changelog.yaml, every entry with a known PR number uses the
/pull/<number>format. - Entries with unknown PR numbers use
/pull/XXXas a placeholder. - The new entry is the only one using an
/issues/URL, making it factually incorrect metadata.
On the duplicate claim: One verifier noted this may overlap with bug_005. The underlying bug is real and independently confirmed by three verifiers; the issue stands on its own merits.
```shell
#SBATCH --time=00:15:00
#SBATCH --output=/home/j9s/build-mi300x-%j.log
#SBATCH --error=/home/j9s/build-mi300x-%j.log
#SBATCH --chdir=/tmp

set -euo pipefail
bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh
```
🟡 docker/build-sglang-bnxt-mi300x.sbatch hardcodes personal paths for user j9s (log dirs under /home/j9s/ and script path under /nfsdata/sa/j9s/), making the file unusable by any other team member. Fix by replacing the log paths with a shared or home-relative path (e.g. $HOME or %u in the #SBATCH directive) and the script path with a SLURM_SUBMIT_DIR-relative reference.
Extended reasoning...
Bug: Hardcoded personal user paths in build-sglang-bnxt-mi300x.sbatch
What the bug is and how it manifests
docker/build-sglang-bnxt-mi300x.sbatch is a new file introduced by this PR that contains three hardcoded absolute paths belonging to a single cluster user (j9s):
- Line 9: `#SBATCH --output=/home/j9s/build-mi300x-%j.log`
- Line 10: `#SBATCH --error=/home/j9s/build-mi300x-%j.log`
- Line 14: `bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh`
Any other team member who runs sbatch docker/build-sglang-bnxt-mi300x.sbatch will get an immediate failure: SLURM will reject the job (or fail silently at output redirection) because /home/j9s/ does not exist for them, and the bash invocation will fail with "No such file or directory" for the NFS path.
The specific code path that triggers it
The job is submitted with sbatch. SLURM processes the #SBATCH --output and --error directives before the script body runs, attempting to open the log files under /home/j9s/. Then line 14 executes bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh, which resolves to a personal NFS home directory on the cluster.
Why existing code doesn't prevent it
The #SBATCH directives are evaluated statically by SLURM at submission time; environment variables like $HOME are not expanded in #SBATCH comment lines. However, SLURM does support the %u token (username expansion) in output/error paths, which would make them portable. The command on line 14 is a normal shell line and could use ${SLURM_SUBMIT_DIR} or a path relative to --chdir to reference the script in the repo, but --chdir=/tmp was set on line 11, making relative paths from the repo root unavailable without an explicit variable.
What the impact would be
Any team member other than j9s who tries to rebuild the MI300X image by submitting this batch job will get a broken job. The log output will be lost (or SLURM may refuse submission), and the actual build script will not be found, so the retag operation cannot complete. While this is a developer convenience script and the image is already published, the file is committed to a shared repository and gives the false impression that it is reusable.
How to fix it
- Replace the log paths with a shared or user-relative path, e.g.:

  ```
  #SBATCH --output=%u-build-mi300x-%j.log
  #SBATCH --error=%u-build-mi300x-%j.log
  ```

  Or use a writable shared directory such as `/tmp/%u-build-mi300x-%j.log`.

- Replace the hardcoded script path with one derived from the submission directory:

  ```
  bash "${SLURM_SUBMIT_DIR}/build-sglang-bnxt-mi300x.sh"
  ```

  (changing `--chdir` to something other than /tmp, or removing it and submitting from the repo root).
Step-by-step proof:

1. User alice clones the repo and runs: `sbatch docker/build-sglang-bnxt-mi300x.sbatch`
2. SLURM reads `--output=/home/j9s/build-mi300x-%j.log` — /home/j9s does not exist for alice, so SLURM may fail to open the log file or silently discard output.
3. Even if the job starts, line 14 runs `bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh` — this NFS path is in j9s's personal directory and inaccessible to alice → `bash: /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh: No such file or directory`, exit code 1.
4. The build fails immediately without executing any docker registry logic.
- launch_mi300x-amds.sh: guard empty JOB_ID after benchmark script, use grep -qx for exact-line matching to prevent substring false positives
- perf-changelog.yaml: fix pr-link from issues/982 to pull/998
- build-sglang-bnxt-mi300x.sbatch: replace hardcoded j9s paths with %u (SLURM username token) and SLURM_SUBMIT_DIR
MI300X uses a node-local HF cache (/home/gharunner/gharunners/hf-hub-cache/) that is not accessible from the controller where the launcher runs. This causes MODEL_NAME to fall back to the bare name (e.g. DeepSeek-R1-0528) instead of the full snapshot path. Add HF cache resolution inside job.slurm which runs on the compute node where the cache actually exists.
.github/configs/amd-master.yaml
Outdated
```yaml
dsr1-fp8-mi300x-sglang-disagg:
  image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt
```
@JordanNanos check ur email for access to https://hub.docker.com/u/semianalysiswork
nit: can u move this to our official SemiAnalysisAI public docker hub
also nit: why is the mi325 disagg image different from the mi300 disagg image? mi325 & mi300 are both cdna3 architecture; just as h100 & h200 share the same architecture/stack, a single image should work for both mi300 & mi325.
yeah it's the same image, just changed the tag
MODEL_YAML_KEY only contains the repo name (DeepSeek-R1-0528) without
the org prefix. Globbing models--*--{MODEL_NAME} finds the correct
HF cache dir (e.g. models--deepseek-ai--DeepSeek-R1-0528) regardless
of org. The model exists on MI300X nodes (confirmed by successful
single-node runs) but was not found due to the incomplete dir name.
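A sketch of that glob-based resolution. The `HF_CACHE` default and the fallback behavior here are illustrative assumptions, not the exact job.slurm code; the real script runs on the compute node where the cache exists:

```shell
#!/usr/bin/env bash
# Resolve the full HF cache dir for a bare model name, regardless of org.
HF_CACHE="${HF_CACHE:-/home/gharunner/gharunners/hf-hub-cache}"  # illustrative
MODEL_NAME="DeepSeek-R1-0528"   # MODEL_YAML_KEY carries only the repo name

# models--*--<name> matches e.g. models--deepseek-ai--DeepSeek-R1-0528
shopt -s nullglob
matches=("$HF_CACHE"/models--*--"$MODEL_NAME")
if (( ${#matches[@]} == 1 )); then
  MODEL_PATH="${matches[0]}"
  echo "Resolved: $MODEL_PATH"
else
  # No (or ambiguous) cache dir: fall back to the bare name.
  MODEL_PATH="$MODEL_NAME"
  echo "Fallback: $MODEL_PATH"
fi
```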
- Use single shared image (semianalysiswork/sgl-bnxt-cdna3:v0.5.9-bnxt) for both MI300X and MI325X disagg configs — MI300X (gfx942) and MI325X (gfx942) share the same CDNA3 architecture and the same image works for both
- Move image from personal GHCR (ghcr.io/jordannanos/) to official docker.io/semianalysiswork/ registry
- Rewrite build-sglang-bnxt-mi300x.sh to use crane for cross-registry copy (GHCR → Docker Hub); add crane auto-download to sbatch wrapper
- Add push-to-dockerhub.py / .sbatch as reference implementation

Addresses review comments on PR #998 from functionstackx.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
In job 4729, nodes 057/058 were temporarily overloaded (likely from
prior docker pulls), causing every srun step targeting those nodes to
be CANCELLED after ~17s with exit code 0:53. This resulted in empty
IPADDRS ("104.207.141.247,,") and a 63-minute wasted job where the
main docker srun was also cancelled.
Changes:
- check_model_path: check path on batch node directly (NFS is shared)
instead of srun-ing to all nodes
- IP gathering: use scontrol to read NodeAddr instead of srun+ip-route;
fail loudly with a clear error if scontrol can't resolve an IP
- NFS refresh srun: add || echo "[WARN]" so a cancellation doesn't abort
the batch script via set -e
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
**Multi-node MI300X CI status**

All review comments from claude[bot] and functionstackx have been addressed (image on semianalysiswork Docker Hub, same CDNA3 image for MI300X and MI325X, sbatch paths fixed, JOB_ID guard, pr-link).

The multi-node disagg CI jobs are failing due to a pre-existing cluster infrastructure issue: every srun step is cancelled. Diagnosis:
The cluster-level srun issue needs to be investigated/fixed separately by the cluster admins.
/sweep

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23975459520

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23975486262

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23992085651

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23993399652
While node 057 is being rebooted, run only the low-concurrency (1P+1D) configs that fit on 2 nodes. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23993923088

/sweep test-config --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node --config-files .github/configs/amd-master.yaml

@JordanNanos Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23994337726
Worker nodes only have munge installed for root. Jobs submitted by gharunner run as gharunner on worker nodes, causing srun step launches to fail with auth/munge errors. Using sudo sbatch ensures the job runs as root where munge is available. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
sbatch is now run via sudo, so jobs are owned by root rather than gharunner. squeue -u gharunner can't see root-owned jobs, causing the launcher to immediately declare the job failed. Use squeue -j JOB_ID which works regardless of job owner. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
sudo resets the environment by default (env_reset in sudoers), so all exported vars (MODEL_YAML_KEY, MODEL_PATH, CONTAINER_IMAGE, etc.) were lost and job.slurm received empty model name. -E preserves the caller's environment through sudo. Also includes the squeue -j fix from the previous commit that wasn't picked up by the last run due to runner caching. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
MODEL_NAME can be a HF cache path like models--org--repo/snapshots/<hash> which contains '/' characters that are invalid in Docker container names. Apply the same tr sanitization already used for USER_NAME. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
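A sketch of that sanitization. The exact `tr` mapping is an assumption (the commit says it reuses the call already applied to USER_NAME); here '/' is simply mapped to '-':

```shell
#!/usr/bin/env bash
# Docker container names may not contain '/'; HF cache paths like
# models--org--repo/snapshots/<hash> do.
MODEL_NAME='models--deepseek-ai--DeepSeek-R1-0528/snapshots/abc123'

# Replace every '/' so the value is a legal container-name component.
SAFE_NAME=$(echo "$MODEL_NAME" | tr '/' '-')
echo "$SAFE_NAME"
# → models--deepseek-ai--DeepSeek-R1-0528-snapshots-abc123
```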
If a worker node has a slow or missing NFS mount, the ls/stat/cat calls block forever. Add --timeout=30 to srun and timeout 10 to each individual command so the non-fatal NFS refresh step can't hold up the entire job. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
srun does not have a --timeout option; wrap the entire srun call with the timeout(1) command instead. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
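The pattern from the two commits above, sketched with `sleep` standing in for the real `srun` call (srun has no `--timeout` flag, so GNU timeout(1) bounds the whole invocation; timeout exits 124 when the limit is hit):

```shell
#!/usr/bin/env bash
# `sleep 5` stands in for: srun --nodelist=... bash -c '...'
if timeout 1 sleep 5; then
  echo "completed"
else
  rc=$?
  if [[ $rc -eq 124 ]]; then
    # timeout(1) killed the command at the time limit.
    echo "[WARN] step timed out (non-fatal)"
  else
    echo "[WARN] step failed with rc=$rc (non-fatal)"
  fi
fi
```

The trailing `|| echo "[WARN] ..."` style in the batch script serves the same purpose: a cancelled or timed-out non-fatal step does not abort the job via `set -e`.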
DeepSeek-R1-0528 FP8 requires more VRAM than the original R1. With TP4 (4x192GB = 768GB) the prefill server OOMs even at mem_fraction_ static=0.8. Bump to TP8 (8x192GB = 1536GB) to give adequate headroom. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The chi-mi300x nodes have model weights at /vfs/models_blog/ but the HF hub cache at /home/gharunner/.../hf-hub-cache has snapshot symlinks pointing to missing blob files (metadata only). This caused SGLang to crash loading shard 35+ with FileNotFoundError. After selecting MODEL_PATH, verify that model-00001 is readable. If not (broken symlinks), strip the date suffix from MODEL_YAML_KEY and search /vfs/models_blog/ for a matching directory with actual weight files. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
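A sketch of that verify-then-fallback logic. The directory defaults, variable names, and suffix-stripping pattern are illustrative assumptions, not the exact job.slurm code:

```shell
#!/usr/bin/env bash
# After selecting MODEL_PATH from the HF hub cache, verify the first
# weight shard is actually readable (snapshot symlinks can point at
# missing blobs when only metadata was synced).
MODEL_PATH="${MODEL_PATH:-/tmp/fake-hf-cache/snapshot}"   # illustrative
MODEL_YAML_KEY="${MODEL_YAML_KEY:-DeepSeek-R1-0528}"
FALLBACK_ROOT="${FALLBACK_ROOT:-/vfs/models_blog}"        # illustrative

shard=$(ls "$MODEL_PATH"/model-00001* 2>/dev/null | head -n1)
if [[ -n "$shard" ]] && timeout 10 head -c1 "$shard" >/dev/null 2>&1; then
  echo "weights OK: $MODEL_PATH"
else
  # Broken symlinks: strip the date suffix (e.g. -0528) and search the
  # fallback directory for a dir that has real weight files.
  base="${MODEL_YAML_KEY%-[0-9]*}"   # DeepSeek-R1-0528 -> DeepSeek-R1
  for cand in "$FALLBACK_ROOT"/*"$base"*; do
    if ls "$cand"/model-00001* >/dev/null 2>&1; then
      MODEL_PATH="$cand"
      echo "fallback: $MODEL_PATH"
      break
    fi
  done
fi
```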
The MoRI bootstrap server binds to the default NIC (external IP), but UFW's policy DROP blocks inbound traffic from other nodes' external IPs (which may be on different subnets from the allowed 10.162.224.0/24). Add a pre-Docker srun step that collects each node's external IP and adds UFW allow rules on all nodes before SGLang starts, ensuring the bootstrap handshake can complete. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The barrier server bound to host_ip (external NIC IP), but the barrier check used IPADDRS (internal IPs from scontrol). On clusters where nodes have separate external and internal NICs, the check failed since internal IP connections were refused. Binding to 0.0.0.0 makes the barrier reachable on all interfaces. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
```shell
_UFW_DIR="${BENCHMARK_LOGS_DIR}/.ufw_${SLURM_JOB_ID}"
mkdir -p "$_UFW_DIR"
timeout 30 srun --nodelist="$SELECTED_NODELIST_SRUN" bash -c '
  _IFACE=$(ip route show default 2>/dev/null | awk "NR==1{print \$5}")
  _EXT_IP=$(ip -4 addr show "$_IFACE" 2>/dev/null | awk "/inet /{sub(/\/.*$/, \"\", \$2); print \$2; exit}")
  if [[ -n "$_EXT_IP" ]]; then
    echo "$_EXT_IP" > '"$_UFW_DIR"'/$SLURM_PROCID.ip
    echo "[INFO] $(hostname): external IP $_EXT_IP (rank $SLURM_PROCID)"
  fi
' 2>/dev/null || echo "[WARN] External IP collection failed (non-fatal)."
for _IP_FILE in "$_UFW_DIR"/*.ip; do
  [[ -f "$_IP_FILE" ]] || continue
  _PEER_IP=$(cat "$_IP_FILE")
  [[ -z "$_PEER_IP" ]] && continue
```
isn't this a one-time thing that persists between restarts? if it is a one-time thing, maybe move it to a utils folder?
Disaggregated inference runs with RDMA failures produce zero output tokens, resulting in TPOT=0. The intvty (inter-token latency) inverse computation 1000/tpot crashed with ZeroDivisionError, preventing the CI artifact upload step from being reached. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
utils/process_result.py
Outdated
```diff
+ tpot_val = float(value)
  data[key.replace('_ms', '').replace(
-     'tpot', 'intvty')] = 1000.0 / float(value)
+     'tpot', 'intvty')] = 1000.0 / tpot_val if tpot_val != 0.0 else 0.0
```
plz undo this change, if it is 0 (i.e. benchmarking was broken), then the expected behavior of this post processing script is that it should crash...
Could you check the status of the sweep? seems like it failed due to a slurm infra issue https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23994337726
Revert zero-TPOT guard in process_result.py: crash on broken runs is correct behavior so CI doesn't silently upload zero-valued results. Add comment to UFW setup block in job.slurm explaining why it runs per-job even though UFW rules persist across reboots (new/re-imaged nodes need the rules automatically). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
**Status update (2026-04-08)**

Addressing all outstanding review comments:

New sweep is now running (nodes 049+054+058 all healthy, all 8 RDMA ports ACTIVE on each):

Previous sweep failures (@billishyahao): root causes were (1)
Node 049 returned to service missing the 'ufw allow from 10.162.224.0/24' rule. This caused srun steps in multi-node batch jobs to hang (054/058 could not connect back to 049's srun process for I/O), resulting in all steps being CANCELLED after 17s (MessageTimeout). Fixed: added internal subnet + peer external IP rules to 049's UFW. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
**MI300X Disagg Performance Notes — RDMA SQ Saturation**

The ISL=1024 disagg results are degraded relative to ISL=8192. Root cause is RDMA Send Queue (SQ) saturation on the Broadcom Thor 2 NICs.

**What's happening**

In disaggregated inference, the prefill node computes the KV cache and transfers it to the decode node via RDMA using IBGDA (In-Band GPU Direct Access) — Broadcom's mechanism where the GPU posts RDMA work requests directly to the NIC's Send Queue, bypassing the CPU. The SQ has a fixed max depth of 4351 entries. When multiple concurrent requests are transferring KV cache simultaneously, the rate of new SQ posts exceeds the rate the NIC can drain completions, and the queue fills. Once the SQ is full, every subsequent KV transfer for that batch fails.

**Why ISL=1024 is worse than ISL=8192**

At ISL=1024 the prefill compute is fast, so many requests finish prefill and attempt RDMA transfer in rapid succession — hammering the SQ. At ISL=8192 prefill takes much longer, which naturally rate-limits KV transfer posting and keeps the SQ from filling.

**Why this shows up at high concurrency**

At low concurrency (conc=4), enough time passes between posts that completions drain and transfers eventually succeed. At high concurrency (conc=32+), the SQ stays permanently saturated and all transfers fail, producing 0 completed requests in the benchmark — which is why the ISL=1024 jobs show near-zero throughput above conc=32.

**Hardware constraint**

The SQ depth limit (4351) is a Broadcom Thor 2 NIC firmware/hardware limit — it cannot be increased in software. Both MI300X and MI325X use the same NICs.

**Fix path**

Two options in SGLang/MoRI:
Until one of these is implemented, ISL=1024 disagg throughput at high concurrency on MI300X will remain SQ-limited.
Previous sweep had several single-node jobs fail due to salloc resource contention (nodes occupied by 3-node multi-node jobs). Re-running to ensure complete pareto coverage across all configs. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Ports all MI325X disagg+BNXT fixes (from #985) to the MI300X cluster.
MI300X is gfx942 identical to MI325X, same Vultr/CPE cluster with Broadcom Thor 2 bnxt_re NICs.
3 working nodes: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3
Sweep: single-node w/wo MTP + multinode w/wo MTP, DeepSeek-R1 FP8, ISL 1k/1k and 8k/1k
Image: `ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt` (retag of MI325X image, same gfx942)

Closes #982