feat: MI300X disaggregated inference with Broadcom IBGDA (#982)#998

Open
JordanNanos wants to merge 24 commits into main from jordan/mi300x-disagg-bnxt

Conversation

@JordanNanos
Collaborator

Ports all MI325X disagg+BNXT fixes (from #985) to the MI300X cluster.

MI300X is gfx942, the same as MI325X, on the same Vultr/CPE cluster with Broadcom Thor 2 bnxt_re NICs.

3 working nodes: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3

Sweep: single-node w/wo MTP + multinode w/wo MTP, DeepSeek-R1 FP8, ISL 1k/1k and 8k/1k

Image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt (retag of MI325X image, same gfx942)

Closes #982

Port MI325X disagg+BNXT changes to MI300X cluster (3 working nodes:
mi300x-amds_0/2/3). MI300X is gfx942 identical to MI325X, using the
same Vultr/CPE Broadcom Thor 2 bnxt_re NICs.

Changes:
- Add mi300x-disagg runner group (amds_0/2/3) to runners.yaml
- Add dsr1-fp8-mi300x-sglang-disagg and -disagg-mtp benchmark configs
- Add dsr1-fp8-mi300x-sglang-mtp single-node MTP benchmark config
- Add chi-mi300x-* NIC detection (bnxt_re0-7) and QoS (TC=104, SL=3)
- Add dsr1_fp8_mi300x_sglang-disagg.sh multi-node benchmark script
- Add MTP support to single-node dsr1_fp8_mi300x.sh
- Port launch_mi300x-amds.sh to full multi-node launcher with MODEL_YAML_KEY
- Add docker/mi300x.Dockerfile + build script (registry API retag)
- Add scripts/manual-test-mi300x.sh
- Update perf-changelog.yaml

Image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

Comment on lines +90 to +108
JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}")

LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.out"

sleep 10

while ! ls "$LOG_FILE" &>/dev/null; do
    if ! squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"; then
        echo "ERROR: Job $JOB_ID failed before creating log file"
        scontrol show job "$JOB_ID"
        exit 1
    fi
    sleep 5
done

set +x

(
    while squeue -u $USER --noheader --format='%i' | grep -q "$JOB_ID"; do
Contributor

🔴 The multi-node path in runners/launch_mi300x-amds.sh captures JOB_ID from the benchmark script (line 90) without checking that the value is non-empty; if the inner script fails silently, JOB_ID is empty, LOG_FILE becomes slurm_job-.out (never created), and the wait loop spins forever. Compounding this, both grep -q "$JOB_ID" calls (inside the wait-for-log loop and the polling subshell) use unanchored substring matching, so an empty JOB_ID matches every squeue line, and a valid short ID like 1234 can spuriously match a sibling job 12345, causing the polling loop to keep running after the real job finishes. Fix: add if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi immediately after line 90, and change both grep calls to grep -qx "$JOB_ID" (exact line match).

Extended reasoning...

Bug 1 – Missing empty JOB_ID guard (primary)

At the point where JOB_ID=$(bash "benchmarks/${BENCHMARK_SUBDIR}/${SCRIPT_NAME}") executes in the multi-node path, the exit code of the inner script is silently ignored (no set -e is in effect at that point, and there is no explicit || guard). If dsr1_fp8_mi300x_sglang-disagg.sh calls submit.sh, which can exit 1 on failure, the outer script's command substitution captures an empty string into JOB_ID. At that point execution continues normally. The single-node else branch (added in the same PR) does include if [ -z "$JOB_ID" ]; then exit 1; fi after its own salloc capture, but the new multi-node if branch omits the equivalent check entirely.

How the infinite loop manifests

After the empty-JOB_ID assignment: LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out" — a file that never exists. The outer while ! ls "$LOG_FILE" loop therefore runs indefinitely. Inside the loop, squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID" becomes grep -q "", and an empty pattern matches every non-empty line of input, so as long as the user has any SLURM job running, the ! grep -q "" condition is always false and the early-exit branch never fires. The loop spins at 5-second intervals until all other user jobs happen to drain from the queue — which could take hours or days in a shared CI cluster.

Bug 2 – Unanchored grep causes false positive job-ID matches

Both grep calls use grep -q "$JOB_ID" against squeue --format='%i' output, which emits one integer job ID per line. grep matches substrings, not full lines. SLURM job IDs are sequentially assigned integers; if the current job is ID 123456 and another user job 1234567 is also in the queue, grep -q "123456" matches the line 1234567. Consequence at line 97 (wait-for-log loop): the early-exit guard never fires even after job 123456 fails and leaves the queue, so the loop waits forever for a log file that was never written. Consequence at line 108 (polling subshell): the while squeue … | grep -q "$JOB_ID" loop keeps looping after the real job exits because the sibling job keeps matching; POLL_PID never exits, so tail -F --pid=$POLL_PID keeps tailing, and wait $POLL_PID blocks indefinitely.

Step-by-step proof of the empty-JOB_ID infinite loop

  1. submit.sh inside dsr1_fp8_mi300x_sglang-disagg.sh exits non-zero (e.g., cluster unreachable).
  2. The benchmark script echoes nothing to stdout and exits 1.
  3. JOB_ID=$(bash "benchmarks/multi_node/…sh") — JOB_ID = "".
  4. No guard checks JOB_ID; execution continues.
  5. LOG_FILE="$BENCHMARK_LOGS_DIR/slurm_job-.out" — nonexistent path.
  6. Outer loop: while ! ls "slurm_job-.out" — always true.
  7. Inner check: squeue -u "$USER" --noheader --format='%i' | grep -q "" — empty pattern matches every non-empty line; as long as the cluster has any user job, ! grep -q "" is false, so exit 1 is never reached.
  8. Loop spins at 5 s intervals indefinitely. CI run hangs.

Fix

After the JOB_ID assignment: if [[ -z "$JOB_ID" ]]; then echo "ERROR: benchmark script produced no job ID"; exit 1; fi. Replace both grep calls with grep -qx "$JOB_ID" (or grep -q "^${JOB_ID}$") to enforce exact-line matching and prevent substring collisions.
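The two fixes can be sketched together. This is a minimal, self-contained illustration: the job IDs are made up, and printf stands in for squeue so it runs without SLURM.

```shell
# Simulated `squeue --noheader --format='%i'` output: one hypothetical job ID
# per line. Our job (1234) has already left the queue; 12345 is a sibling job.
QUEUE="12345
98765"
JOB_ID="1234"

# Guard: fail fast instead of polling forever on an empty JOB_ID.
if [ -z "$JOB_ID" ]; then
  echo "ERROR: benchmark script produced no job ID" >&2
  exit 1
fi

# Unanchored grep: sibling job 12345 falsely matches "1234".
if printf '%s\n' "$QUEUE" | grep -q "$JOB_ID"; then
  echo "grep -q: false positive, polling loop would keep spinning"
fi

# Anchored exact-line match: the finished job no longer matches anything.
if ! printf '%s\n' "$QUEUE" | grep -qx "$JOB_ID"; then
  echo "grep -qx: job correctly detected as finished"
fi
```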

- "Image (single-node): lmsysorg/sglang:v0.5.9-rocm700-mi30x"
- "Full pareto sweep: non-MTP and MTP configs across 2 curve points (no EP/DP), ISL 1k/1k and 8k/1k"
- "Runners: mi300x-amds_0, mi300x-amds_2, mi300x-amds_3 (3 working nodes)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982
Contributor


🟡 The pr-link field in the new perf-changelog.yaml entry points to issues/982 (the tracking issue) instead of the implementing PR URL. All other non-placeholder entries in the file use the /pull/<number> format; the correct link for this PR should be /pull/998.

Extended reasoning...

What the bug is: The newly added perf-changelog.yaml entry (lines 1-15) sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982. The field is named pr-link and is intended to reference the pull request that implements the change, not the issue that motivated it.

The specific code path: Line 13 of perf-changelog.yaml reads:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982

Every other properly-numbered entry in the file uses the /pull/<number> format (e.g., /pull/919, /pull/857, /pull/95). When a PR number is not yet known, the convention in this file is to use /pull/XXX as a placeholder (as in the entry for kimik2.5-int4-mi300x-vllm directly below).

Why existing code does not prevent it: There is no automated validation on pr-link values in the changelog; it is a free-text YAML field, so incorrect URLs are not caught at commit time.

Impact: Anyone reviewing the changelog will follow the pr-link to GitHub issue #982 (the tracking issue) rather than to this implementing PR (#998). This breaks the intended traceability — readers cannot navigate from the changelog entry to the code change itself.

How to fix it: Change line 13 from:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/issues/982

to:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/998

Step-by-step proof:

  1. The PR description states "Closes #982", meaning #982 is the tracking issue being closed, not the PR number.
  2. This PR is #998, "feat: MI300X disaggregated inference with Broadcom IBGDA (#982)" (confirmed by <pr number="998"> in the metadata).
  3. Scanning perf-changelog.yaml, every entry with a known PR number uses the /pull/<number> format.
  4. Entries with unknown PR numbers use /pull/XXX as a placeholder.
  5. The new entry is the only one using an /issues/ URL, making it factually incorrect metadata.

On the duplicate claim: One verifier noted this may overlap with bug_005. The underlying bug is real and independently confirmed by three verifiers; the issue stands on its own merits.

Comment on lines +8 to +14
#SBATCH --time=00:15:00
#SBATCH --output=/home/j9s/build-mi300x-%j.log
#SBATCH --error=/home/j9s/build-mi300x-%j.log
#SBATCH --chdir=/tmp

set -euo pipefail
bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh
Contributor


🟡 docker/build-sglang-bnxt-mi300x.sbatch hardcodes personal paths for user j9s (log dirs under /home/j9s/ and script path under /nfsdata/sa/j9s/), making the file unusable by any other team member. Fix by replacing the log paths with a shared or home-relative path (e.g. $HOME or %u in the #SBATCH directive) and the script path with a SLURM_SUBMIT_DIR-relative reference.

Extended reasoning...

Bug: Hardcoded personal user paths in build-sglang-bnxt-mi300x.sbatch

What the bug is and how it manifests

docker/build-sglang-bnxt-mi300x.sbatch is a new file introduced by this PR that contains three hardcoded absolute paths belonging to a single cluster user (j9s):

  • Line 9: #SBATCH --output=/home/j9s/build-mi300x-%j.log
  • Line 10: #SBATCH --error=/home/j9s/build-mi300x-%j.log
  • Line 14: bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh

Any other team member who runs sbatch docker/build-sglang-bnxt-mi300x.sbatch will get an immediate failure: SLURM will reject the job (or fail silently at output redirection) because /home/j9s/ does not exist for them, and the bash invocation will fail with "No such file or directory" for the NFS path.

The specific code path that triggers it

The job is submitted with sbatch. SLURM processes the #SBATCH --output and --error directives before the script body runs, attempting to open the log files under /home/j9s/. Then line 14 executes bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh, which resolves to a personal NFS home directory on the cluster.

Why existing code doesn't prevent it

The #SBATCH directives are evaluated statically by SLURM at submission time; environment variables like $HOME are not expanded in #SBATCH comment lines. However, SLURM does support the %u token (username expansion) in output/error paths, which would make them portable. The command on line 14 is a normal shell line and could use ${SLURM_SUBMIT_DIR} or a path relative to --chdir to reference the script in the repo, but --chdir=/tmp was set on line 11, making relative paths from the repo root unavailable without an explicit variable.

What the impact would be

Any team member other than j9s who tries to rebuild the MI300X image by submitting this batch job will get a broken job. The log output will be lost (or SLURM may refuse submission), and the actual build script will not be found, so the retag operation cannot complete. While this is a developer convenience script and the image is already published, the file is committed to a shared repository and gives the false impression that it is reusable.

How to fix it

  1. Replace the log paths with a shared or user-relative path, e.g.:

    #SBATCH --output=%u-build-mi300x-%j.log
    #SBATCH --error=%u-build-mi300x-%j.log
    

    Or use a writable shared directory such as /tmp/%u-build-mi300x-%j.log.

  2. Replace the hardcoded script path with one derived from the submission directory:

    bash "${SLURM_SUBMIT_DIR}/build-sglang-bnxt-mi300x.sh"
    

    (changing --chdir to something other than /tmp, or removing it and submitting from the repo root).

Step-by-step proof

  1. User alice clones the repo and runs: sbatch docker/build-sglang-bnxt-mi300x.sbatch
  2. SLURM reads --output=/home/j9s/build-mi300x-$SLURM_JOB_ID.log — /home/j9s does not exist for alice → SLURM may fail to open the log file or silently discard output.
  3. Even if the job starts, line 14 runs: bash /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh — this NFS path is in j9s's personal directory and inaccessible to alice → bash: /nfsdata/sa/j9s/inferencex/docker/build-sglang-bnxt-mi300x.sh: No such file or directory, exit code 1.
  4. The build fails immediately without executing any docker registry logic.

- launch_mi300x-amds.sh: guard empty JOB_ID after benchmark script,
  use grep -qx for exact-line matching to prevent substring false positives
- perf-changelog.yaml: fix pr-link from issues/982 to pull/998
- build-sglang-bnxt-mi300x.sbatch: replace hardcoded j9s paths with
  %u (SLURM username token) and SLURM_SUBMIT_DIR
MI300X uses a node-local HF cache (/home/gharunner/gharunners/hf-hub-cache/)
that is not accessible from the controller where the launcher runs.
This causes MODEL_NAME to fall back to the bare name (e.g. DeepSeek-R1-0528)
instead of the full snapshot path. Add HF cache resolution inside job.slurm
which runs on the compute node where the cache actually exists.


dsr1-fp8-mi300x-sglang-disagg:
image: ghcr.io/jordannanos/sgl-mi300x-mori:v0.5.9-bnxt
Contributor


@JordanNanos check ur email for access to https://hub.docker.com/u/semianalysiswork

nit: can u move this to our official SemiAnalysisAI public docker hub

also nit: why is the mi325 disagg image different from the mi300 disagg image? mi325 & mi300 are both cdna3 architecture, just like h100 & h200 share the same architecture/stack, so a single image should work for both mi300 & mi325.

Collaborator Author


yeah it's the same image, just changed the tag

JordanNanos and others added 4 commits April 3, 2026 19:35
MODEL_YAML_KEY only contains the repo name (DeepSeek-R1-0528) without
the org prefix. Globbing models--*--{MODEL_NAME} finds the correct
HF cache dir (e.g. models--deepseek-ai--DeepSeek-R1-0528) regardless
of org. The model exists on MI300X nodes (confirmed by successful
single-node runs) but was not found due to the incomplete dir name.
- Use single shared image (semianalysiswork/sgl-bnxt-cdna3:v0.5.9-bnxt) for
  both MI300X and MI325X disagg configs — MI300X (gfx942) and MI325X (gfx942)
  share the same CDNA3 architecture and the same image works for both
- Move image from personal GHCR (ghcr.io/jordannanos/) to official
  docker.io/semianalysiswork/ registry
- Rewrite build-sglang-bnxt-mi300x.sh to use crane for cross-registry copy
  (GHCR → Docker Hub); add crane auto-download to sbatch wrapper
- Add push-to-dockerhub.py / .sbatch as reference implementation

Addresses review comments on PR #998 from functionstackx.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
In job 4729, nodes 057/058 were temporarily overloaded (likely from
prior docker pulls), causing every srun step targeting those nodes to
be CANCELLED after ~17s with exit code 0:53.  This resulted in empty
IPADDRS ("104.207.141.247,,") and a 63-minute wasted job where the
main docker srun was also cancelled.

Changes:
- check_model_path: check path on batch node directly (NFS is shared)
  instead of srun-ing to all nodes
- IP gathering: use scontrol to read NodeAddr instead of srun+ip-route;
  fail loudly with a clear error if scontrol can't resolve an IP
- NFS refresh srun: add || echo "[WARN]" so a cancellation doesn't abort
  the batch script via set -e

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@JordanNanos
Collaborator Author

Multi-node MI300X CI status

All review comments from claude[bot] and functionstackx have been addressed (image on semianalysiswork Docker Hub, same CDNA3 image for MI300X and MI325X, sbatch paths fixed, JOB_ID guard, pr-link).

The multi-node disagg CI jobs are failing due to a pre-existing cluster infrastructure issue: every srun step targeting nodes 057/058 within a batch job is CANCELLED after ~17 seconds with exit code 0:53 (SLURM internal signal). This causes the docker containers on the decode nodes to never start.

Diagnosis:

  • sacct confirms this pattern across all multi-node MI300X jobs on this cluster (jobs 4689, 4692, 4695, 4698, 4729, 4752, etc.) — none have ever completed
  • Single-node sruns on those nodes work fine (confirmed with job 4751)
  • Multi-node srun from within a batch job fails on nodes 057/058 (likely a SLURM step prolog issue on those nodes)
  • This is unrelated to any code changes in this PR

Latest commit 27fab49 also improves job robustness:

  • IP gathering now uses scontrol instead of srun (avoids 30+ min of failed IP-gathering srun steps)
  • Model path check now runs locally on the batch node (NFS is shared)
  • Total job runtime reduced from ~63 min → ~31 min before failing

The cluster-level srun issue needs to be investigated/fixed separately by the cluster admins.

@JordanNanos
Collaborator Author

/sweep

@github-actions
Contributor

github-actions bot commented Apr 4, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23975459520
Command: ``
Pinned ref: 27fab49
Approval: not required (trusted collaborator).

@JordanNanos
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@github-actions
Contributor

github-actions bot commented Apr 4, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23975486262
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp
Pinned ref: 27fab49
Approval: not required (trusted collaborator).

@JordanNanos
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23992085651
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp
Pinned ref: 27fab49
Approval: not required (trusted collaborator).

@JordanNanos
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23993399652
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg dsr1-fp8-mi300x-sglang-disagg-mtp
Pinned ref: 27fab49
Approval: not required (trusted collaborator).

While node 057 is being rebooted, run only the low-concurrency (1P+1D)
configs that fit on 2 nodes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@JordanNanos
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23993923088
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node
Pinned ref: a4e56c3
Approval: not required (trusted collaborator).

@JordanNanos
Collaborator Author

/sweep test-config --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node --config-files .github/configs/amd-master.yaml

@github-actions
Contributor

github-actions bot commented Apr 5, 2026

@JordanNanos Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23994337726
Command: test-config --config-keys dsr1-fp8-mi300x-sglang-disagg-2node dsr1-fp8-mi300x-sglang-disagg-mtp-2node --config-files .github/configs/amd-master.yaml
Pinned ref: a4e56c3
Approval: not required (trusted collaborator).

Worker nodes only have munge installed for root. Jobs submitted by
gharunner run as gharunner on worker nodes, causing srun step launches
to fail with auth/munge errors. Using sudo sbatch ensures the job runs
as root where munge is available.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
JordanNanos and others added 9 commits April 5, 2026 15:33
sbatch is now run via sudo, so jobs are owned by root rather than
gharunner. squeue -u gharunner can't see root-owned jobs, causing the
launcher to immediately declare the job failed. Use squeue -j JOB_ID
which works regardless of job owner.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
sudo resets the environment by default (env_reset in sudoers), so all
exported vars (MODEL_YAML_KEY, MODEL_PATH, CONTAINER_IMAGE, etc.) were
lost and job.slurm received empty model name. -E preserves the caller's
environment through sudo.

Also includes the squeue -j fix from the previous commit that wasn't
picked up by the last run due to runner caching.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
MODEL_NAME can be a HF cache path like
models--org--repo/snapshots/<hash> which contains '/' characters that
are invalid in Docker container names. Apply the same tr sanitization
already used for USER_NAME.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
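A sketch of that sanitization (variable values are illustrative; the real script applies the same tr call already used for USER_NAME):

```shell
# MODEL_NAME may be an HF cache path containing '/', which is invalid
# in Docker container names; replace each '/' with '_'.
MODEL_NAME="models--deepseek-ai--DeepSeek-R1-0528/snapshots/abc123"
SAFE_MODEL_NAME=$(echo "$MODEL_NAME" | tr '/' '_')
CONTAINER_NAME="sglang_${SAFE_MODEL_NAME}"
echo "$CONTAINER_NAME"
```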
If a worker node has a slow or missing NFS mount, the ls/stat/cat
calls block forever. Add --timeout=30 to srun and timeout 10 to each
individual command so the non-fatal NFS refresh step can't hold up
the entire job.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
srun does not have a --timeout option; wrap the entire srun call with
the timeout(1) command instead.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
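The pattern looks roughly like this; `sleep 5` stands in for the srun invocation so the sketch runs without SLURM.

```shell
# srun has no --timeout flag, so timeout(1) wraps the whole command.
status=0
timeout 1 sleep 5 || status=$?   # stand-in for: timeout 30 srun ... bash -c '...'
if [ "$status" -eq 124 ]; then   # 124 means timeout(1) killed the command
  echo "[WARN] NFS refresh timed out (non-fatal)."
fi
```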
DeepSeek-R1-0528 FP8 requires more VRAM than the original R1. With
TP4 (4x192GB = 768GB) the prefill server OOMs even at
mem_fraction_static=0.8. Bump to TP8 (8x192GB = 1536GB) to give
adequate headroom.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The chi-mi300x nodes have model weights at /vfs/models_blog/ but the
HF hub cache at /home/gharunner/.../hf-hub-cache has snapshot symlinks
pointing to missing blob files (metadata only). This caused SGLang to
crash loading shard 35+ with FileNotFoundError.

After selecting MODEL_PATH, verify that model-00001 is readable. If not
(broken symlinks), strip the date suffix from MODEL_YAML_KEY and search
/vfs/models_blog/ for a matching directory with actual weight files.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The MoRI bootstrap server binds to the default NIC (external IP), but
UFW's policy DROP blocks inbound traffic from other nodes' external IPs
(which may be on different subnets from the allowed 10.162.224.0/24).

Add a pre-Docker srun step that collects each node's external IP and
adds UFW allow rules on all nodes before SGLang starts, ensuring the
bootstrap handshake can complete.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The barrier server bound to host_ip (external NIC IP), but the barrier
check used IPADDRS (internal IPs from scontrol). On clusters where nodes
have separate external and internal NICs, the check failed since internal
IP connections were refused.

Binding to 0.0.0.0 makes the barrier reachable on all interfaces.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Comment on lines +371 to +384
_UFW_DIR="${BENCHMARK_LOGS_DIR}/.ufw_${SLURM_JOB_ID}"
mkdir -p "$_UFW_DIR"
timeout 30 srun --nodelist="$SELECTED_NODELIST_SRUN" bash -c '
    _IFACE=$(ip route show default 2>/dev/null | awk "NR==1{print \$5}")
    _EXT_IP=$(ip -4 addr show "$_IFACE" 2>/dev/null | awk "/inet /{sub(/\/.*$/, \"\", \$2); print \$2; exit}")
    if [[ -n "$_EXT_IP" ]]; then
        echo "$_EXT_IP" > '"$_UFW_DIR"'/$SLURM_PROCID.ip
        echo "[INFO] $(hostname): external IP $_EXT_IP (rank $SLURM_PROCID)"
    fi
' 2>/dev/null || echo "[WARN] External IP collection failed (non-fatal)."
for _IP_FILE in "$_UFW_DIR"/*.ip; do
    [[ -f "$_IP_FILE" ]] || continue
    _PEER_IP=$(cat "$_IP_FILE")
    [[ -z "$_PEER_IP" ]] && continue
Contributor


isn't this a one-time thing that persists between restarts? if it is, maybe move it to a utils folder?

Disaggregated inference runs with RDMA failures produce zero output
tokens, resulting in TPOT=0. The intvty (interactivity, the inverse of
inter-token latency) computation 1000/tpot then crashed with
ZeroDivisionError, preventing the CI artifact upload step from being reached.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
+ tpot_val = float(value)
  data[key.replace('_ms', '').replace(
-     'tpot', 'intvty')] = 1000.0 / float(value)
+     'tpot', 'intvty')] = 1000.0 / tpot_val if tpot_val != 0.0 else 0.0
Contributor


plz undo this change, if it is 0 (i.e. benchmarking was broken), then the expected behavior of this post-processing script is that it should crash...

@billishyahao
Collaborator

Could you check the status of the sweep? Seems like it failed due to a slurm infra issue: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23994337726

JordanNanos and others added 2 commits April 8, 2026 05:20
Revert zero-TPOT guard in process_result.py: crash on broken runs is
correct behavior so CI doesn't silently upload zero-valued results.

Add comment to UFW setup block in job.slurm explaining why it runs
per-job even though UFW rules persist across reboots (new/re-imaged
nodes need the rules automatically).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@JordanNanos
Collaborator Author

Status update (2026-04-08)

Addressing all outstanding review comments:

  • process_result.py revert (1b936c5): Reverted the zero-TPOT guard per @functionstackx — crash-on-broken-run is the correct behavior so bad results never reach the DB.
  • job.slurm UFW comment: Added a comment explaining why the firewall setup runs per-job (new/re-imaged nodes need the rules; UFW persists across reboots but not across machine replacements).
  • Merged latest main (0ab1fcd): Branch was missing 4 changelog entries from PRs #1001/#1002/#1003/#1008, which caused the changelog validator to see "deletions" and fail the setup CI step.

New sweep is now running (nodes 049+054+058 all healthy, all 8 RDMA ports ACTIVE on each):

  • Job 4889 running (3-node: 1P+2D config)
  • Jobs 4890/4891 queued

Previous sweep failures (@billishyahao): root causes were (1) bnxt_re3 link flapping on node 054 causing RDMA QP failures on ISL=1024 runs, and (2) process_result.py crashing on zero TPOT from the broken runs, which prevented artifact upload. The link has since recovered and both issues are now addressed.

Node 049 returned to service missing the 'ufw allow from 10.162.224.0/24'
rule. This caused srun steps in multi-node batch jobs to hang (054/058
could not connect back to 049's srun process for I/O), resulting in
all steps being CANCELLED after 17s (MessageTimeout).

Fixed: added internal subnet + peer external IP rules to 049's UFW.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@JordanNanos
Collaborator Author

MI300X Disagg Performance Notes — RDMA SQ Saturation

The ISL=1024 disagg results are degraded relative to ISL=8192. Root cause is RDMA Send Queue (SQ) saturation on the Broadcom Thor 2 NICs (bnxt_re) used for GPU-direct KV cache transfers between prefill and decode nodes.

What's happening

In disaggregated inference, the prefill node computes the KV cache and transfers it to the decode node via RDMA using IBGDA (InfiniBand GPUDirect Async) — the mechanism, here on Broadcom NICs, where the GPU posts RDMA work requests directly to the NIC's Send Queue, bypassing the CPU.

The SQ has a fixed max depth of 4351 entries. When multiple concurrent requests are transferring KV cache simultaneously, the rate of new SQ posts exceeds the rate the NIC can drain completions, and the queue fills:

SQ full (batch): ep=0 depth=4351 requested=1 max=4351

Once the SQ is full, every subsequent KV transfer for that batch fails:

Prefill transfer failed ... KV transfer failed
Decode transfer failed ... KV transfer failed

Why ISL=1024 is worse than ISL=8192

At ISL=1024 the prefill compute is fast, so many requests finish prefill and attempt RDMA transfer in rapid succession — hammering the SQ. At ISL=8192 prefill takes much longer, which naturally rate-limits KV transfer posting and keeps the SQ from filling.

Why this shows up at high concurrency

At low concurrency (conc=4), enough time passes between posts that completions drain and transfers eventually succeed. At high concurrency (conc=32+), the SQ stays permanently saturated and all transfers fail, producing 0 completed requests in the benchmark — which is why the ISL=1024 jobs show near-zero throughput above conc=32.

Hardware constraint

The SQ depth limit (4351) is a Broadcom Thor 2 NIC firmware/hardware limit — it cannot be increased in software. Both MI300X and MI325X use the same bnxt_re NICs with IBGDA, but MI300X's faster gfx942 prefill at short ISLs makes it hit this ceiling more aggressively.

Fix path

Two options in SGLang/MoRI:

  1. Backpressure: throttle KV transfer posting when the SQ occupancy is near max (monitor completion rate and insert delay when depth approaches 4351)
  2. Transfer batching: coalesce multiple small KV transfers into fewer, larger RDMA ops to reduce the number of SQ entries per request

Until one of these is implemented, ISL=1024 disagg throughput at high concurrency on MI300X will remain SQ-limited.
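Option 1 (backpressure) can be illustrated with a toy occupancy model. All numbers except the 4351-entry SQ depth are made up, and the real implementation would live inside SGLang/MoRI, not in shell — this only shows the throttling shape.

```shell
SQ_MAX=4351
THRESH=$((SQ_MAX * 9 / 10))   # back off above ~90% occupancy
SQ_DEPTH=0
posted=0
throttled=0
tick=0
while [ "$tick" -lt 20 ]; do
  tick=$((tick + 1))
  if [ $((SQ_DEPTH + 500)) -gt "$THRESH" ]; then
    throttled=$((throttled + 1))        # hold the post: queue nearly full
  else
    SQ_DEPTH=$((SQ_DEPTH + 500))        # post a batch of 500 work requests
    posted=$((posted + 500))
  fi
  SQ_DEPTH=$((SQ_DEPTH - 100))          # NIC drains ~100 completions per tick
  [ "$SQ_DEPTH" -lt 0 ] && SQ_DEPTH=0
done
echo "posted=$posted throttled_ticks=$throttled final_depth=$SQ_DEPTH"
```

Under these toy rates the poster is held back roughly every other tick once occupancy nears the ceiling, but the queue never overflows — which is the behavior the real SQ never reaches at ISL=1024 today.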

JordanNanos and others added 2 commits April 8, 2026 18:43
Previous sweep had several single-node jobs fail due to salloc resource
contention (nodes occupied by 3-node multi-node jobs). Re-running to
ensure complete pareto coverage across all configs.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Development

Successfully merging this pull request may close these issues.

starter task: MVP port mi355 deepseek disagg recipe to mi300
