834fe82: Add MI325X DeepSeek-R1 FP8 disaggregated inference (1P1D, Broadcom Th… (Mar 31, 2026)
7b50476: Update amd-master.yaml (JordanNanos, Mar 31, 2026)
b40908c: Add MTP config, expand sweep to full pareto frontier, use -good image (Mar 31, 2026)
2421ca5: Add perf-changelog entry for MI325X disagg configs (Mar 31, 2026)
6abdf85: Fix MI325X QoS detection and NFS-safe cleanup for disagg benchmarks (JordanNanos, Apr 1, 2026)
3716258: Add local NVMe model caching for faster model loading (JordanNanos, Apr 1, 2026)
db677bd: Switch model caching from rsync to rclone sync (JordanNanos, Apr 1, 2026)
0a485de: Add MTP baseline to single-node MI325X DeepSeek-R1 FP8 config (JordanNanos, Apr 1, 2026)
67dec7c: Split MI325X single-node MTP into separate config key (JordanNanos, Apr 1, 2026)
f18257f: Fix MI325X single-node script resolution and add MTP support (JordanNanos, Apr 2, 2026)
3ccfba3: Fix decode dispatch token limit for DP attention disagg configs (JordanNanos, Apr 2, 2026)
0213032: Disable EP8/DP disagg configs on MI325X and bump MTP to 3 tokens (JordanNanos, Apr 2, 2026)
2afb24a: Add single-node EP8/DP test configs for MI325X disagg (JordanNanos, Apr 2, 2026)
36aebfd: Move container image to semianalysiswork Docker Hub and fix launcher … (JordanNanos, Apr 3, 2026)
b5a0bc2: Test EP8/DP workaround: drop MoRI a2a backend on MI325X bnxt_re (JordanNanos, Apr 4, 2026)
beb3808: Fix MODEL_NAME for EP8/DP test configs with MODEL_YAML_KEY override (JordanNanos, Apr 4, 2026)
23c2931: fix: resolve MODEL_NAME from flat repo dir when HF snapshot absent (JordanNanos, Apr 4, 2026)
e5b9d00: Tune EP8/DP test: lower concurrency + QP params for SQ full fix (JordanNanos, Apr 4, 2026)
76d89d0: fix: lower bnxt_re QP limits and concurrency for MI325X EP8/DP disagg (JordanNanos, Apr 5, 2026)
149 changes: 149 additions & 0 deletions .github/configs/amd-master.yaml
@@ -1231,3 +1231,152 @@ dsr1-fp4-mi355x-sglang-disagg-mtp:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=1"


dsr1-fp8-mi325x-sglang-disagg:
Contributor: You're also missing the perf-changelog .yaml entry.

image: ghcr.io/jordannanos/sgl-mi325x-mori:v0.5.9-bnxt
Contributor: What is this built with? Can you add a permalink to the Dockerfile and your docker build commands?

Collaborator (author): It's described in the PR description.

Contributor: Can you add the exact commands so this is reproducible: the git clone (including which SGLang hash), the wget for the Broadcom drivers, and the docker build?

Contributor: The build command is currently empty in your PR description; can you fix that? In general, we'd prefer the build scripts to be checked into the repo rather than kept in the PR description.

Collaborator (author): It's not a wget. I had to manually download the driver from the Broadcom site and copy the tarball over to the cluster, matched to the exact firmware version installed there. This applies to all Thor2 NICs.


Contributor: Yes, maybe in utils/? And add a README covering how to manually download the tarball from Broadcom. That way we can get an AMD engineer to read your Dockerfile and build command and fix the upstream builds.

model: deepseek-ai/DeepSeek-R1-0528
model-prefix: dsr1
runner: mi325x-disagg
precision: fp8
framework: sglang-disagg
multinode: true
disagg: true
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
# "Top of curve" (1 prefill worker at TP8, 1 decode worker at DEP8)
- spec-decoding: "none"
conc-list: [ 512, 1024 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Middle of curve" (1 prefill worker at TP8, 2 decode workers at DEP8)
- spec-decoding: "none"
conc-list: [ 768, 512, 256 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 2
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Bottom of curve" (1 prefill worker at TP8, 2 decode workers at TP8)
- spec-decoding: "none"
conc-list: [ 256, 128, 64, 32, 16, 8, 4 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 2
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Low concurrency" (1 prefill worker at TP4, 1 decode worker at TP8)
Contributor: Are you sure TP4 is on the Pareto frontier here? Do you have a graph?

Contributor: You only have the TP4 curve, and with "hide non-optimal" enabled? Can you run the remaining 24 data points?

- spec-decoding: "none"
conc-list: [ 64, 32, 16, 8, 4, 2, 1 ]
prefill:
num-worker: 1
tp: 4
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=0"

- isl: 8192
osl: 1024
search-space:
# "Top of curve" (2 prefill workers at DEP8, 1 decode worker at DEP8)
- spec-decoding: "none"
conc-list: [ 512, 1024 ]
prefill:
num-worker: 2
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "PREFILL_NODES=2"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=0"

# "Bottom of curve" (1 prefill worker at TP8, 2 decode workers at TP8)
- spec-decoding: "none"
conc-list: [ 256, 128, 64, 32, 16, 8, 4 ]
prefill:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 2
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=2"
- "DECODE_MTP_SIZE=0"

# "Low concurrency" (1 prefill worker at TP4, 1 decode worker at TP8)
- spec-decoding: "none"
conc-list: [ 64, 32, 16, 8, 4, 2, 1 ]
prefill:
num-worker: 1
tp: 4
ep: 1
dp-attn: false
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 1
dp-attn: false
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=0"
5 changes: 5 additions & 0 deletions .github/configs/runners.yaml
@@ -75,6 +75,11 @@ mi325x:
- 'mi325x-amd_1'
- 'mi325x-amd_2'
- 'mi325x-amd_3'
mi325x-disagg:
- 'mi325x-amd_0'
- 'mi325x-amd_1'
- 'mi325x-amd_2'
- 'mi325x-amd_3'
mi355x:
- 'mi355x-amds_0'
- 'mi355x-amds_1'
3 changes: 3 additions & 0 deletions benchmarks/multi_node/amd_utils/env.sh
@@ -20,6 +20,9 @@ if [[ -z "$IBDEVICES" ]]; then
export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
elif [[ $NODENAME == mia1* ]]; then
export IBDEVICES=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
elif [[ $NODENAME == chi-mi325x* ]]; then
# Vultr/CPE MI325X cluster: Broadcom RoCE (bnxt_re); bnxt_re6 is DOWN, skip it
export IBDEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re7,bnxt_re8
else
echo "ERROR: Unable to detect cluster from hostname $NODENAME and IBDEVICES not set" >&2
exit 1
20 changes: 12 additions & 8 deletions benchmarks/multi_node/amd_utils/job.slurm
@@ -30,14 +30,18 @@ if [[ ! -f "$MODELS_YAML" ]]; then
exit 1
fi

# Validate MODEL_NAME exists as a top-level key in models.yaml
if ! grep -q "^${MODEL_NAME}:" "$MODELS_YAML"; then
echo "Error: Model '$MODEL_NAME' not found in models.yaml"
# MODEL_YAML_KEY is the models.yaml lookup key (bare model name, e.g. DeepSeek-R1-0528).
# MODEL_NAME may be a longer HF cache path (e.g. models--org--repo/snapshots/<hash>).
_MODEL_YAML_KEY="${MODEL_YAML_KEY:-$MODEL_NAME}"

# Validate the yaml key exists as a top-level key in models.yaml
if ! grep -q "^${_MODEL_YAML_KEY}:" "$MODELS_YAML"; then
echo "Error: Model '$_MODEL_YAML_KEY' not found in models.yaml"
echo "Available models:"
grep -E '^[A-Za-z]' "$MODELS_YAML" | sed 's/:.*$//' | sed 's/^/ - /'
exit 1
fi
echo "Model found: $MODEL_NAME"
echo "Model found: $_MODEL_YAML_KEY"

# All models use server.sh as the entrypoint
RUN_FILE="server.sh"
@@ -249,10 +253,9 @@ echo "NNODES is ${NNODES}"
echo "REPO Directory is ${DI_REPO_DIR}"
echo "USER_NAME is ${USER_NAME}"

# Get the RDMA priority and DSCP value from the NIC
# Get the RDMA priority and DSCP value from the NIC (optional - env.sh handles absence gracefully)
if ! command -v nicctl >/dev/null 2>&1; then
echo "Error: nicctl command not found. Please ensure nicctl is installed and available." >&2
exit 1
echo "[INFO] nicctl not found. RDMA QoS configuration will be skipped inside the container." >&2
fi

# Reduce log spam
Expand Down Expand Up @@ -357,7 +360,7 @@ exec sudo docker run --rm \
--privileged \
-v ${MODEL_DIR}:/models \
-v \$HOME/.ssh:/root/.ssh \
-v $(which nicctl):/usr/sbin/nicctl \
$(command -v nicctl &>/dev/null && echo "-v $(which nicctl):/usr/sbin/nicctl") \
Comment on lines 256 to +411. Contributor: Can you verify whether these changes break MI355 disagg? cc @Oseltamivir

Collaborator (author): The check for nicctl was breaking on this cluster. MoRI needs it to enforce QoS; it's disabled for now since it isn't installed on these nodes or in the container we built, and it seems unnecessary.

--shm-size 128G \
-v /tmp:/run_logs \
-v ${BENCHMARK_LOGS_DIR}:/benchmark_logs \
@@ -373,6 +376,7 @@ exec sudo docker run --rm \
-e xP=\$xP \
-e yD=\$yD \
-e MODEL_NAME=\$MODEL_NAME \
-e MODEL_YAML_KEY=${_MODEL_YAML_KEY} \
-e IPADDRS=\$IPADDRS \
-e PREFILL_TP_SIZE=\$PREFILL_TP_SIZE \
-e PREFILL_ENABLE_EP=\$PREFILL_ENABLE_EP \
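The conditional `nicctl` mount in the job.slurm diff above relies on a general bash idiom: emit a `-v` flag only when the host binary actually exists, so `docker run` silently skips the mount otherwise. A minimal sketch of that idiom, using a hypothetical helper name (`optional_mount` is not in the PR):

```shell
#!/usr/bin/env bash
# Emit "-v <host-path>:<dest>" only if the named binary exists on the host.
# When it does not, emit nothing, so the docker run command line simply
# omits the volume mount instead of failing on an empty path.
optional_mount() {
  local bin="$1" dest="$2"
  local path
  path="$(command -v "$bin" 2>/dev/null)" || return 0  # missing: emit nothing
  printf -- '-v %s:%s' "$path" "$dest"
}

# "ls" exists on essentially every system, so this emits a mount flag:
optional_mount ls /usr/sbin/ls; echo
# A missing binary emits nothing:
optional_mount definitely-missing-binary-xyz /usr/sbin/nicctl; echo
```

This is the same shape as the diff's inline `$(command -v nicctl &>/dev/null && echo "-v $(which nicctl):/usr/sbin/nicctl")`, just factored into a reusable function.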
3 changes: 2 additions & 1 deletion benchmarks/multi_node/amd_utils/server.sh
@@ -72,11 +72,12 @@ fi
# Load model config via inline Python (PyYAML is available in SGLang containers)
# Formula evaluation (e.g. "SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK * TP * xP")
# is done here in Python to avoid bash glob-expanding the * characters.
_MODEL_YAML_KEY="${MODEL_YAML_KEY:-$MODEL_NAME}"
eval "$(python3 -c "
import yaml, sys, os

config_path = '${MODELS_YAML}'
model_name = '${MODEL_NAME}'
model_name = '${_MODEL_YAML_KEY}'

with open(config_path) as f:
models = yaml.safe_load(f)
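The server.sh comment above explains why formula strings like "SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK * TP * xP" are evaluated in Python rather than bash: an unquoted `*` in the shell is subject to glob expansion. A minimal sketch of that pattern, with hypothetical keys and values (not the repo's actual models.yaml contents):

```python
# Formula strings from a model config are evaluated in Python, where '*'
# is plain multiplication and never reaches the shell's glob expansion.
model_config = {
    # Hypothetical formula string, shaped like the one quoted in server.sh.
    "max_dispatch_tokens": "BASE * TP * xP",
}

# Evaluate against an explicit, restricted namespace of known variables
# (hypothetical values), with builtins disabled.
namespace = {"BASE": 128, "TP": 8, "xP": 2}
value = eval(model_config["max_dispatch_tokens"], {"__builtins__": {}}, namespace)
print(value)  # 128 * 8 * 2 = 2048
```

In the actual script the result is then emitted as `export VAR=value` lines and consumed via `eval "$(python3 -c …)"` in bash.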
82 changes: 82 additions & 0 deletions benchmarks/multi_node/dsr1_fp8_mi325x_sglang-disagg.sh
@@ -0,0 +1,82 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
CONC_LIST \
ISL \
OSL \
IMAGE \
SPEC_DECODING \
MODEL_PATH \
PREFILL_NUM_WORKERS \
PREFILL_TP \
PREFILL_EP \
PREFILL_DP_ATTN \
DECODE_NUM_WORKERS \
DECODE_TP \
DECODE_EP \
DECODE_DP_ATTN \
PREFILL_NODES \
DECODE_NODES \
RANDOM_RANGE_RATIO

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

set -x

# Use upstreamed multi_node scripts (no external clone needed)
cd "$GITHUB_WORKSPACE/benchmarks/multi_node/amd_utils" || exit 1

# Set up SGL launch script-specific environment variables
export TIME_LIMIT="08:00:00"
export MODEL_PATH=$MODEL_PATH
export MODEL_NAME=$MODEL_NAME
export CONTAINER_IMAGE=$IMAGE

if [[ "${PREFILL_EP:-1}" -eq 1 ]]; then
export PREFILL_ENABLE_EP=false
else
export PREFILL_ENABLE_EP=true
fi

if [[ "$PREFILL_DP_ATTN" == "true" ]]; then
export PREFILL_ENABLE_DP=true
else
export PREFILL_ENABLE_DP=false
fi

if [[ "${DECODE_EP:-1}" -eq 1 ]]; then
export DECODE_ENABLE_EP=false
else
export DECODE_ENABLE_EP=true
fi

if [[ "$DECODE_DP_ATTN" == "true" ]]; then
export DECODE_ENABLE_DP=true
else
export DECODE_ENABLE_DP=false
fi

# Launch jobs based on ISL/OSL
# Replace ' ' in CONC_LIST with 'x' such that the concurrency list is represented
# by a list of numbers delimited by 'x'. This is because of how the underlying launch script
# expects the concurrencies.
JOB_ID=$(bash ./submit.sh $PREFILL_NODES \
$PREFILL_NUM_WORKERS \
$DECODE_NODES \
$DECODE_NUM_WORKERS \
$ISL $OSL "${CONC_LIST// /x}" inf \
${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \
${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \
${PREFILL_TP} ${DECODE_TP} \
${RANDOM_RANGE_RATIO})

if [[ $? -ne 0 ]]; then
echo "Failed to submit job" >&2
exit 1
fi

echo "$JOB_ID"
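The `"${CONC_LIST// /x}"` expression in the submit call above is a bash global substitution: every space in the concurrency list is replaced with `x`, producing the delimiter format the underlying launch script expects. A standalone illustration with sample values:

```shell
#!/usr/bin/env bash
# ${var// /x} replaces ALL spaces (note the double slash) with 'x',
# turning a space-separated list into an x-delimited one.
CONC_LIST="256 128 64 32"
echo "${CONC_LIST// /x}"   # prints: 256x128x64x32
```

A single slash (`${CONC_LIST/ /x}`) would replace only the first space, which is why the double-slash form is used.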