Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/?linkId=100000374601795
I tried deploying Kimi K2 on 128×H200 following this blog. Since the released version of OME (0.1.3) seems to have issues with using pre-existing local models, I extracted the sglang startup commands and ran them manually.
# sglang
docker run -itd --privileged --ipc=host --gpus all --network=host --ulimit memlock=-1 --ulimit stack=67108864 -v /llm:/root/.cache/huggingface --name sglang_ws 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake bash
# decode 12nodes
MC_TE_METRIC=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905 --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode decode --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.32:7000 --moe-dense-tp-size 1 --enable-dp-lm-head --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --cuda-graph-max-bs 64 --decode-log-interval 1 --max-running-requests 46080 --disable-radix-cache --disable-shared-experts-fusion --decode-log-interval 1 --enable-two-batch-overlap --enable-eplb --ep-num-redundant-experts 96 --tp-size 96 --dp-size 96 --mem-fraction-static 0.6 --nnodes 12 --node-rank 0
# prefill 4nodes
PYTHONUNBUFFERED=1 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905 --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode prefill --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.31:7000 --moe-dense-tp-size 1 --enable-dp-lm-head --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-shared-experts-fusion --decode-log-interval 1 --max-running-requests 1024 --disable-radix-cache --enable-two-batch-overlap --enable-eplb --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --ep-num-redundant-experts 32 --max-total-tokens 131072 --tp-size 32 --dp-size 32 --mem-fraction-static 0.849 --nnodes 4 --node-rank 0
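For context, the parallelism settings in the two commands match the 128×H200 total from the blog: 12 decode nodes and 4 prefill nodes with 8 GPUs each. A small sketch of that arithmetic (my own check; the numbers are taken directly from the flags above):
# python
GPUS_PER_NODE = 8  # assumed: 8 GPUs per H200 node

roles = {
    "decode":  {"nnodes": 12, "tp_size": 96, "dp_size": 96},
    "prefill": {"nnodes": 4,  "tp_size": 32, "dp_size": 32},
}

total_gpus = 0
for name, cfg in roles.items():
    world = cfg["nnodes"] * GPUS_PER_NODE          # ranks in this role
    assert cfg["tp_size"] == world                 # --tp-size spans every GPU in the role
    assert cfg["dp_size"] == cfg["tp_size"]        # in these commands dp-size equals tp-size
    total_gpus += world
    print(f"{name}: {world} GPUs")

print("total:", total_gpus, "GPUs")                # 96 + 32 = 128 H200s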
However, the launch failed with:
[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 7
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 0
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1508:3880 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[2025-10-31 01:34:56] DataParallelController hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 352, in run_data_parallel_controller_process
controller = DataParallelController(
File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 106, in __init__
dp_port_args = self.launch_dp_attention_schedulers(server_args, port_args)
File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 181, in launch_dp_attention_schedulers
self.launch_tensor_parallel_group(server_args, port_args, 0, None)
File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 264, in launch_tensor_parallel_group
scheduler_info.append(scheduler_pipe_readers[i].recv())
File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
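My reading of the traceback (please correct me if I am wrong): the scheduler subprocess is killed by the nvshmem illegal-memory-access, so the controller's pipe recv() hits EOF before any scheduler_info is sent; the bare EOFError is just the symptom. A minimal illustration of that failure mode (my own sketch, not sglang code):
# python
import multiprocessing as mp

def scheduler(writer):
    # Simulate the scheduler process dying before it sends its ready message.
    raise SystemExit(1)

if __name__ == "__main__":
    reader, writer = mp.Pipe(duplex=False)
    p = mp.Process(target=scheduler, args=(writer,))
    p.start()
    writer.close()      # parent drops its write end so recv() can observe EOF
    try:
        reader.recv()   # same call as scheduler_pipe_readers[i].recv()
    except EOFError:
        print("child exited before sending -> EOFError, as in the traceback")
    p.join()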
I tried several sglang versions (0.4.10, 0.5.3, 0.5.4-post1), but all of them failed. What am I doing wrong?
Reproduction
- sglang 0.4.10
- Run the commands listed above.
Environment
- sglang 0.4.10
- 128 × H200 GPUs