[Bug] kimi-k2+sglang 128*H200 failed: Device or resource busy #12454

@XiaobinZhao

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/?linkId=100000374601795

I tried to deploy Kimi K2 on 128 H200 GPUs by following this blog post. The released version of OME (0.1.3) seems to have issues with using pre-existing local models, so I extracted the sglang startup commands and ran them manually.

# sglang
docker run -itd --privileged --ipc=host --gpus all --network=host --ulimit memlock=-1 --ulimit stack=67108864 -v /llm:/root/.cache/huggingface  --name sglang_ws 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake  bash

# decode 12nodes
MC_TE_METRIC=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905 --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode decode --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.32:7000 --moe-dense-tp-size 1 --enable-dp-lm-head  --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --cuda-graph-max-bs 64 --decode-log-interval 1 --max-running-requests 46080 --disable-radix-cache --disable-shared-experts-fusion  --decode-log-interval 1 --enable-two-batch-overlap --enable-eplb --ep-num-redundant-experts 96 --tp-size 96 --dp-size 96  --mem-fraction-static 0.6 --nnodes 12 --node-rank 0

# prefill 4nodes
PYTHONUNBUFFERED=1 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905  --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode prefill --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.31:7000 --moe-dense-tp-size 1 --enable-dp-lm-head  --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-shared-experts-fusion --decode-log-interval 1 --max-running-requests 1024 --disable-radix-cache --enable-two-batch-overlap --enable-eplb --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --ep-num-redundant-experts 32 --max-total-tokens 131072 --tp-size 32 --dp-size 32  --mem-fraction-static 0.849 --nnodes 4 --node-rank 0
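
As a sanity check on the two commands above, the parallelism flags can be compared against the node counts. The sketch below assumes 8 GPUs per node (not stated in this report; inferred from 128 GPUs spread over 12 + 4 nodes). It also assumes that `--tp-size` counts GPUs across all nodes of a role and that `--dp-size` should divide `--tp-size`. The `Role` class and the `check_role` helper are hypothetical names used only for illustration; they are not part of sglang.

```python
from dataclasses import dataclass
from typing import List

GPUS_PER_NODE = 8  # assumption: inferred from 128 GPUs over 12 + 4 nodes


@dataclass(frozen=True)
class Role:
    """Hypothetical container for the flags of one launch command."""
    name: str
    nnodes: int   # value passed to --nnodes
    tp_size: int  # value passed to --tp-size
    dp_size: int  # value passed to --dp-size


def check_role(role: Role, gpus_per_node: int = GPUS_PER_NODE) -> List[str]:
    """Return a list of mismatches between the flags and the node count."""
    problems = []
    world_size = role.nnodes * gpus_per_node
    # Assumed rule: tp-size spans every GPU of the role.
    if role.tp_size != world_size:
        problems.append(
            f"{role.name}: tp-size {role.tp_size} != {world_size} GPUs"
        )
    # Assumed rule: dp-size must evenly divide tp-size.
    if role.dp_size <= 0 or role.tp_size % role.dp_size != 0:
        problems.append(
            f"{role.name}: dp-size {role.dp_size} does not divide "
            f"tp-size {role.tp_size}"
        )
    return problems


# Values copied from the decode and prefill commands above.
DECODE = Role("decode", nnodes=12, tp_size=96, dp_size=96)
PREFILL = Role("prefill", nnodes=4, tp_size=32, dp_size=32)

if __name__ == "__main__":
    for role in (DECODE, PREFILL):
        print(role.name, check_role(role) or "flags are consistent")
    total = (DECODE.nnodes + PREFILL.nnodes) * GPUS_PER_NODE
    print("total GPUs:", total)
```

Under these assumptions both roles pass the check and the total comes to 128 GPUs, which suggests the flag arithmetic itself is not the source of the failure.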

The launch failed with the following output:

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed


sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 7
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 0
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1508:3880 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[2025-10-31 01:34:56] DataParallelController hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 352, in run_data_parallel_controller_process
    controller = DataParallelController(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 106, in __init__
    dp_port_args = self.launch_dp_attention_schedulers(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 181, in launch_dp_attention_schedulers
    self.launch_tensor_parallel_group(server_args, port_args, 0, None)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 264, in launch_tensor_parallel_group
    scheduler_info.append(scheduler_pipe_readers[i].recv())
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
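
A note on reading the traceback: the `EOFError` is raised by `recv()` on a `multiprocessing` connection, which happens when the other end of the pipe is closed before anything was sent. That would be consistent with the scheduler processes exiting first (for example after the NVSHMEM/CUDA errors above), making the `EOFError` a secondary symptom. The following minimal sketch reproduces only the pipe behavior with the standard library; the function name is hypothetical and nothing here uses sglang code.

```python
from multiprocessing import Pipe


def recv_after_writer_closed() -> str:
    """Return the name of the exception raised once the writer end is gone."""
    reader, writer = Pipe(duplex=False)
    # Simulate a worker that died before reporting its startup info:
    # the write end is closed without anything having been sent.
    writer.close()
    try:
        reader.recv()
    except EOFError as exc:
        return type(exc).__name__
    finally:
        reader.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_writer_closed())
```

If this reading is right, the first error emitted by the worker ranks is the more useful one to investigate, rather than the `DataParallelController` traceback.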

I tried several versions of sglang (0.4.10, 0.5.3, and 0.5.4-post1), but all of them failed.

What am I doing wrong?

Reproduction

  1. sglang 0.4.10
  2. Run the commands shown above.

Environment

  1. sglang 0.4.10
  2. 128 × H200 GPUs
