[Bug] kimi-k2+sglang 128*H200 failed: Device or resource busy

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

I tried to reproduce the 128*H200 Kimi K2 deployment described in this blog post: https://lmsys.org/blog/2025-07-20-k2-large-scale-ep/?linkId=100000374601795

The released version of OME (0.1.3) seems to have problems using pre-existing local models, so I extracted the sglang startup commands and ran them manually.

```
# sglang
docker run -itd --privileged --ipc=host --gpus all --network=host --ulimit memlock=-1 --ulimit stack=67108864 -v /llm:/root/.cache/huggingface  --name sglang_ws 10.24.10.61:20405/sglang:v0.4.10-deepseek3.1-0822-my-re_mooncake  bash

# decode: 12 nodes (command shown for node rank 0)
MC_TE_METRIC=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=480 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905 --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode decode --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.32:7000 --moe-dense-tp-size 1 --enable-dp-lm-head  --enable-dp-attention --enable-deepep-moe --deepep-mode low_latency --cuda-graph-max-bs 64 --decode-log-interval 1 --max-running-requests 46080 --disable-radix-cache --disable-shared-experts-fusion  --decode-log-interval 1 --enable-two-batch-overlap --enable-eplb --ep-num-redundant-experts 96 --tp-size 96 --dp-size 96  --mem-fraction-static 0.6 --nnodes 12 --node-rank 0

# prefill: 4 nodes (command shown for node rank 0)
PYTHONUNBUFFERED=1 NCCL_SOCKET_IFNAME="bond4" GLOO_SOCKET_IFNAME="bond4" python3 -m sglang.launch_server --model-path /llm/moonshotai/Kimi-K2-Instruct-0905  --trust-remote-code --host 0.0.0.0 --port 9000 --disaggregation-mode prefill --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_9 --dist-init-addr 10.24.8.31:7000 --moe-dense-tp-size 1 --enable-dp-lm-head  --enable-dp-attention --enable-deepep-moe --deepep-mode normal --disable-cuda-graph --disable-shared-experts-fusion --decode-log-interval 1 --max-running-requests 1024 --disable-radix-cache --enable-two-batch-overlap --enable-eplb --ep-dispatch-algorithm dynamic --eplb-algorithm deepseek --ep-num-redundant-experts 32 --max-total-tokens 131072 --tp-size 32 --dp-size 32  --mem-fraction-static 0.849 --nnodes 4 --node-rank 0
```
The launch failed with the following output:
```
[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed

[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed


sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 7
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 1
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 4
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1512:10316 [0] NCCL INFO [Service thread] Connection closed by localRank 0
sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1508:3880 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[2025-10-31 01:34:56] DataParallelController hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 352, in run_data_parallel_controller_process
    controller = DataParallelController(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 106, in __init__
    dp_port_args = self.launch_dp_attention_schedulers(server_args, port_args)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 181, in launch_dp_attention_schedulers
    self.launch_tensor_parallel_group(server_args, port_args, 0, None)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/data_parallel_controller.py", line 264, in launch_tensor_parallel_group
    scheduler_info.append(scheduler_pipe_readers[i].recv())
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
```
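In the excerpt above, the NVSHMEM "illegal memory access" line is printed before the "Device or resource busy" message, and the `EOFError` from the `DataParallelController` comes last. When the output of many ranks is interleaved, a small script can list the error-like lines in log order so the earliest one is easy to spot. This is only a sketch: the patterns and the `error_lines` helper are hypothetical and chosen from the messages in this log, not part of sglang.

```python
import re

# Illustrative patterns taken from the messages in the log above (assumption:
# these are enough to tell error lines from NCCL INFO noise in this log).
ERROR_PATTERNS = [
    re.compile(r"cuda failed with"),
    re.compile(r"non-zero status: \d+"),
    re.compile(r"hit an exception"),
    re.compile(r"^\w*Error\b"),
]


def error_lines(log_text):
    """Return (line_number, line) for every line matching a pattern, in log order."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(p.search(line) for p in ERROR_PATTERNS):
            hits.append((lineno, line))
    return hits


# Sample lines copied from the log above.
sample_log = "\n".join([
    "[/sgl-workspace/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered",
    "/sgl-workspace/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed",
    "sglang-1p1d-roleset-jqfh5-decode-86db8c5d6-0-0:1506:3886 [0] NCCL INFO [Service thread] Connection closed by localRank 7",
    "[2025-10-31 01:34:56] DataParallelController hit an exception: Traceback (most recent call last):",
    "EOFError",
])

hits = error_lines(sample_log)
first_lineno, first_line = hits[0]
print(first_lineno, first_line)
```

On the sample lines, the first hit is the `barrier.cpp` line, which suggests looking at the illegal memory access first rather than at the later "Device or resource busy" or `EOFError` messages.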
I tried several versions of sglang (0.4.10, 0.5.3, and 0.5.4-post1), and all of them failed.

What am I doing wrong?

### Reproduction

1. sglang 0.4.10
2. Run the decode and prefill commands shown in the description above.
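
The node counts and parallel sizes in those commands can be cross-checked with a short sketch. It assumes 8 GPUs per node (128 GPUs over 12 + 4 = 16 nodes) and that `--tp-size` is meant to equal the total GPU count of each role; both are inferred from the numbers, not stated by the commands. `check_role` is a hypothetical helper, not an sglang API.

```python
# Assumption: 8 GPUs per node (128 GPUs / 16 nodes); not stated in the commands.
GPUS_PER_NODE = 8

# Values copied from the decode and prefill launch commands.
roles = {
    "decode": {"nnodes": 12, "tp_size": 96, "dp_size": 96},
    "prefill": {"nnodes": 4, "tp_size": 32, "dp_size": 32},
}


def check_role(name, cfg, gpus_per_node=GPUS_PER_NODE):
    """Return the GPU count for a role, raising if tp_size does not match it."""
    gpus = cfg["nnodes"] * gpus_per_node
    if cfg["tp_size"] != gpus:
        raise ValueError(f"{name}: tp_size {cfg['tp_size']} != {gpus} GPUs")
    return gpus


total_gpus = sum(check_role(name, cfg) for name, cfg in roles.items())
print(total_gpus)  # 128
```

With these values both roles pass the check and the total comes to 128 GPUs, so the sizes are at least self-consistent under the stated assumptions.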

### Environment

1. sglang 0.4.10
2. 128*H200 (12 decode nodes + 4 prefill nodes)


[Bug] kimi-k2+sglang 128*H200 failed: Device or resource busy #12454
