[PD] bug fix: Update status if nixl receiver sends a dummy req. #6720

Merged
zhyncs merged 3 commits into sgl-project:main from bytedance-iaas:fix-dummy-request
May 29, 2025

@thesues (Contributor) commented May 28, 2025

@trevor-m
@jokerwyt

With PD disaggregation, if the decode server's tp-size is smaller than the prefill server's tp-size, the prefill server throws this exception:

Exception in thread Thread-5 (bootstrap_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang_deepep/python/sglang/srt/disaggregation/nixl/conn.py", line 351, in bootstrap_thread
    required_dst_info_num = int(waiting_req_bytes[10].decode("ascii"))
IndexError: list index out of range

The nixl prefill manager should detect the dummy message by checking how many parts waiting_req_bytes contains, and update the request status to WaitingForInput. I tested the fix successfully with these commands:

PREFILL/TP4

python -m sglang.launch_server --model-path /data/models/DeepSeek-V2-Lite/ --port 40001 --host 127.0.0.1  --tp-size 4  --trust-remote-code --disaggregation-mode prefill --page-size 32 --base-gpu-id 0 --disaggregation-transfer-backend nixl --disable-radix-cache 

DECODE/TP2

python -m sglang.launch_server --model-path /data/models/DeepSeek-V2-Lite/ --port 40100 --host 127.0.0.1 --tp-size 2 --trust-remote-code --disaggregation-mode decode --page-size 32 --base-gpu-id 4 --disaggregation-transfer-backend nixl --disable-radix-cache
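
For reviewers who want the idea without opening the diff, the length check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Status` enum, the `handle_bootstrap_message` helper, the `statuses` dict, and the assumption that part 0 carries the room id are all hypothetical. The only detail taken from the traceback is that a full message is read at index 10, so it needs at least 11 parts.

```python
from enum import Enum, auto


class Status(Enum):
    # Hypothetical stand-in for the real poll-status type.
    Bootstrapping = auto()
    WaitingForInput = auto()


# The traceback reads waiting_req_bytes[10], so a full message
# must contain at least 11 parts.
REQUIRED_DST_INFO_IDX = 10


def handle_bootstrap_message(waiting_req_bytes, statuses):
    """Return required_dst_info_num, or None if the message is a dummy.

    Assumption (not from the PR): part 0 is the ASCII-encoded room id.
    """
    room = int(waiting_req_bytes[0].decode("ascii"))
    if len(waiting_req_bytes) <= REQUIRED_DST_INFO_IDX:
        # Too short to be a full request: treat it as a dummy message
        # and only update the status instead of indexing past the end.
        statuses[room] = Status.WaitingForInput
        return None
    return int(waiting_req_bytes[REQUIRED_DST_INFO_IDX].decode("ascii"))
```

With this shape, a short dummy message updates the status and returns early, while a full message is parsed as before.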
