
Fix topk inference performance reduce #6474

Merged
zhyncs merged 2 commits into sgl-project:main from lambert0312:fix_topk_inference_performance_reduce on May 23, 2025

Conversation

lambert0312 (Contributor) commented May 21, 2025

Motivation

When the following logic is added to topk.py, inference performance degrades significantly:

torch.compile(
    _mask_topk_ids_padded_region, dynamic=True, backend=get_compiler_backend()
)(topk_ids, num_token_non_padded)

Run command:

python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-0324 --trust-remote-code --host 0.0.0.0 --port 30000 --attention-backend flashinfer --n-share-experts-fusion 16 --tp 16 --dist-init-addr IP:20000 --nnodes 2 --node-rank 0

Ref: #6175

Modifications

Add a check on num_token_non_padded: if it is None, return the previous result directly, skipping the compiled masking step entirely.
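A minimal sketch of the guard described above, using pure-Python stand-ins (the real sglang code operates on torch tensors and wraps the masking kernel in torch.compile; the function names and masking body here are assumptions for illustration, not the actual source):

```python
def mask_topk_ids_padded_region(topk_ids, num_token_non_padded):
    # Illustrative masking: invalidate expert ids for padded rows.
    # (The real kernel is the torch.compile-wrapped function in topk.py.)
    for i in range(num_token_non_padded, len(topk_ids)):
        topk_ids[i] = [-1] * len(topk_ids[i])
    return topk_ids

def maybe_mask_padded_region(topk_ids, num_token_non_padded=None):
    # The fix: when num_token_non_padded is None there is no padding to
    # mask, so return the previous result immediately instead of paying
    # the overhead of the compiled masking path.
    if num_token_non_padded is None:
        return topk_ids
    return mask_topk_ids_padded_region(topk_ids, num_token_non_padded)
```

The early return means the common unpadded case never touches the compiled function, which is the source of the regression reported here.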

@zhyncs zhyncs merged commit 2f42749 into sgl-project:main May 23, 2025
0 of 37 checks passed
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
@lambert0312 lambert0312 deleted the fix_topk_inference_performance_reduce branch June 20, 2025 07:14