
Use apply_rope_with_cos_sin_cache_inplace for DeepSeek#4764

Merged
zhyncs merged 5 commits into sgl-project:main from strgrb:dev/fused_rope
Mar 27, 2025

Conversation

@strgrb
Collaborator

@strgrb strgrb commented Mar 25, 2025

Motivation

It seems that the rope kernel in flashinfer can be applied directly to DeepSeek, which fuses several torch kernels into one for the CUDA backend.
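To make the fusion concrete, here is a minimal pure-Python sketch of what a rope (rotary position embedding) kernel computes, neox-style (dimension i paired with dimension i + d/2). This is illustrative only, not the flashinfer implementation; the fused kernel performs the same rotation for query and key in a single in-place pass instead of several separate torch ops.

```python
import math

# Illustrative unfused rope: pairs dim i with dim i + d/2 (neox-style)
# and rotates each pair by a position-dependent angle.
def apply_rope(x, pos, base=10000.0):
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos / (base ** (2 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

# At position 0 every rotation angle is zero, so the output equals the input.
print(apply_rope([1.0, 2.0, 3.0, 4.0], pos=0))  # [1.0, 2.0, 3.0, 4.0]
```

Because each rotation is a 2D orthogonal transform, the vector norm is preserved at every position, which is one reason rope can be applied in place.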

Modifications

I renamed DeepseekScalingRotaryEmbedding.forward to DeepseekScalingRotaryEmbedding.forward_native, so the parent class's forward_cuda is used instead, which ultimately calls apply_rope_with_cos_sin_cache_inplace.
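The dispatch pattern behind this rename can be sketched as follows. The class and method names mirror sglang's rotary-embedding layers, but the method bodies here are illustrative stand-ins, not the real implementation:

```python
class RotaryEmbedding:
    # Parent class: forward() dispatches to the CUDA path, which in sglang
    # would call the fused flashinfer kernel
    # apply_rope_with_cos_sin_cache_inplace.
    def forward(self, *args):
        return self.forward_cuda(*args)

    def forward_cuda(self, *args):
        return "fused flashinfer rope"

    def forward_native(self, *args):
        return "unfused torch rope"


class DeepseekScalingRotaryEmbedding(RotaryEmbedding):
    # Before this PR the subclass defined its own `forward`, shadowing the
    # parent's dispatch and forcing the unfused path. Renaming that method
    # to `forward_native` lets the inherited `forward` reach the fused
    # CUDA path while keeping the native fallback available.
    def forward_native(self, *args):
        return "deepseek-specific unfused torch rope"


rope = DeepseekScalingRotaryEmbedding()
print(rope.forward())  # fused flashinfer rope
```

The key point is that only the method name changes; the subclass's native implementation is kept as a fallback rather than deleted.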

The following results were benchmarked on 8×H20 GPUs, compiled with CUDA 12.8.

  • server:
python -m sglang.launch_server --served-model-name auto --port 11086 \
        --model-path /home/admin/DeepSeek-R1/ --tp 8 \
        --mem-fraction-static 0.96 --trust-remote-code --dtype auto --kv-cache-dtype auto \
        --context-length 32768 --enable-cache-report --log-level info --tensor-parallel-size 8 --max-running-requests 48 --quantization fp8 --chunked-prefill-size 2048 --disable-radix-cache \
        --enable-flashinfer-mla 
  • accuracy test:
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 128 --num-shots 8 --port 11086 --data-path ../test.jsonl
Accuracy: 0.954
Invalid: 0.000
Latency: 410.744 s
Output throughput: 327.718 token/s
  • benchmark:
python -m sglang.bench_serving --backend sglang --port $PORT --random-range-ratio 1.0 --dataset-name random --model $MODEL_PATH --num-prompts $NUM_PROMPTS --max-concurrency $BATCH --random-input-len $INPUT_LEN --random-output-len $OUTPUT_LEN --dataset-path $DATASET
  • result before:
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 16, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 516.5919100739993, "completed": 144, "total_input_tokens": 576000, "total_output_tokens": 144000, "total_output_tokens_retokenized": 143504, "request_throughput": 0.2787500098082695, "input_throughput": 1115.000039233078, "output_throughput": 278.7500098082695, "mean_e2e_latency_ms": 57389.57438097633, "median_e2e_latency_ms": 57418.031494598836, "std_e2e_latency_ms": 227.86403482450487, "p99_e2e_latency_ms": 57802.51287078019, "mean_ttft_ms": 5598.557771244992, "median_ttft_ms": 5878.1934613361955, "std_ttft_ms": 2890.718799358219, "p99_ttft_ms": 10001.823026114143, "mean_tpot_ms": 51.84285946920054, "median_tpot_ms": 51.700865166742844, "std_tpot_ms": 2.9046915102650357, "p99_tpot_ms": 56.73462425342279, "mean_itl_ms": 51.84357542675082, "median_itl_ms": 47.52367036417127, "std_itl_ms": 165.47613664240282, "p95_itl_ms": 49.84061436261982, "p99_itl_ms": 51.81014743633568, "concurrency": 15.997344421589565, "accept_length": null}
  • result after:
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 16, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 491.06120979599655, "completed": 144, "total_input_tokens": 576000, "total_output_tokens": 144000, "total_output_tokens_retokenized": 143442, "request_throughput": 0.29324246576067875, "input_throughput": 1172.969863042715, "output_throughput": 293.2424657606787, "mean_e2e_latency_ms": 54554.430641402076, "median_e2e_latency_ms": 54603.0342138838, "std_e2e_latency_ms": 288.1696831787882, "p99_e2e_latency_ms": 55143.40737964492, "mean_ttft_ms": 5342.889874468609, "median_ttft_ms": 5581.580740166828, "std_ttft_ms": 2759.9641624639858, "p99_ttft_ms": 9731.60787946079, "mean_tpot_ms": 49.26080156850197, "median_tpot_ms": 49.08072854228154, "std_tpot_ms": 2.776873240763932, "p99_tpot_ms": 53.85334397982656, "mean_itl_ms": 49.261481463359885, "median_itl_ms": 45.09735247120261, "std_itl_ms": 159.1181666153573, "p95_itl_ms": 47.3682118114084, "p99_itl_ms": 49.84505341388285, "concurrency": 15.997675759454673, "accept_length": null}

Roughly a 5% gain across most of the metrics.
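The ~5% figure can be verified directly from the before/after JSON above, e.g. using output throughput and mean TPOT:

```python
# Values copied from the before/after benchmark JSON in this PR.
before = {"output_throughput": 278.7500098082695, "mean_tpot_ms": 51.84285946920054}
after = {"output_throughput": 293.2424657606787, "mean_tpot_ms": 49.26080156850197}

# Relative improvement: throughput goes up, per-token latency goes down.
thr_gain = (after["output_throughput"] / before["output_throughput"] - 1) * 100
tpot_gain = (1 - after["mean_tpot_ms"] / before["mean_tpot_ms"]) * 100
print(f"output throughput: +{thr_gain:.1f}%")  # +5.2%
print(f"mean TPOT:         -{tpot_gain:.1f}%")  # -5.0%
```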

Checklist

Zhang Kaihong added 2 commits March 25, 2025 18:14
@zhyncs zhyncs merged commit 886fcbd into sgl-project:main Mar 27, 2025
34 of 40 checks passed
@zhyncs
Collaborator

zhyncs commented Mar 27, 2025

Thanks for this improvement!

guoyuhong added a commit to guoyuhong/sglang that referenced this pull request Apr 1, 2025
@zhyncs zhyncs mentioned this pull request Apr 14, 2025
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
