
Use apply_rope_with_cos_sin_cache_inplace for DeepSeek#4764

Merged
zhyncs merged 5 commits into sgl-project:main from strgrb:dev/fused_rope
Mar 27, 2025

Conversation

@strgrb
Collaborator

@strgrb strgrb commented Mar 25, 2025

Motivation

It seems that the rope kernel in flashinfer can be applied directly to DeepSeek, which fuses several torch kernels into one for the CUDA backend.
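To make the fusion concrete, here is a minimal pure-Python sketch of what a rope (rotary position embedding) kernel computes, neox-style (dimension i paired with dimension i + d/2). This is illustrative only, not the flashinfer implementation; the fused kernel performs the same rotation for query and key in a single in-place pass instead of several separate torch ops.

```python
import math

# Illustrative unfused rope: pairs dim i with dim i + d/2 (neox-style)
# and rotates each pair by a position-dependent angle.
def apply_rope(x, pos, base=10000.0):
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos / (base ** (2 * i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

# At position 0 every rotation angle is zero, so the output equals the input.
print(apply_rope([1.0, 2.0, 3.0, 4.0], pos=0))  # [1.0, 2.0, 3.0, 4.0]
```

Because each rotation is a 2D orthogonal transform, the vector norm is preserved at every position, which is one reason rope can be applied in place.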

Modifications

I renamed DeepseekScalingRotaryEmbedding.forward to DeepseekScalingRotaryEmbedding.forward_native, so the parent class's forward_cuda is used instead, which ultimately calls apply_rope_with_cos_sin_cache_inplace.
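The dispatch pattern behind this rename can be sketched as follows. The class and method names mirror sglang's rotary-embedding layers, but the method bodies here are illustrative stand-ins, not the real implementation:

```python
class RotaryEmbedding:
    # Parent class: forward() dispatches to the CUDA path, which in sglang
    # would call the fused flashinfer kernel
    # apply_rope_with_cos_sin_cache_inplace.
    def forward(self, *args):
        return self.forward_cuda(*args)

    def forward_cuda(self, *args):
        return "fused flashinfer rope"

    def forward_native(self, *args):
        return "unfused torch rope"


class DeepseekScalingRotaryEmbedding(RotaryEmbedding):
    # Before this PR the subclass defined its own `forward`, shadowing the
    # parent's dispatch and forcing the unfused path. Renaming that method
    # to `forward_native` lets the inherited `forward` reach the fused
    # CUDA path while keeping the native fallback available.
    def forward_native(self, *args):
        return "deepseek-specific unfused torch rope"


rope = DeepseekScalingRotaryEmbedding()
print(rope.forward())  # fused flashinfer rope
```

The key point is that only the method name changes; the subclass's native implementation is kept as a fallback rather than deleted.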

The following results were benchmarked on 8×H20 GPUs, compiled with CUDA 12.8.

  • server:
python -m sglang.launch_server --served-model-name auto --port 11086 \
        --model-path /home/admin/DeepSeek-R1/ --tp 8 \
        --mem-fraction-static 0.96 --trust-remote-code --dtype auto --kv-cache-dtype auto \
        --context-length 32768 --enable-cache-report --log-level info --tensor-parallel-size 8 --max-running-requests 48 --quantization fp8 --chunked-prefill-size 2048 --disable-radix-cache \
        --enable-flashinfer-mla 
  • accuracy test:
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 128 --num-shots 8 --port 11086 --data-path ../test.jsonl
Accuracy: 0.954
Invalid: 0.000
Latency: 410.744 s
Output throughput: 327.718 token/s
  • benchmark:
python -m sglang.bench_serving --backend sglang --port $PORT --random-range-ratio 1.0 --dataset-name random --model $MODEL_PATH --num-prompts $NUM_PROMPTS --max-concurrency $BATCH --random-input-len $INPUT_LEN --random-output-len $OUTPUT_LEN --dataset-path $DATASET
  • result before:
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 16, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 516.5919100739993, "completed": 144, "total_input_tokens": 576000, "total_output_tokens": 144000, "total_output_tokens_retokenized": 143504, "request_throughput": 0.2787500098082695, "input_throughput": 1115.000039233078, "output_throughput": 278.7500098082695, "mean_e2e_latency_ms": 57389.57438097633, "median_e2e_latency_ms": 57418.031494598836, "std_e2e_latency_ms": 227.86403482450487, "p99_e2e_latency_ms": 57802.51287078019, "mean_ttft_ms": 5598.557771244992, "median_ttft_ms": 5878.1934613361955, "std_ttft_ms": 2890.718799358219, "p99_ttft_ms": 10001.823026114143, "mean_tpot_ms": 51.84285946920054, "median_tpot_ms": 51.700865166742844, "std_tpot_ms": 2.9046915102650357, "p99_tpot_ms": 56.73462425342279, "mean_itl_ms": 51.84357542675082, "median_itl_ms": 47.52367036417127, "std_itl_ms": 165.47613664240282, "p95_itl_ms": 49.84061436261982, "p99_itl_ms": 51.81014743633568, "concurrency": 15.997344421589565, "accept_length": null}
  • result after:
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 16, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 491.06120979599655, "completed": 144, "total_input_tokens": 576000, "total_output_tokens": 144000, "total_output_tokens_retokenized": 143442, "request_throughput": 0.29324246576067875, "input_throughput": 1172.969863042715, "output_throughput": 293.2424657606787, "mean_e2e_latency_ms": 54554.430641402076, "median_e2e_latency_ms": 54603.0342138838, "std_e2e_latency_ms": 288.1696831787882, "p99_e2e_latency_ms": 55143.40737964492, "mean_ttft_ms": 5342.889874468609, "median_ttft_ms": 5581.580740166828, "std_ttft_ms": 2759.9641624639858, "p99_ttft_ms": 9731.60787946079, "mean_tpot_ms": 49.26080156850197, "median_tpot_ms": 49.08072854228154, "std_tpot_ms": 2.776873240763932, "p99_tpot_ms": 53.85334397982656, "mean_itl_ms": 49.261481463359885, "median_itl_ms": 45.09735247120261, "std_itl_ms": 159.1181666153573, "p95_itl_ms": 47.3682118114084, "p99_itl_ms": 49.84505341388285, "concurrency": 15.997675759454673, "accept_length": null}

Roughly a 5% gain across most of the metrics.
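The ~5% figure can be verified directly from the before/after JSON above, e.g. using output throughput and mean TPOT:

```python
# Values copied from the before/after benchmark JSON in this PR.
before = {"output_throughput": 278.7500098082695, "mean_tpot_ms": 51.84285946920054}
after = {"output_throughput": 293.2424657606787, "mean_tpot_ms": 49.26080156850197}

# Relative improvement: throughput goes up, per-token latency goes down.
thr_gain = (after["output_throughput"] / before["output_throughput"] - 1) * 100
tpot_gain = (1 - after["mean_tpot_ms"] / before["mean_tpot_ms"]) * 100
print(f"output throughput: +{thr_gain:.1f}%")  # +5.2%
print(f"mean TPOT:         -{tpot_gain:.1f}%")  # -5.0%
```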

Checklist

Zhang Kaihong added 2 commits March 25, 2025 18:14
@zhyncs zhyncs merged commit 886fcbd into sgl-project:main Mar 27, 2025
34 of 40 checks passed
@zhyncs
Collaborator

zhyncs commented Mar 27, 2025

Thanks for this improvement!

guoyuhong added a commit to guoyuhong/sglang that referenced this pull request Apr 1, 2025
@zhyncs zhyncs mentioned this pull request Apr 14, 2025
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025

Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
