
[Generative Score API] Optimization to Remove Decode#1

Closed
sundar24295s wants to merge 1 commit into main from suramach/removedecode

Conversation

Owner

@sundar24295s sundar24295s commented Aug 5, 2025

🚀 Motivation

  • This PR is a follow-up to the Decoder-only Scoring API introduced in PR #6460, which was initially proposed and discussed in Issue #5973.
  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.

🔧 Modifications

  • Introduces dedicated objects for the Scoring API to simplify the implementation and enable targeted optimizations.

  • Remove Decode Optimization:
    Scoring workloads typically need only log probabilities at specific token positions and involve no token generation. This PR removes the decode phase entirely for such workloads, yielding significantly lower latency and higher throughput.

  • The new code structure cleanly separates scoring-specific logic from the text generation path, ensuring that prefill-only optimizations do not interfere with generation behavior.
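To make the prefill-only idea concrete: a scoring request can read the item-token log probabilities straight out of a single prefill forward pass, with no decode step at all. The sketch below is a toy numpy illustration of that gather under teacher forcing (random logits and a hypothetical position layout — not the actual SGLang internals):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

# Toy prefill output: logits for each of T positions over a V-word vocab.
rng = np.random.default_rng(0)
T, V = 6, 10
logits = rng.normal(size=(T, V))
tokens = np.array([3, 1, 4, 1, 5, 9])  # prompt = query tokens + item tokens

# Score the item tokens (here, the last 3) under teacher forcing:
# the logits at position t predict token t+1, so the log-prob of the
# token at position p is read from the logits at position p-1.
logp = log_softmax(logits)
item_positions = np.arange(T - 3, T)
score = logp[item_positions - 1, tokens[item_positions]].sum()
```

Because every needed log-prob comes out of the one prefill pass, there is nothing left for a decode step to compute — which is the whole optimization.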

📈 Future Optimizations

This refactor lays the foundation for further improvements, including:

  • Memory Transfer Optimization:
    Move all CPU-GPU synchronization to the post-processing loop to better overlap run_batch with post-processing.

  • Multi-Item Scoring:
    Support scoring multiple items within a single prompt using custom attention masks via FlashInfer.
    → See flashinfer-ai/flashinfer#1015
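To make the multi-item direction concrete, here is a hedged sketch of the mask shape such a custom attention mask could take: a shared query prefix visible to every item, with each item attending causally only to itself plus the prefix and never to sibling items. The function name and layout are illustrative, not FlashInfer's actual API:

```python
import numpy as np

def multi_item_mask(query_len, item_lens):
    """Boolean attention mask: mask[i, j] == True means position j is
    visible to position i.

    Layout: [query tokens | item_1 tokens | item_2 tokens | ...].
    """
    total = query_len + sum(item_lens)
    mask = np.zeros((total, total), dtype=bool)
    # Causal attention within the shared query prefix.
    mask[:query_len, :query_len] = np.tril(
        np.ones((query_len, query_len), dtype=bool))
    start = query_len
    for n in item_lens:
        end = start + n
        mask[start:end, :query_len] = True  # item sees the full prefix
        mask[start:end, start:end] = np.tril(
            np.ones((n, n), dtype=bool))    # causal within the item
        start = end
    return mask

# Query of 2 tokens followed by two items of 2 and 3 tokens.
m = multi_item_mask(2, [2, 3])
```

With a mask like this, the query prefix is prefilled once and amortized across all items in the prompt, instead of being re-encoded per item.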

Breaking Changes

  • None for public APIs - the /v1/score endpoint maintains backward compatibility
  • Internal request handling has been refactored but doesn't affect external interfaces

Accuracy Test

  • TBA

Profiling

  • Updated Benchmark Scripts: Enhanced bench_score.py to properly test the new scoring pipeline
  • Before this change, a single request runs both `forward_extend` and `forward_decode` (first trace image). After this change, only `forward_extend` remains (second trace image).

🧪 Benchmark Comparison: Qwen3-0.5B on H100 (CUDA 12.8)

Setup:

  • Model: Qwen3-0.5B (pruned, decoder-only)
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70
  • Item Count: 10
  • Distribution: Poisson
  • Transport: HTTP
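The open-loop load pattern above (Poisson arrivals at a target RPS over a fixed duration) can be sketched as follows; `poisson_arrivals` is an illustrative helper, not part of bench_score.py:

```python
import numpy as np

def poisson_arrivals(target_rps, duration_s, seed=0):
    """Request send times (seconds) for an open-loop Poisson load generator.

    A Poisson process has exponentially distributed inter-arrival gaps
    with mean 1 / target_rps.
    """
    rng = np.random.default_rng(seed)
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / target_rps)
        if t >= duration_s:
            break
        times.append(t)
    return times

# Matches the setup above: 70 RPS for 120 s (~8400 requests expected).
arrivals = poisson_arrivals(target_rps=70, duration_s=120)
```

An open-loop generator like this keeps sending on schedule regardless of how slowly the server responds, which is what makes tail-latency percentiles (P90/P99) meaningful under load.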

Results

  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.
  • For batch size 10 and the same 300-token input, the table below compares the latency percentiles before and after the change.

🔍 Summary of Improvement

| Metric | Baseline | With Change | Improvement |
|---|---|---|---|
| Avg Response Time (ms) | 83.92 | 51.16 | ↓ 39% |
| P50 Latency (ms) | 51.77 | 39.10 | ↓ 24% |
| P90 Latency (ms) | 114.13 | 73.37 | ↓ 36% |
| P99 Latency (ms) | 621.66 | 391.96 | ↓ 37% |
| Single Request Latency (ms) | 33 | 20 | ↓ 39% |


@sundar24295s sundar24295s changed the title Optimize Score - Remove Decode [Generative Score API] Optimization to Remove Decode Aug 5, 2025
```python
self.token_to_kv_pool_allocator.free_group_begin()

# Process logprobs for scoring requests
if batch.return_logprob and logits_output is not None:
```
Suggested change:

```diff
-if batch.return_logprob and logits_output is not None:
+if batch.return_logprob and logits_output:
```
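A note on this suggestion: swapping `is not None` for a bare truthiness check is only safe if the object can never be falsy while present. A small illustration with a hypothetical `LogitsOutput` stand-in that defines `__len__`:

```python
class LogitsOutput:
    """Hypothetical stand-in for a logits container."""
    def __init__(self, values):
        self.values = values

    def __len__(self):
        # Truthiness of this object falls through to len(self.values).
        return len(self.values)

empty = LogitsOutput([])

# The identity check passes for an empty-but-present output...
check_none = empty is not None   # True
# ...while the bare truthiness check treats it as absent.
check_truthy = bool(empty)       # False
```

So the two forms are interchangeable only if `logits_output` has no `__bool__`/`__len__` making an empty instance falsy; otherwise the suggested change silently skips valid empty outputs.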

```python
logits_output, _, _ = self.tp_worker.resolve_last_batch_result(launch_done)
else:
    # Move logprobs to CPU if needed
    if batch.return_logprob and logits_output is not None:
```

Suggested change:

```diff
-if batch.return_logprob and logits_output is not None:
+if batch.return_logprob and logits_output:
```

```python
logits_output.next_token_token_ids_logprobs_idx is not None):

# Initialize all the logprob fields for scoring request
if req.input_token_logprobs_val is None:
```
What is the consequence if we just pass in a `None` value as input? Should we instead change the request's default value to `[]` where needed?
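For background on this question: in Python a mutable default like `[]` is created once at function definition and shared across calls, which is why a `None` sentinel is the usual idiom. A minimal illustration (function names hypothetical):

```python
def collect_bad(item, acc=[]):
    # Mutable default: the SAME list object is reused on every call.
    acc.append(item)
    return acc

def collect_good(item, acc=None):
    # None sentinel: a fresh list is created per call when none is given.
    if acc is None:
        acc = []
    acc.append(item)
    return acc

first = collect_bad(1)
second = collect_bad(2)   # state leaks: second == [1, 2], and first is second
fresh = collect_good(2)   # fresh list: [2]
```

So defaulting a request field to `[]` is fine for an immutable-by-convention dataclass field built per request, but dangerous as a plain function default; the `None` sentinel sidesteps that class of bug.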
