[Generative Score API] Optimization to Remove Decode#1
Closed
sundar24295s wants to merge 1 commit into main from
Conversation
lanking520 reviewed Aug 5, 2025
    self.token_to_kv_pool_allocator.free_group_begin()

    # Process logprobs for scoring requests
    if batch.return_logprob and logits_output is not None:
Suggested change:
-    if batch.return_logprob and logits_output is not None:
+    if batch.return_logprob and logits_output:
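The suggestion swaps an identity check (`is not None`) for a truthiness check, and the two are not always equivalent: an object that defines `__len__` or `__bool__` can exist yet be falsy. A small illustration (the `LogitsOutput` class here is hypothetical, not from this PR):

```python
class LogitsOutput:
    """Hypothetical container standing in for the PR's logits_output."""

    def __init__(self, logprobs):
        self.logprobs = logprobs

    def __len__(self):
        # An empty container is falsy even though it is not None.
        return len(self.logprobs)


empty = LogitsOutput([])
print(empty is not None)  # the object exists, so this is True
print(bool(empty))        # truthiness consults __len__, so this is False
```

If `logits_output` is a plain object without `__bool__`/`__len__`, the two checks behave identically; otherwise `is not None` is the safer, more explicit form.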
    logits_output, _, _ = self.tp_worker.resolve_last_batch_result(launch_done)
else:
    # Move logprobs to CPU if needed
    if batch.return_logprob and logits_output is not None:
Suggested change:
-    if batch.return_logprob and logits_output is not None:
+    if batch.return_logprob and logits_output:
    logits_output.next_token_token_ids_logprobs_idx is not None):

# Initialize all the logprob fields for scoring request
if req.input_token_logprobs_val is None:
What is the consequence if we just pass in a `None` value as input? Should we instead change the request's default value to `[]` where needed?
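The reviewer's question is about a common Python trade-off: defaulting a field to `None` forces a lazy-init guard at every use site, while defaulting to `[]` (via `default_factory`) lets consumers append directly. A sketch of both options (the `ScoringReq` class and its field names are illustrative, mirroring the diff but not taken from the PR's actual request class):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScoringReq:
    """Hypothetical request object illustrating the two defaulting styles."""
    # Option A: default to None; consumers must lazily initialize.
    input_token_logprobs_val: Optional[list] = None
    # Option B: default to an empty list; consumers can append directly.
    input_token_logprobs_idx: list = field(default_factory=list)


req = ScoringReq()

# Option A requires a guard before every append:
if req.input_token_logprobs_val is None:
    req.input_token_logprobs_val = []
req.input_token_logprobs_val.append(0.5)

# Option B needs no guard:
req.input_token_logprobs_idx.append(3)
```

Option B removes the guards but loses the ability to distinguish "never computed" (`None`) from "computed and empty" (`[]`), which may matter if downstream code branches on that distinction.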
🚀 Motivation
🔧 Modifications
Introduces dedicated objects for the Scoring API to simplify the implementation and enable targeted optimizations.
Remove Decode Optimization:
Scoring workloads typically require only log probabilities at specific token positions and do not involve token generation. This PR removes the decode phase entirely for such workloads, resulting in significantly lower latency and improved throughput.
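The core idea can be sketched independently of the PR's code: a single prefill forward pass already produces logits for every input position, and the log-probability of each input token can be read from the previous position's distribution, so no decode step is ever needed. A minimal self-contained sketch (function names are illustrative, not SGLang's):

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one logits row."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def score_prompt(prefill_logits, token_ids):
    """Prefill-only scoring: logits[i] predicts token_ids[i + 1], so the
    log-prob of each input token comes from the preceding position.
    No decode step runs because no new tokens are generated."""
    scores = []
    for pos in range(1, len(token_ids)):
        logprobs = log_softmax(prefill_logits[pos - 1])
        scores.append(logprobs[token_ids[pos]])
    return scores


# Toy example: vocabulary of 4, sequence of 3 tokens.
logits = [[0.1, 2.0, 0.3, 0.0],
          [1.0, 0.0, 0.0, 3.0],
          [0.0, 0.0, 0.0, 0.0]]
print(score_prompt(logits, [0, 1, 3]))
```

A generation path would instead sample a next token and re-enter the decode loop after the prefill; skipping that loop entirely is what yields the latency win for scoring-only workloads.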
The new code structure cleanly separates scoring-specific logic from the text generation path, ensuring that prefill-only optimizations do not interfere with generation behavior.
📈 Future Optimizations
This refactor lays the foundation for further improvements, including:
Memory Transfer Optimization:
Move all CPU-GPU synchronization to the post-processing loop to better overlap `run_batch` with post-processing.

Multi-Item Scoring:
Support scoring multiple items within a single prompt using custom attention masks via FlashInfer.
→ See flashinfer-ai/flashinfer#1015
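The multi-item idea is to pack several candidate items after one shared prefix in a single sequence, with an attention mask that is causal within the prefix and within each item, lets every item attend to the prefix, but blocks items from attending to one another. A toy mask builder (illustrative only; real kernel-side masking, e.g. in FlashInfer, operates on packed/block layouts rather than dense boolean matrices):

```python
def build_multi_item_mask(prefix_len, item_lens):
    """Dense boolean [seq, seq] mask for scoring several items packed
    after a shared prefix. mask[q][k] is True iff query position q may
    attend to key position k."""
    seq = prefix_len + sum(item_lens)
    # Item i occupies positions [starts[i], starts[i] + item_lens[i]).
    starts, off = [], prefix_len
    for n in item_lens:
        starts.append(off)
        off += n

    def segment(pos):
        """Return the item index containing pos, or -1 for the prefix."""
        return next((i for i, s in enumerate(starts)
                     if s <= pos < s + item_lens[i]), -1)

    mask = [[False] * seq for _ in range(seq)]
    for q in range(seq):
        q_seg = segment(q)
        for k in range(q + 1):  # causal: keys at or before q
            k_seg = segment(k)
            # Allowed if the key is in the shared prefix or the same item.
            mask[q][k] = (k_seg == -1) or (k_seg == q_seg)
    return mask


# Prefix of 2 tokens, two items of 2 tokens each (positions 2-3 and 4-5).
m = build_multi_item_mask(2, [2, 2])
# Item 1 (positions 4-5) sees the prefix but not item 0 (positions 2-3).
```

This amortizes the shared-prefix prefill across all items instead of re-running it once per item.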
Breaking Changes
The `/v1/score` endpoint maintains backward compatibility.

Accuracy Test
Profiling
Updated `bench_score.py` to properly test the new scoring pipeline.
Profiled `forward_extend` and `forward_decode`.

🧪 Benchmark Comparison: Qwen3-0.5B on H100 (CUDA 12.8)
Setup:
Results
🔍 Summary of Improvement
Checklist