
[Generative Score API] Optimization to Remove Decode#1

Closed
sundar24295s wants to merge 1 commit into main from suramach/removedecode

Conversation

Owner

@sundar24295s sundar24295s commented Aug 5, 2025

🚀 Motivation

  • This PR is a follow-up to the Decoder-only Scoring API introduced in PR #6460, which was initially proposed and discussed in Issue #5973.
  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.

🔧 Modifications

  • Introduces dedicated objects for the Scoring API to simplify the implementation and enable targeted optimizations.

  • Remove Decode Optimization:
    Scoring workloads typically need only log probabilities at specific token positions and involve no token generation. This PR removes the decode phase entirely for such workloads, yielding significantly lower latency and higher throughput.

  • The new code structure cleanly separates scoring-specific logic from the text generation path, ensuring that prefill-only optimizations do not interfere with generation behavior.
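To make the prefill-only idea concrete: a scoring request can read the item-token log probabilities straight out of a single prefill forward pass, with no decode step at all. The sketch below is a toy numpy illustration of that gather under teacher forcing (random logits and a hypothetical position layout — not the actual SGLang internals):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

# Toy prefill output: logits for each of T positions over a V-word vocab.
rng = np.random.default_rng(0)
T, V = 6, 10
logits = rng.normal(size=(T, V))
tokens = np.array([3, 1, 4, 1, 5, 9])  # prompt = query tokens + item tokens

# Score the item tokens (here, the last 3) under teacher forcing:
# the logits at position t predict token t+1, so the log-prob of the
# token at position p is read from the logits at position p-1.
logp = log_softmax(logits)
item_positions = np.arange(T - 3, T)
score = logp[item_positions - 1, tokens[item_positions]].sum()
```

Because every needed log-prob comes out of the one prefill pass, there is nothing left for a decode step to compute — which is the whole optimization.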

📈 Future Optimizations

This refactor lays the foundation for further improvements, including:

  • Memory Transfer Optimization:
    Move all CPU-GPU synchronization to the post-processing loop to better overlap run_batch with post-processing.

  • Multi-Item Scoring:
    Support scoring multiple items within a single prompt using custom attention masks via FlashInfer.
    → See flashinfer-ai/flashinfer#1015
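To make the multi-item direction concrete, here is a hedged sketch of the mask shape such a custom attention mask could take: a shared query prefix visible to every item, with each item attending causally only to itself plus the prefix and never to sibling items. The function name and layout are illustrative, not FlashInfer's actual API:

```python
import numpy as np

def multi_item_mask(query_len, item_lens):
    """Boolean attention mask: mask[i, j] == True means position j is
    visible to position i.

    Layout: [query tokens | item_1 tokens | item_2 tokens | ...].
    """
    total = query_len + sum(item_lens)
    mask = np.zeros((total, total), dtype=bool)
    # Causal attention within the shared query prefix.
    mask[:query_len, :query_len] = np.tril(
        np.ones((query_len, query_len), dtype=bool))
    start = query_len
    for n in item_lens:
        end = start + n
        mask[start:end, :query_len] = True  # item sees the full prefix
        mask[start:end, start:end] = np.tril(
            np.ones((n, n), dtype=bool))    # causal within the item
        start = end
    return mask

# Query of 2 tokens followed by two items of 2 and 3 tokens.
m = multi_item_mask(2, [2, 3])
```

With a mask like this, the query prefix is prefilled once and amortized across all items in the prompt, instead of being re-encoded per item.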

Breaking Changes

  • None for public APIs - the /v1/score endpoint maintains backward compatibility
  • Internal request handling has been refactored but doesn't affect external interfaces

Accuracy Test

  • TBA

Profiling

  • Updated Benchmark Scripts: Enhanced bench_score.py to properly test the new scoring pipeline
  • Before this change, a single request runs both `forward_extend` and `forward_decode` (first trace image). After this change, only `forward_extend` remains (second trace image).

🧪 Benchmark Comparison: Qwen3-0.5B on H100 (CUDA 12.8)

Setup:

  • Model: Qwen3-0.5B (pruned, decoder-only)
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70
  • Item Count: 10
  • Distribution: Poisson
  • Transport: HTTP
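The open-loop load pattern above (Poisson arrivals at a target RPS over a fixed duration) can be sketched as follows; `poisson_arrivals` is an illustrative helper, not part of bench_score.py:

```python
import numpy as np

def poisson_arrivals(target_rps, duration_s, seed=0):
    """Request send times (seconds) for an open-loop Poisson load generator.

    A Poisson process has exponentially distributed inter-arrival gaps
    with mean 1 / target_rps.
    """
    rng = np.random.default_rng(seed)
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / target_rps)
        if t >= duration_s:
            break
        times.append(t)
    return times

# Matches the setup above: 70 RPS for 120 s (~8400 requests expected).
arrivals = poisson_arrivals(target_rps=70, duration_s=120)
```

An open-loop generator like this keeps sending on schedule regardless of how slowly the server responds, which is what makes tail-latency percentiles (P90/P99) meaningful under load.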

Results

  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.
  • For batch size 10 and the same 300-token input, the table below compares the latency percentiles before and after the change.

🔍 Summary of Improvement

| Metric | Baseline | With Change | Improvement |
|---|---|---|---|
| Avg Response Time (ms) | 83.92 | 51.16 | ↓ 39% |
| P50 Latency (ms) | 51.77 | 39.10 | ↓ 24% |
| P90 Latency (ms) | 114.13 | 73.37 | ↓ 36% |
| P99 Latency (ms) | 621.66 | 391.96 | ↓ 37% |
| Single Request Latency (ms) | 33 | 20 | ↓ 39% |


@sundar24295s sundar24295s changed the title Optimize Score - Remove Decode [Generative Score API] Optimization to Remove Decode Aug 5, 2025
```python
self.token_to_kv_pool_allocator.free_group_begin()

# Process logprobs for scoring requests
if batch.return_logprob and logits_output is not None:
```
Suggested change:

```diff
-if batch.return_logprob and logits_output is not None:
+if batch.return_logprob and logits_output:
```
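A note on this suggestion: swapping `is not None` for a bare truthiness check is only safe if the object can never be falsy while present. A small illustration with a hypothetical `LogitsOutput` stand-in that defines `__len__`:

```python
class LogitsOutput:
    """Hypothetical stand-in for a logits container."""
    def __init__(self, values):
        self.values = values

    def __len__(self):
        # Truthiness of this object falls through to len(self.values).
        return len(self.values)

empty = LogitsOutput([])

# The identity check passes for an empty-but-present output...
check_none = empty is not None   # True
# ...while the bare truthiness check treats it as absent.
check_truthy = bool(empty)       # False
```

So the two forms are interchangeable only if `logits_output` has no `__bool__`/`__len__` making an empty instance falsy; otherwise the suggested change silently skips valid empty outputs.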

```python
logits_output, _, _ = self.tp_worker.resolve_last_batch_result(launch_done)
else:
    # Move logprobs to CPU if needed
    if batch.return_logprob and logits_output is not None:
```

Suggested change:

```diff
-if batch.return_logprob and logits_output is not None:
+if batch.return_logprob and logits_output:
```

```python
logits_output.next_token_token_ids_logprobs_idx is not None):

# Initialize all the logprob fields for scoring request
if req.input_token_logprobs_val is None:
```
What is the consequence if we just pass in a `None` value as input? Should we instead change the request's default value to `[]` where needed?
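For background on this question: in Python a mutable default like `[]` is created once at function definition and shared across calls, which is why a `None` sentinel is the usual idiom. A minimal illustration (function names hypothetical):

```python
def collect_bad(item, acc=[]):
    # Mutable default: the SAME list object is reused on every call.
    acc.append(item)
    return acc

def collect_good(item, acc=None):
    # None sentinel: a fresh list is created per call when none is given.
    if acc is None:
        acc = []
    acc.append(item)
    return acc

first = collect_bad(1)
second = collect_bad(2)   # state leaks: second == [1, 2], and first is second
fresh = collect_good(2)   # fresh list: [2]
```

So defaulting a request field to `[]` is fine for an immutable-by-convention dataclass field built per request, but dangerous as a plain function default; the `None` sentinel sidesteps that class of bug.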
