Some clarifying questions on the paper #46
Hi @ekagra-ranjan - Thanks for the kind words and your interest. I hope the following addresses all your questions.

**Lookahead**

**Relation to speedup**

Yes, exactly!

Yes, the overall speedup depends on the number of verification requests that were fully accepted. In other words, we have (A) verifications and (B) threads relying on draft tokens. A runs in parallel to B only if A does not ultimately terminate B. That's why selecting the minimal lookahead matters: the fewer draft tokens a verification covers, the more likely it is to be fully accepted.

**Optimal lookahead**

Given …

**Lookahead as a general drafting budget**

We use …

**Implementations without lookahead**

In practice, we can implement distributed speculative decoding without a lookahead. Each verification request is processed by at most one target server. With queueing, verification requests could be terminated before they are processed.

**Target servers**

DSI orchestrates servers that run the drafter and target models. It is agnostic to the number of GPUs used by each server. A simple way to think about it is as if we deploy independent vLLM servers. The simplest configuration of DSI has only two servers running the target model and one server running the drafter.
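To make the orchestration concrete, here is a toy asyncio sketch of that simplest configuration (one drafter, two target servers). It is illustrative only, not the DSI implementation; the latencies, acceptance rate, and helper names are all made up:

```python
import asyncio
import random
from collections import deque

DRAFT_LAT, TARGET_LAT = 0.01, 0.05  # made-up per-step latencies (seconds)
ACCEPT_P = 0.8                      # made-up per-verification acceptance rate

async def generate(n_tokens, n_target_servers=2):
    targets = asyncio.Semaphore(n_target_servers)  # two target servers

    async def verify():
        async with targets:                  # at most one target server per request
            await asyncio.sleep(TARGET_LAT)  # stand-in for a target forward pass
            return random.random() < ACCEPT_P

    accepted = 0
    inflight = deque()  # in-flight verification tasks, in draft order
    while accepted < n_tokens:
        await asyncio.sleep(DRAFT_LAT)                  # drafter emits the next token
        inflight.append(asyncio.create_task(verify()))  # schedule its verification
        while inflight and inflight[0].done():          # resolve in draft order
            if inflight.popleft().result():
                accepted += 1                    # fully accepted: drafting continues
            else:
                accepted += 1                    # rejected: keep the target's own token
                while inflight:                  # every later draft assumed the
                    inflight.popleft().cancel()  # rejected token, so cancel its check
    for task in inflight:                        # tidy up any leftover requests
        task.cancel()
    return accepted

print(asyncio.run(generate(20)))
```

The key property it mimics: each verification request occupies at most one target server (the semaphore), and a rejection cancels every later in-flight verification, since those drafts extended a token that is now invalid.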
Hi Nadav,
I recently came across your DSI paper. Congrats on the ICLR '25 acceptance! I also watched your NeurIPS '24 recording. The work looks interesting, and I have some questions.
Eq. 1 for finding the lookahead suggests that we run the drafter on 1 GPU and the target on multiple GPUs; the number of GPUs needed to run the target is proportional to (target latency) / (lookahead × draft latency).
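For concreteness, here is the arithmetic I have in mind, with made-up latencies (the 1 ms / 40 ms numbers are illustrative, not from the paper):

```python
import math

draft_latency = 0.001   # 1 ms per draft token (assumed)
target_latency = 0.040  # 40 ms per verification (assumed)
lookahead = 5

# One verification takes as long as 40 draft steps, and a new verification is
# issued every `lookahead` draft steps, so roughly this many target GPUs are
# kept busy at once:
n_target_gpus = math.ceil(target_latency / (lookahead * draft_latency))
print(n_target_gpus)  # -> 8
```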
1. I imagine that once the drafter generates a token, it schedules a verification thread on only one of the available GPUs for target verification. This would make sense, since we want at least (target latency) / (lookahead × draft latency) GPUs. However, the pseudocode says that we schedule the same verification on ALL the GPUs. Why do we need to schedule the same verification task on ALL the GPUs?
2. When the target rejects, we cancel all the threads that contain the incorrectly sampled token. Since the drafter uses chain drafts, this means we cancel ALL the running and waiting processes that were scheduled, right? In other words, there is no reuse of the previous computation.
3. Most of the explanation uses one draft token as the lookahead. If the lookahead is, say, 3, then the drafter samples 3 times before sending the draft off for verification. If the target accepts only 2 tokens, we have to cancel all the processes, as in point 2 above, since they all assumed the 3rd token would be accepted. So for the verification latency to be hidden, we need all the lookahead tokens to be accepted, right?
4. If point 3 is correct, how can we analyze the speedup from the acceptance length (AL) alone? If the lookahead is 3 and the AL is 2.2, the speedup is determined by the number of steps in which all 3 tokens were accepted, so we would need to study the distribution of accepted tokens per step, i.e., how many steps had 3 tokens accepted and how many had <3. Two runs can split differently into these two buckets and still have the same AL of 2.2 (see the toy example below). What do you think?
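To illustrate point 4, here are two made-up acceptance profiles at lookahead 3 that have the same AL of 2.2 but a very different share of fully accepted steps:

```python
# Two toy acceptance profiles (invented numbers): each list entry is the
# number of draft tokens accepted in one decoding step, with lookahead 3.
profile_a = [3] * 6 + [1] * 4  # 6 steps accept all 3 tokens, 4 steps accept 1
profile_b = [3] * 2 + [2] * 8  # 2 steps accept all 3 tokens, 8 steps accept 2

for name, profile in [("A", profile_a), ("B", profile_b)]:
    al = sum(profile) / len(profile)
    full = sum(1 for n in profile if n == 3)
    print(f"profile {name}: AL = {al}, fully accepted steps = {full}/{len(profile)}")
# Both profiles have AL = 2.2, but A hides the verification latency in 6/10
# steps and B in only 2/10, so (if point 3 holds) their speedups should differ.
```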
Looking forward to hearing your thoughts!