Some clarifying questions on the paper #46
Hi @ekagra-ranjan - Thanks for the kind words and your interest. I hope the following addresses all your questions.

**Lookahead**

**Relation to speedup**

Yes, exactly!

Yes, the overall speedup depends on the number of verification requests that were fully accepted. In other words, we have (A) verifications and (B) threads relying on draft tokens. A runs in parallel to B only if A does not ultimately terminate B. That's why selecting the minimal lookahead matters: the fewer draft tokens a verification covers, the more likely it is to be fully accepted.

**Optimal lookahead**

Given …

**Lookahead as a general drafting budget**

We use …

**Implementations without lookahead**

In practice, we can implement distributed speculative decoding without a lookahead. Each verification request is processed by at most one target server. With queueing, verification requests could be terminated before they are processed.

**Target servers**

DSI orchestrates servers that run the drafter and target models. It is agnostic to the number of GPUs used by each server. A simple way to think about it is as if we deploy independent vLLM servers. The simplest configuration of DSI has only two servers running the target model and one server running the drafter.
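To make the orchestration concrete, here is a toy asyncio sketch of that simplest configuration (one drafter, two target servers). It is illustrative only, not the DSI implementation; the latencies, acceptance rate, and helper names are all made up:

```python
import asyncio
import random
from collections import deque

DRAFT_LAT, TARGET_LAT = 0.01, 0.05  # made-up per-step latencies (seconds)
ACCEPT_P = 0.8                      # made-up per-verification acceptance rate

async def generate(n_tokens, n_target_servers=2):
    targets = asyncio.Semaphore(n_target_servers)  # two target servers

    async def verify():
        async with targets:                  # at most one target server per request
            await asyncio.sleep(TARGET_LAT)  # stand-in for a target forward pass
            return random.random() < ACCEPT_P

    accepted = 0
    inflight = deque()  # in-flight verification tasks, in draft order
    while accepted < n_tokens:
        await asyncio.sleep(DRAFT_LAT)                  # drafter emits the next token
        inflight.append(asyncio.create_task(verify()))  # schedule its verification
        while inflight and inflight[0].done():          # resolve in draft order
            if inflight.popleft().result():
                accepted += 1                    # fully accepted: drafting continues
            else:
                accepted += 1                    # rejected: keep the target's own token
                while inflight:                  # every later draft assumed the
                    inflight.popleft().cancel()  # rejected token, so cancel its check
    for task in inflight:                        # tidy up any leftover requests
        task.cancel()
    return accepted

print(asyncio.run(generate(20)))
```

The key property it mimics: each verification request occupies at most one target server (the semaphore), and a rejection cancels every later in-flight verification, since those drafts extended a token that is now invalid.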
Hi Nadav,
I recently came across your DSI paper. Congrats on the ICLR '25 acceptance! I also watched your NeurIPS '24 recording. The work looks interesting, and I have some questions.
Eq. 1 for finding the lookahead suggests that we run the drafter on 1 GPU and the target on multiple GPUs; the number of GPUs needed to run the target is proportional to (target latency) / (lookahead × draft latency).
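For concreteness, here is the arithmetic I have in mind, with made-up latencies (the 1 ms / 40 ms numbers are illustrative, not from the paper):

```python
import math

draft_latency = 0.001   # 1 ms per draft token (assumed)
target_latency = 0.040  # 40 ms per verification (assumed)
lookahead = 5

# One verification takes as long as 40 draft steps, and a new verification is
# issued every `lookahead` draft steps, so roughly this many target GPUs are
# kept busy at once:
n_target_gpus = math.ceil(target_latency / (lookahead * draft_latency))
print(n_target_gpus)  # -> 8
```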
1. I imagine that once the drafter generates a token, it schedules a verification thread on only one of the available GPUs for target verification. This would make sense, since we want at least (target latency) / (lookahead × draft latency) GPUs. However, the pseudocode says that we schedule the same verification on ALL the GPUs. Why do we need to schedule the same verification task on ALL the GPUs?
2. When the target rejects, we cancel all the threads that contain the incorrectly sampled token. Since the drafter uses chain drafts, this means we cancel ALL the running and waiting processes that were scheduled, right? In other words, there is no reuse of the previous computation.
3. Most of the explanation uses one draft token as the lookahead. If the lookahead is, say, 3, then the drafter samples 3 times before sending the draft off for verification. If the target accepts only 2 tokens, we have to cancel all the processes, as in point 2 above, since they all assumed the 3rd token would be accepted. So for the verification latency to be hidden, we need all the lookahead tokens to be accepted, right?
4. If point 3 is correct, how can we analyze the speedup from the acceptance length (AL) alone? If the lookahead is 3 and the AL is 2.2, the speedup is determined by the number of steps in which all 3 tokens were accepted, so we would need to study the distribution of accepted tokens per step, i.e., how many steps had 3 tokens accepted and how many had <3. Two runs can split differently into these two buckets and still have the same AL of 2.2 (see the toy example below). What do you think?
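To illustrate point 4, here are two made-up acceptance profiles at lookahead 3 that have the same AL of 2.2 but a very different share of fully accepted steps:

```python
# Two toy acceptance profiles (invented numbers): each list entry is the
# number of draft tokens accepted in one decoding step, with lookahead 3.
profile_a = [3] * 6 + [1] * 4  # 6 steps accept all 3 tokens, 4 steps accept 1
profile_b = [3] * 2 + [2] * 8  # 2 steps accept all 3 tokens, 8 steps accept 2

for name, profile in [("A", profile_a), ("B", profile_b)]:
    al = sum(profile) / len(profile)
    full = sum(1 for n in profile if n == 3)
    print(f"profile {name}: AL = {al}, fully accepted steps = {full}/{len(profile)}")
# Both profiles have AL = 2.2, but A hides the verification latency in 6/10
# steps and B in only 2/10, so (if point 3 holds) their speedups should differ.
```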
Looking forward to hearing your thoughts!