[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. #10198
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@sroy745 @LiuXiaoxuanPKU @njhill
njhill left a comment
Thanks @jeongin601 this looks like a very nice finding!
We may still want to make and use a (shallow) copy of the sampling parameters with the seed removed in the case a seed is set, to avoid doing seeded sampling for the non-final tokens.
@njhill, I'm curious about the reason why the seed should be removed, especially if it is used for the target model sampling and affects the output token selection when proposals are rejected.
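njhill's suggestion above would amount to something like the following sketch; the helper name and the direct `seed` attribute access are assumptions for illustration, not vLLM's actual implementation:

```python
import copy

def scoring_sampling_params(request_params):
    """Hypothetical helper: sampling params to use when scoring with the target model.

    If the request is seeded, return a shallow copy with the seed cleared so
    the scoring-only (non-final) tokens are not drawn with seeded RNG.
    """
    if getattr(request_params, "seed", None) is None:
        return request_params  # nothing to strip, reuse the request's params
    params = copy.copy(request_params)  # shallow copy is sufficient here
    params.seed = None
    return params
```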
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]> Signed-off-by: jeong_in.bae <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
Force-pushed from 289341d to a54d83e
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
@joennlae ah sorry, perhaps I misremembered the logic; I didn't think those sampled tokens could end up getting used. I'll check it again, but if you're right then it makes sense to ignore that seed optimization.
Adding /ready to kick off the tests and verify that nothing else fails from this change.
@njhill See vllm/model_executor/layers/rejection_sampler.py, lines 183 to 189 in 3a763ba.
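For readers without the inline embed, the referenced region deals with the rejected-proposal path, where a replacement ("recovered") token is drawn from the residual between the target and draft distributions. A minimal sketch of that standard step (plain PyTorch, not a copy of vLLM's code; it assumes the two distributions differ so the residual is non-zero) is:

```python
import torch

def sample_recovered_token(target_probs, draft_probs, generator=None):
    # Residual distribution: mass where the target assigns more probability
    # than the draft, renormalized to sum to 1.
    residual = torch.clamp(target_probs - draft_probs, min=0.0)
    residual = residual / residual.sum(dim=-1, keepdim=True)
    # The replacement token is drawn from this residual distribution, so the
    # target probabilities (and any seed used here) do influence emitted tokens.
    return torch.multinomial(residual, num_samples=1, generator=generator)
```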
Hi, cc: @tdoublep who made the change for respecting the per-request seed in the spec-decode worker. @tdoublep, can you PTAL and see whether this change impacts the per-request seeding logic? @jeongin601, there is one test failure in the spec_decoding tests (test_many_k[1-32-2-test_llm_kwargs3-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]). I ran the test locally and it passes, and from the failure logs it seems transient. Can you please trigger the tests once more to see if it passes?
Thank you @sroy745, I was able to check correctly after your comments. I found out that this PR also corrects the seed for … I also confirmed that this section remains unchanged by this PR and is already using the correct sampling parameters. This PR cannot affect the …
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
Signed-off-by: jeongin601 <[email protected]>
@sroy745 I retriggered the tests, but they still fail on a seeded speculative decoding test. It seems this happens because my code reproduces the same results even when the sampling seed is set to None. But if the seed value is None, sampling should use the default seed value, so why should it produce different results?
This test was added in this PR. |
Hi, the output is determined by the probability distribution of the target model. Prior to this change, the temperature of the target model would be set to 1 (https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/vllm/spec_decode/batch_expansion.py?L317) and hence the probability distribution would be uniform, as shown in the logs below. When the output distribution is uniform, having a seed matters in order to guarantee a deterministic output. However, after your change, for the failure case you are now setting the sampling temperature for the target model to 0.1. This means that the probability distribution of the sampled tokens is no longer uniform (it is highly skewed). When that happens, I think it does not matter if there is a seed or not; we will always sample the token with prob 1.0. I tried running the test with temperature 0.6 instead of 0.1 and it passes.
I was always wondering why we use different sampling parameters for the speculative model vs. main model. I reviewed the paper referenced in the rejection sampling code and it doesn't explicitly say either way. I guess one has the freedom to sample the speculative model however one likes, but the results in this PR certainly suggest it makes sense to use the same sampling params as the main model. Really cool! @jeongin601
I don't think that is correct: temperature=1.0 doesn't imply a uniform distribution, the distribution is determined by the logits and the temperature is just a scaling factor in the softmax that transforms the logits to probabilities.
Yes, setting the temperature to 0.1 will make the test more deterministic than it would be with temperature 1.0. However, it definitely doesn't mean we will always sample the same token with probability exactly 1.0, but rather some high probability. @sroy745 I think your explanation makes sense: the failing test checks that we get different output each time when not applying the seed. This is not guaranteed to happen when using a low temperature (e.g., approaching the greedy limit), so the test is not really well-defined. The changes in this PR will make everything more deterministic at low temperature, which explains why it was passing before (e.g., because we were using temperature=1.0 to sample the speculative model). I would recommend we modify that test to run only when the temperature is >=1.0.
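A toy illustration of the temperature point above (plain PyTorch, not vLLM code): temperature only rescales the logits before the softmax, so 1.0 does not mean uniform, and 0.1 makes the distribution highly skewed but still not a hard argmax with probability exactly 1.0.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

for temp in (1.0, 0.6, 0.1):
    probs = torch.softmax(logits / temp, dim=-1)
    print(f"temperature={temp}: probs={[round(p, 4) for p in probs.tolist()]}")
# temperature=1.0: clearly non-uniform; temperature=0.1: the top probability is
# close to (but not exactly) 1.0, so asserting "outputs must differ without a
# seed" becomes unreliable at low temperature.
```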
Signed-off-by: jeongin601 <[email protected]>
Thank you for all the reviews and analysis. I understood why my code kept failing the seeded correctness test. As you all pointed out, I also believe we need to test at higher temperatures. To address this, I modified the test to run at temperatures of 0.6, 1.0, and 1.2.
Signed-off-by: jeongin601 <[email protected]>
sroy745 left a comment
Thanks for the PR. Left one comment about one test; otherwise LGTM.
| @pytest.mark.parametrize("output_len", [64]) | ||
| @pytest.mark.parametrize("batch_size", [1, 32]) | ||
| @pytest.mark.parametrize("temperature", [0.1, 1.0]) | ||
| @pytest.mark.parametrize("temperature", [0.6, 1.0, 1.2]) |
nit - wondering if it would be ok to run it only for temperature 1.0?
I updated it! :) What do you think about the A100 distributed test failure?
Hi, I think the failures in distributed-tests-a100 may not be related to this PR (I see errors related to CustomAllreduce in the failure logs). I see this test fail in some other PRs as well. The other two failures seem related to timeouts.
Signed-off-by: jeongin601 <[email protected]>
LiuXiaoxuanPKU left a comment
LGTM! Sorry for the late reply here, yes, using the same sampling params for the draft and target model makes a lot of sense to me.
Hi @simon-mo, cc: @LiuXiaoxuanPKU
Thanks @jeongin601 @llsj14 @sroy745 @tdoublep for all of the analysis here! @jeongin601, could you merge in the latest main, which should hopefully address the test failure?
…s for consistency in rejection sampling. (vllm-project#10198) Signed-off-by: jeongin601 <[email protected]> Signed-off-by: jeong_in.bae <[email protected]> Signed-off-by: Andrew Feldman <[email protected]>
…s for consistency in rejection sampling. (vllm-project#10198) Signed-off-by: jeongin601 <[email protected]> Signed-off-by: jeong_in.bae <[email protected]>
FIX #9834
Problem
The current BatchExpansionTop1Scorer implements a speculative scoring mechanism that uses batch expansion to estimate the probabilities of speculative tokens based on the scoring model. However, in the existing setup, SequenceGroupMetadata applies default sampling parameters (top_p=1.0, temperature=1.0, repetition_penalty=1.0) when generating target probabilities. According to comments in the code, this choice seems to have been made because the sampled tokens are not used directly.
Modification
Although we do not directly sample tokens from the target model while scoring, I believe applying consistent sampling parameters to both draft and target probabilities is essential for accurate rejection sampling. The current implementation uses draft probabilities influenced by sampling (filtered by top_p), while target probabilities are not, leading to a mismatch that could affect scoring accuracy. Because the unsampled target probabilities don’t represent actual usage probabilities, I modified the code to apply the same sampling parameters to both draft and target probabilities for consistency in rejection sampling.
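To make the mismatch concrete, here is a minimal sketch (not vLLM's exact implementation) of the standard rejection-sampling accept rule: the proposed token is accepted with probability min(1, p_target / p_draft), so applying top-p and temperature only to the draft probabilities while leaving the target probabilities at default parameters systematically distorts that ratio and, with it, the acceptance rate.

```python
import torch

def accept_mask(target_probs, draft_probs, draft_token_ids, generator=None):
    # Probabilities of the proposed tokens under each model, shape [batch, k].
    p_target = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    p_draft = draft_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    # Standard accept rule: accept with probability min(1, p_target / p_draft).
    accept_prob = torch.clamp(p_target / p_draft, max=1.0)
    uniform = torch.rand(accept_prob.shape, generator=generator)
    return uniform < accept_prob
```

If the draft probabilities have already been shaped by top_p and temperature but the target probabilities have not, the ratio above compares two differently post-processed distributions, which is the inconsistency this PR removes.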
In my experiment, this change resulted in a significant difference in the acceptance rate, as shown in the figures below.
Experiment
Setting
As-Is
To-be (applied in this PR)