Implement Adaptive-P Sampler #1100
Conversation
|
Implementing this will address #1074. Note that this sampler is not 100% finalised yet, so it may be best to merge this now and make edits here to stay aligned with the mainline PR until it is merged in upstream llama.cpp. |
|
Pretty cool! Will try it! |
|
Thank you for the PR. Will review when back from vacation. |
|
I've built and tested this PR. Unfortunately, once context is fully ingested, upon any generation, I run into this fatal error which causes the program to crash:
Seems to be related to a buffer overflow at first glance. EDIT: Fixed with latest commits. |
|
Docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/README.md Reference implementation docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/sections/09_implementation.md Our mainline PR: ggml-org/llama.cpp#17927 Feel free to ping me with questions if I can help. Thanks for this PR. |
|
SillyTavern/SillyTavern@2fb4ab3 applied to llama.cpp seems more rambunctious than the one in AesSedai. 0.3/0.4 is still relatively normal and 0.05-0.15 is more extreme. Unlike with XTC, this one is a bit more subtle and behaves differently on finetuned models. I think I'll have to try it on something finetuned more hostile like qwen/glm instead of community cohere and llamas. |
|
I still see commit activity here. Is it ready for review? |
|
Not yet. There's still some further code refactoring that needs to be done to match it closer to the mainline PR. Forwarded from @ddh0, who wrote some more pertinent information to me earlier: Note: not all these might be directly applicable for ik_llama vs. mainline, so take with an appropriate grain of salt.
For @dungquixote42, make sure to read the documentation applied here for any further implementation requirements. |
|
It seemed not to play so nice with banned strings on ST. The model's generations would run away. DRY works when applied before it, but is it really necessary? |
As a result of how the original probabilities are stored and used as a moving average, anything that alters the probabilities before this sampler runs can have a negative effect on the output. Here is an explanation from Claude, based on the llama.cpp code.
Adaptive-P stores the probability distribution it receives and uses those values to update its EMA:
llama_sampler_softmax_impl(cur_p, false);
for (size_t i = 0; i < cur_p->size; ++i) {
    ctx->original_probs[i] = cur_p->data[i].p;
}
// ... later, after selection:
ctx->weighted_sum = ctx->original_probs[idx] + ctx->decay * ctx->weighted_sum;
The adaptive target is computed from this EMA. If anything modifies the distribution before Adaptive-P runs, the EMA tracks the modified probabilities rather than the model's actual probabilities. The adaptive mechanism then makes corrections based on incorrect information: it compensates for a history that doesn't reflect what the model actually predicted. If your target is 0.5 but the EMA is recording inflated probabilities, the adaptive mechanism thinks you're consistently selecting above-target tokens. It compensates by pushing the calculated target lower, which biases selection toward lower-probability tokens. This can cascade into runaway behavior as the EMA keeps recording distorted probabilities and over-correcting. |
|
Wow.. so min_P and that's it. Waiting on devstral to finish to see if I can bypass "oh". It loves to start every reply with that, so it will be a very easy A/B test. On cohere I was able to use DRY and bans, but adaptive-p didn't seem to have much effect. The L3 tune is where I had all the trouble; that model may not be super stable though, which compounded the above effects. I am also curious about the effect this will have on making image prompts from the context. In RP that's where the instruction following has to shine. Some models make great dialogue and then fail that portion.
ohhhhhhh K
So this sampler can't fix the "Oh"/"OH" problem of devstral, but for some reason neither can token banning or logit bias, even when I put the token IDs in. Is it working here? Perhaps it's due to being the first token? When I request logprobs the value for the tokens is the same.
EDIT:
Mirostat does it.
If the maximum logit is tracked while the new logits are calculated, the subsequent loop in the softmax that searches for the maximum logit can be bypassed. I think this is worthwhile.
This looks correct to me. I am working on addressing other comments. |
|
@ikawrakow I see DRY does not implement |
The sampling situation in |
|
@ikawrakow Thanks. |
src/llama-sampling.cpp
Outdated
const int64_t t_start_sample_us = ggml_time_us();

// softmax with known maximum logit
llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));
Where/when is adapt_p_ctx->max_logit initialized to a meaningful value?
NVM, I saw it below.
src/llama-sampling.cpp
Outdated
llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));

// sample
std::vector<float> probs;
Can this vector be made a member of the adaptive sampler context? So that a new allocation for each new token is not required.
Yes. I will make it a member in follow-up commits.
src/llama-sampling.cpp
Outdated
|
|
// quadratic near target for finite differentiation, transitioning to linear decay in tails
// unbounded negative logits suppress far-from-target tokens after softmax
float max_logit = std::numeric_limits<float>::min();
Is the intent here to use the minimum positive float value (the value of std::numeric_limits<float>::min() = 1.17549e-38), or perhaps something more like -INFINITY?
I did not know -INFINITY was a thing. Heh. Google showed me std::numeric_limits<float>::min(), and I said LGTM. Follow-up commits will have -INFINITY.
Yes. I meant to use -INFINITY. C dev here. C++ noob.
src/llama-sampling.cpp
Outdated
    probs.emplace_back(candidates->data[i].p);
}
std::discrete_distribution<> dist(probs.begin(), probs.end());
llama_token id = candidates->data[dist(smpl->rng)].id;
If the logits have not been filtered to a relatively small number of candidates, this will be a fairly computationally expensive operation with typical vocabulary sizes.
This block is basically copied from llama_sample_token_with_rng_impl, minus push_back vs emplace_back. Is emplace_back much slower than push_back, or did I miss something here?
It is OK to merge it like this. But having done quite a bit of Monte Carlo in a previous life, I couldn't help myself but comment.
It is not the emplace_back() that is slow, but the overall implementation (and yes, I know, mainline's implementation also inherited here is far from ideal). We are basically going 3 times over the whole array of token probabilities, to then construct a std::discrete_distribution object, to get just a single random sample from that. If the candidates have been reduced to a relatively small number via top_k or min_p or similar, this is fine. But if we are going over the entire vocabulary of ~200k tokens, this is going to add a noticeable extra time relative to say, 100 t/s generation speed. My guess is that the best thing to do would be to just compute the cumulative probability distribution on-the-fly, and then use binary search to find the candidate given a random number between 0 and 1 multiplied with the last element of the cumulative distribution.
I wouldn't mind updating the mainline implementation as well, as long as the distribution modification doesn't affect the result
The mainline implementation inherited a lot of inefficiency due to my own personal choice in models + hardware rarely exceeding ~5t/s. At those speeds, any optimization is a micro-optimization.
I'm having a difficult time visualizing your suggestion though.
// first cum_prob is spacer
const size_t count = candidates->size + 1;
adapt_p_ctx->probs.reserve(count);
// cumulative distribution
const float max_logit = adapt_p_ctx->max_logit;
float cum_prob = 0.0f;
for (size_t i = 0; i < count; ++i) {
adapt_p_ctx->probs.emplace_back(cum_prob);
cum_prob += expf(candidates->data[i].logit - max_logit);
}
const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();
// my binary search
bool done = false;
size_t idx = (count >> 1) + 1;
size_t stride = (count >> 1) + 1;
while (!done) {
stride = (stride >> 1) + 1;
const float cprob = adapt_p_ctx->probs[idx];
if (target_cprob > cprob) {
idx += stride;
}
else if (target_cprob < cprob - adapt_p_ctx->probs[idx - 1]) {
idx -= stride;
}
else {
done = true;
}
}
// ai slop
auto it = std::lower_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
size_t idx = std::distance(adapt_p_ctx->probs.begin(), it) - 2;
llama_token id = candidates->data[idx].id;
It does not work yet, but is this what you had in mind?
Yes, something along these lines.
I think this should work:
adapt_p_ctx->probs.clear();
adapt_p_ctx->probs.reserve(candidates->size);
// cumulative distribution
const float max_logit = adapt_p_ctx->max_logit;
float cum_prob = 0.0f;
for (size_t i = 0; i < candidates->size; ++i) {
    cum_prob += expf(candidates->data[i].logit - max_logit);
    adapt_p_ctx->probs.emplace_back(cum_prob); // note: we emplace **after** adding the current probability
}
// add a safety margin to the last element just to be sure we avoid numerical issues
// when the random number is (nearly) at maximum.
adapt_p_ctx->probs.back() += 1.0f;
const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();
auto it = std::upper_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
GGML_ASSERT(it != adapt_p_ctx->probs.end());
llama_token id = candidates->data[std::distance(adapt_p_ctx->probs.begin(), it)].id; |
I merged the latest stuff from main and the updates to the PR. My generation is way less likely to be unstable or run away. |
|
If someone wants to ping me pre-merge I can run the full suite of tests against the implementation, just to make sure it's still behaving as expected. |
|
My last push incorporates all the feedback as much as I could.
@MrJackSpade Yes, please. Many thanks. |
|
All right. Will run tonight and report back with confirmation ASAP
|
I just ported things over. Tested with a SillyTavern fork. Links below.
Acknowledgements:
@MrJackSpade - for the original implementation of the sampler: https://github.com/MrJackSpade/llama.cpp/
@ddh0 - for the mainline PR: ggml-org/llama.cpp#17927
@AesSedai - for the frontend: https://github.com/AesSedai/SillyTavern/tree/power-law-sampler