
Implement Adaptive-P Sampler #1100

Merged

ikawrakow merged 26 commits into ikawrakow:main from dungquixote42:main
Jan 10, 2026

Conversation

@dungquixote42
Contributor

@dungquixote42 dungquixote42 commented Dec 29, 2025

I just ported things over. Tested with a SillyTavern fork. Links below.

Acknowledgements:
@MrJackSpade - for the original implementation of the sampler: https://github.com/MrJackSpade/llama.cpp/
@ddh0 - for the mainline PR: ggml-org/llama.cpp#17927
@AesSedai - for the frontend: https://github.com/AesSedai/SillyTavern/tree/power-law-sampler

@Geechan

Geechan commented Dec 29, 2025

Implementing this will address #1074. Note that this sampler is not 100% finalised yet, so it may be best to merge now and make appropriate edits later to align with upstream once it's merged into upstream llama.cpp.

@Ph0rk0z

Ph0rk0z commented Dec 30, 2025

Pretty cool! Will try it!

@ikawrakow
Owner

Thank you for the PR. Will review when back from vacation.

@Geechan

Geechan commented Dec 30, 2025

I've built and tested this PR. Unfortunately, once context is fully ingested, upon any generation, I run into this fatal error which causes the program to crash:

/usr/include/c++/15.2.1/bits/stl_vector.h:1263: constexpr std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = float; _Alloc = std::allocator<float>; reference = float&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

Seems to be related to a buffer overflow at first glance.

EDIT: Fixed with latest commits.

@ddh0

ddh0 commented Jan 3, 2026

Docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/README.md

Reference implementation docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/sections/09_implementation.md

Our mainline PR: ggml-org/llama.cpp#17927

Feel free to ping me with questions if I can help. Thanks for this PR.

@Ph0rk0z

Ph0rk0z commented Jan 4, 2026

SillyTavern/SillyTavern@2fb4ab3 applied to llama.cpp seems more rambunctious than the one in AesSedai. 0.3/0.4 is still relatively normal and 0.05-0.15 is more extreme. Unlike with XTC, this one is a bit more subtle and behaves differently on finetuned models.

I think I'll have to try it on more hostile finetunes like qwen/glm instead of community cohere and llama tunes.

@ikawrakow
Owner

I still see commit activity here. Is it ready for review?

@Geechan

Geechan commented Jan 5, 2026

Not yet. There's still some further code refactoring that needs to be done to match it closer to the mainline PR. Forwarded from @ddh0, who wrote some more pertinent information to me earlier:

Note: not all these might be directly applicable for ik_llama vs. mainline, so take with an appropriate grain of salt.

  • The entire function llama_sample_softmax_nosort_impl(...) is probably not necessary; just use llama_sample_softmax_impl correctly (i.e. by not setting sorted manually - just do it like the other samplers are doing it).
  • const int64_t t_start_sample_us = ggml_time_us(); is set inside llama_sample_token_adaptive_p_impl, no other sampler does this (it is handled somewhere else; you don't need to worry about measuring timings).
  • This is still using samplaw as a variable name instead of smpl or sampler which is confusing for no reason and inconsistent.
  • The behaviour in clone does not match mainline. In particular result_ctx->pre_xform_probs - looks like it was copied based on the old code. The new code should be implemented instead.
  • The way seed is handled in llama_sampler_init_adaptive_p_impl seems incorrect and unnecessary (just pass the seed as given, don't special-case LLAMA_DEFAULT_SEED. I think this is handled elsewhere as well).
  • @dungquixote42 is keeping track of max_logit in llama_sampler_adaptive_p_apply, this is completely unnecessary and incorrect.
  • The constants are declared inside _apply each time which is wasteful. It's probably optimized away by the compiler but it's better to be explicit about it and declare them as static constexpr at the file level since they do not change.

For @dungquixote42, make sure to read the documentation linked here for any further implementation requirements.
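For illustration, the static constexpr point above amounts to the following sketch; the constant names and values here are hypothetical, not the PR's actual ones:

```cpp
#include <cassert>

// Hypothetical constants illustrating the suggestion: declared once at file
// level instead of being (notionally) re-declared inside _apply on every call.
static constexpr float kAdaptiveEps  = 1e-6f; // hypothetical numerical floor
static constexpr float kDefaultDecay = 0.9f;  // hypothetical EMA decay factor

// Functions then use the file-level constants directly.
float apply_floor(float p) {
    return p < kAdaptiveEps ? kAdaptiveEps : p;
}
```

The compiler almost certainly folds in-function constants anyway, but the file-level declaration makes the intent explicit.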

@Ph0rk0z

Ph0rk0z commented Jan 5, 2026

It seemed not to play so nicely with banned strings on ST; the model would run away with generations. DRY works when applied before it, but is it really necessary?

@MrJackSpade

It seemed not to play so nicely with banned strings on ST; the model would run away with generations. DRY works when applied before it, but is it really necessary?

As a result of how the original probabilities are stored and used as a moving average, anything that alters the probabilities before this sampler can have a negative effect on the output.

Here is an explanation from Claude based on the llama.cpp code:


Adaptive-P stores the probability distribution it receives and uses those values to update its EMA:

llama_sampler_softmax_impl(cur_p, false);
for (size_t i = 0; i < cur_p->size; ++i) {
    ctx->original_probs[i] = cur_p->data[i].p;
}
// ... later, after selection:
ctx->weighted_sum = ctx->original_probs[idx] + ctx->decay * ctx->weighted_sum;

The adaptive target formula is 2 * target - weighted_average. This compensates based on what probabilities the sampler thinks it has been selecting.

If anything modifies the distribution before Adaptive-P runs, the EMA tracks the modified probabilities rather than the model's actual probabilities. The adaptive mechanism then makes corrections based on incorrect information—it's compensating for a history that doesn't reflect what the model actually predicted.

If your target is 0.5 but the EMA is recording inflated probabilities, the adaptive mechanism thinks you're consistently selecting above-target tokens. It compensates by pushing the calculated target lower, which biases selection toward lower-probability tokens. This can cascade into runaway behavior as the EMA keeps recording distorted probabilities and over-correcting.
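A minimal, self-contained sketch of the EMA bookkeeping described above may make the feedback loop concrete; the names and the decay value here are illustrative, not the actual ik_llama.cpp symbols:

```cpp
#include <cassert>
#include <cmath>

// Sketch of the moving-average state described above. After each token is
// selected, the sampler records the *pre-transform* probability of the chosen
// token into an exponential moving average.
struct adaptive_p_state {
    float decay        = 0.9f; // illustrative EMA decay factor
    float weighted_sum = 0.0f; // running sum of decayed probabilities
    float weight_total = 0.0f; // normalizer so the average is well-defined early on
};

// Update the moving average with the original probability of the selected token.
void ema_update(adaptive_p_state & st, float selected_prob) {
    st.weighted_sum = selected_prob + st.decay * st.weighted_sum;
    st.weight_total = 1.0f          + st.decay * st.weight_total;
}

// The adaptive target formula quoted above: 2 * target - weighted_average.
float adaptive_target(const adaptive_p_state & st, float target) {
    const float avg = st.weight_total > 0.0f ? st.weighted_sum / st.weight_total : target;
    return 2.0f * target - avg;
}
```

With the distribution untouched, the computed target stays at the nominal value; if an upstream sampler inflates the recorded probabilities, the average rises and the computed target drops, which is exactly the over-correction described above.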

@Ph0rk0z

Ph0rk0z commented Jan 5, 2026

Wow.. so min_P and that's it. Waiting on devstral to finish to see if I can bypass "oh". Loves to start every reply with that and will be a very easy a/b test. On cohere I was able to use dry and bans but adaptive-p didn't seem to have much effect. L3 tune is where I had all the trouble, that model may not be super stable though and compounded the above effects.

Am also curious the effect this will have on making image prompts from the context. In RP that's where the instruction following has to shine. Some models make great dialogue and then fail that portion.

ohhhhhhh K

So this sampler can't fix the "Oh"/"OH" problem of devstral, but for some reason neither can token banning or logit bias, even when I put the token IDs in. Is it working here? Perhaps it's due to being the first token? When I request logprobs, the value for the tokens is the same.

@dungquixote42
Contributor Author

dungquixote42 commented Jan 6, 2026

The entire function llama_sample_softmax_nosort_impl(...) is probably not necessary; just use llama_sample_softmax_impl correctly (i.e. by not setting sorted manually - just do it like the other samplers are doing it).

llama_sample_softmax_impl either sorts the candidates or assumes they are already sorted; it does not support searching for the maximum. Which means:

- candidates->sorted = true after the transform (a lie) → the second softmax assumes the first logit is the maximum (it's not) → the sampler breaks; or
- candidates->sorted = false after the transform (the truth) → the second softmax sorts, losing the correct index into the pre-transform probabilities → the sampler breaks.

I believe llama_sample_softmax_impl is incompatible with this sampler and a nosort version is necessary.

EDIT:
I learned it is possible to use the token ID to update, which is compatible with llama_sample_softmax_impl. But that's how it is implemented in kcpp, which is having problems with the sampler. So I am inclined to keep my implementation.

const int64_t t_start_sample_us = ggml_time_us(); is set inside llama_sample_token_adaptive_p_impl, no other sampler does this (it is handled somewhere else; you don't need to worry about measuring timings).

Mirostat does it. llama_sample_token_adaptive_p_impl is a "selector", more akin to llama_sample_token_mirostat_impl or llama_sample_token_with_rng_impl. From what I can tell, samplers that need to know which token is selected implement a separate "selector" because samplers in ik_llama do not directly indicate which token is to be selected.

This is still using samplaw as a variable name instead of smpl or sampler which is confusing for no reason and inconsistent.

smpl is taken by DRY in llama_sampling_context and typically used for struct llama_sampling * types, not struct llama_sampler * types. sampler is not used as a variable name in general. w_smpl or smpl_w?

keeping track of max_logit in llama_sampler_adaptive_p_apply, this is completely unnecessary and incorrect.

If the maximum logit is tracked while the new logits are calculated, the subsequent loop in the softmax for finding the maximum logit can be bypassed. I think this is worthwhile. This looks correct to me.

I am working on addressing other comments.
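The max-logit point above can be sketched as follows; the function names are illustrative, not the PR's actual ones. Tracking the maximum during the transform loop lets the softmax skip its own max-search pass:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative transform pass: whatever per-token work happens here, the
// running maximum can be tracked for free, saving a dedicated search pass.
float transform_tracking_max(std::vector<float> & logits) {
    float max_logit = -INFINITY;
    for (float & l : logits) {
        // ... per-token logit transform would go here ...
        if (l > max_logit) max_logit = l;
    }
    return max_logit;
}

// Softmax with a caller-supplied maximum: one pass for exp+sum, one to
// normalize, and no pass to find the maximum.
void softmax_with_known_max(std::vector<float> & logits, float max_logit) {
    float sum = 0.0f;
    for (float & l : logits) { l = expf(l - max_logit); sum += l; }
    for (float & l : logits) { l /= sum; }
}
```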

@dungquixote42
Contributor Author

dungquixote42 commented Jan 7, 2026

@ikawrakow I see DRY does not implement the iface found in struct llama_sampler. Is iface actually required? I believe this sampler can be implemented better without it.

@ikawrakow
Owner

@ikawrakow I see DRY does not implement the iface found in struct llama_sampler. Is iface actually required? I believe this sampler can be implemented better without it.

The sampling situation in ik_llama.cpp is a bit of a mess. There have been some sporadic updates with stuff from mainline, with iface only being there to just make something compile.

@dungquixote42
Contributor Author

@ikawrakow Thanks. iface is removed in the latest commit, which also includes simplifications/improvements made possible by removing iface. I believe this PR is ready for review.

const int64_t t_start_sample_us = ggml_time_us();

// softmax with known maximum logit
llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));
Owner

@ikawrakow ikawrakow Jan 8, 2026


Where/when is adapt_p_ctx->max_logit initialized to a meaningful value?

NVM, I saw it below.

llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));

// sample
std::vector<float> probs;
Owner


Can this vector be made a member of the adaptive sampler context? So that a new allocation for each new token is not required.

Contributor Author


Yes. I will make it a member in follow-up commits.


// quadratic near target for finite differentiation, transitioning to linear decay in tails
// unbounded negative logits suppress far-from-target tokens after softmax
float max_logit = std::numeric_limits<float>::min();
Owner


Is the intent here to use the minimum float value that is not zero (the value of std::numeric_limits<float>::min() = 1.17549e-38 ) or perhaps more something like -INFINITY ?

Contributor Author


I did not know -INFINITY was a thing. Heh. Google showed me std::numeric_limits<float>::min(), and I said lgtm. Follow-up commits will have -INFINITY.

Contributor Author


Yes. I meant to use -INFINITY. C dev here. C++ noob.
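For reference, a small sketch of the gotcha discussed in this thread: std::numeric_limits<float>::min() is the smallest *positive* normalized float, so a running maximum seeded with it would never be replaced by a negative logit; -INFINITY (or std::numeric_limits<float>::lowest()) is the correct seed.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// min() is a tiny positive number, not the most negative float.
const float tiny_positive  = std::numeric_limits<float>::min();    // ~1.17549e-38
const float most_negative  = std::numeric_limits<float>::lowest(); // ~-3.40282e+38
const float neg_inf        = -INFINITY;                            // compares below everything
```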

probs.emplace_back(candidates->data[i].p);
}
std::discrete_distribution<> dist(probs.begin(), probs.end());
llama_token id = candidates->data[dist(smpl->rng)].id;
Owner


If the logits have not been filtered to a relatively small number of candidates, this will be a fairly computationally expensive operation with typical vocabulary sizes.

Contributor Author


This block is basically copied from llama_sample_token_with_rng_impl, apart from push_back vs emplace_back. Is emplace_back much slower than push_back, or did I miss something here?

Owner


It is OK to merge it like this. But having done quite a bit of Monte Carlo in a previous life, I couldn't help myself but comment.

It is not the emplace_back() that is slow, but the overall implementation (and yes, I know, mainline's implementation also inherited here is far from ideal). We are basically going 3 times over the whole array of token probabilities, to then construct a std::discrete_distribution object, to get just a single random sample from that. If the candidates have been reduced to a relatively small number via top_k or min_p or similar, this is fine. But if we are going over the entire vocabulary of ~200k tokens, this is going to add a noticeable extra time relative to say, 100 t/s generation speed. My guess is that the best thing to do would be to just compute the cumulative probability distribution on-the-fly, and then use binary search to find the candidate given a random number between 0 and 1 multiplied with the last element of the cumulative distribution.
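A self-contained sketch of the approach described above — building the cumulative distribution on the fly and locating the sample with a binary search; names and structure here are illustrative, not the PR's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One pass to build the (unnormalized) cumulative distribution, then
// std::upper_bound to locate the sampled index: O(n + log n) instead of
// several passes plus constructing a std::discrete_distribution for one draw.
size_t sample_inverse_cdf(const std::vector<float> & logits, float max_logit, std::mt19937 & rng) {
    std::vector<float> cum;
    cum.reserve(logits.size());
    float cum_prob = 0.0f;
    for (float l : logits) {
        cum_prob += expf(l - max_logit);
        cum.push_back(cum_prob); // cumulative value *after* adding this token
    }
    // safety margin so a draw at (nearly) the maximum cannot fall past the end
    cum.back() += 1.0f;
    std::uniform_real_distribution<float> u(0.0f, cum_prob);
    auto it = std::upper_bound(cum.begin(), cum.end(), u(rng));
    return (size_t) std::distance(cum.begin(), it);
}
```

In the sampler itself, the cumulative vector would live in the sampler context so no per-token allocation is needed, per the earlier review comment.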



I wouldn't mind updating the mainline implementation as well, as long as the distribution modification doesn't affect the result

The mainline implementation inherited a lot of inefficiency due to my own personal choice in models + hardware rarely exceeding ~5t/s. At those speeds, any optimization is a micro-optimization.

I'm having a difficult time visualizing your suggestion though.

Contributor Author

@dungquixote42 dungquixote42 Jan 9, 2026


    // first cum_prob is spacer
    const size_t count = candidates->size + 1;
    adapt_p_ctx->probs.reserve(count);

    // cumulative distribution
    const float max_logit = adapt_p_ctx->max_logit;
    float cum_prob = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        adapt_p_ctx->probs.emplace_back(cum_prob);
        cum_prob += expf(candidates->data[i].logit - max_logit);
    }
    const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();

    // my binary search
    bool done = false;
    size_t idx = (count >> 1) + 1;
    size_t stride = (count >> 1) + 1;
    while (!done) {
        stride = (stride >> 1) + 1;
        const float cprob = adapt_p_ctx->probs[idx];
        if (target_cprob > cprob) {
            idx += stride;
        }
        else if (target_cprob < cprob - adapt_p_ctx->probs[idx - 1]) {
            idx -= stride;
        }
        else {
            done = true;
        }
    }

    // ai slop
    auto it = std::lower_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
    size_t idx = std::distance(adapt_p_ctx->probs.begin(), it) - 2;

    llama_token id = candidates->data[idx].id;

It does not work yet, but is this what you had in mind?

Owner


Yes, something along these lines.
I think this should work:

    adapt_p_ctx->probs.reserve(candidates->size);

    // cumulative distribution
    const float max_logit = adapt_p_ctx->max_logit;
    float cum_prob = 0.0f;
    for (size_t i = 0; i < candidates->size; ++i) {
        cum_prob += expf(candidates->data[i].logit - max_logit);
        adapt_p_ctx->probs.emplace_back(cum_prob); // note: we emplace **after** adding the current probability
    }
    // add a safety margin to the last element just to be sure we avoid numerical
    // issues when the random number is (nearly) at maximum.
    adapt_p_ctx->probs.back() += 1.0f;
    const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();
    auto it = std::upper_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
    GGML_ASSERT(it != adapt_p_ctx->probs.end());
    llama_token id = candidates->data[std::distance(adapt_p_ctx->probs.begin(), it)].id;

@Ph0rk0z

Ph0rk0z commented Jan 9, 2026

I merged the latest stuff from main and the updates to the PR. My generation is way less likely to be unstable or run away.

@MrJackSpade

If someone wants to ping me pre-merge I can run the full suite of tests against the implementation, just to make sure it's still behaving as expected.

@dungquixote42
Contributor Author

My last push incorporates all the feedback as best I could.

If someone wants to ping me pre-merge I can run the full suite of tests against the implementation, just to make sure it's still behaving as expected.

@MrJackSpade Yes, please. Many thanks.

@MrJackSpade

MrJackSpade commented Jan 10, 2026 via email



6 participants