Implement Adaptive-P Sampler #1100
Conversation
|
Implementing this will address #1074. Note that this sampler is not 100% finalised yet, so it may be best to merge this now and make edits here to stay aligned with the mainline PR until it is merged in upstream llama.cpp. |
|
Pretty cool! Will try it! |
|
Thank you for the PR. Will review when back from vacation. |
|
I've built and tested this PR. Unfortunately, once context is fully ingested, upon any generation, I run into this fatal error which causes the program to crash:
Seems to be related to a buffer overflow at first glance. EDIT: Fixed with latest commits. |
|
Docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/README.md Reference implementation docs: https://github.com/MrJackSpade/adaptive-p-docs/blob/main/sections/09_implementation.md Our mainline PR: ggml-org/llama.cpp#17927 Feel free to ping me with questions if I can help. Thanks for this PR. |
|
SillyTavern/SillyTavern@2fb4ab3 applied to llama.cpp seems more rambunctious than the one in AesSedai. 0.3/0.4 is still relatively normal and 0.05-0.15 is more extreme. Unlike with XTC, this one is a bit more subtle and behaves differently on finetuned models. I think I'll have to try it on something finetuned more hostile like qwen/glm instead of community cohere and llamas. |
|
I still see commit activity here. Is it ready for review? |
|
Not yet. There's still some further code refactoring that needs to be done to match it closer to the mainline PR. Forwarded from @ddh0, who wrote some more pertinent information to me earlier: Note: not all these might be directly applicable for ik_llama vs. mainline, so take with an appropriate grain of salt.
For @dungquixote42, make sure to read the documentation applied here for any further implementation requirements. |
|
It seemed not to play so nice with banned strings on ST. The model's generations would run away. DRY works when applied before it, but is it really necessary? |
As a result of how the original probabilities are stored and used as a moving average, anything that alters the probabilities before this sampler runs can have a negative effect on the output. Here is an explanation from Claude, based on the llama.cpp code.
Adaptive-P stores the probability distribution it receives and uses those values to update its EMA:
llama_sampler_softmax_impl(cur_p, false);
for (size_t i = 0; i < cur_p->size; ++i) {
    ctx->original_probs[i] = cur_p->data[i].p;
}
// ... later, after selection:
ctx->weighted_sum = ctx->original_probs[idx] + ctx->decay * ctx->weighted_sum;
The adaptive target is computed from this EMA. If anything modifies the distribution before Adaptive-P runs, the EMA tracks the modified probabilities rather than the model's actual probabilities. The adaptive mechanism then makes corrections based on incorrect information: it compensates for a history that doesn't reflect what the model actually predicted. If your target is 0.5 but the EMA is recording inflated probabilities, the adaptive mechanism thinks you're consistently selecting above-target tokens. It compensates by pushing the calculated target lower, which biases selection toward lower-probability tokens. This can cascade into runaway behavior as the EMA keeps recording distorted probabilities and over-correcting. |
|
Wow.. so min_P and that's it. Waiting on devstral to finish to see if I can bypass "oh". It loves to start every reply with that, so it will be a very easy A/B test. On cohere I was able to use DRY and bans, but adaptive-p didn't seem to have much effect. The L3 tune is where I had all the trouble; that model may not be super stable though, which compounded the above effects. I am also curious about the effect this will have on making image prompts from the context. In RP that's where the instruction following has to shine. Some models make great dialogue and then fail that portion.
ohhhhhhh K
So this sampler can't fix the "Oh"/"OH" problem of devstral, but for some reason neither can token banning or logit bias, even when I put the token IDs in. Is it working here? Perhaps it's due to being the first token? When I request logprobs the value for the tokens is the same.
EDIT:
Mirostat does it.
If the maximum logit is tracked while the new logits are calculated, the subsequent loop in the softmax that searches for the maximum logit can be bypassed. I think this is worthwhile.
This looks correct to me. I am working on addressing other comments. |
|
@ikawrakow I see DRY does not implement |
The sampling situation in |
|
@ikawrakow Thanks. |
src/llama-sampling.cpp
Outdated
const int64_t t_start_sample_us = ggml_time_us();

// softmax with known maximum logit
llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));
Where/when is adapt_p_ctx->max_logit initialized to a meaningful value?
NVM, I saw it below.
src/llama-sampling.cpp
Outdated
llama_sample_softmax_nosort_impl(nullptr, candidates, &(adapt_p_ctx->max_logit));

// sample
std::vector<float> probs;
Can this vector be made a member of the adaptive sampler context? So that a new allocation for each new token is not required.
Yes. I will make it a member in follow-up commits.
src/llama-sampling.cpp
Outdated
|
|
// quadratic near target for finite differentiation, transitioning to linear decay in tails
// unbounded negative logits suppress far-from-target tokens after softmax
float max_logit = std::numeric_limits<float>::min();
Is the intent here to use the minimum positive float value (the value of std::numeric_limits<float>::min() = 1.17549e-38), or perhaps something more like -INFINITY?
I did not know -INFINITY was a thing. Heh. Google showed me std::numeric_limits<float>::min(), and I said LGTM. Follow-up commits will have -INFINITY.
Yes. I meant to use -INFINITY. C dev here. C++ noob.
src/llama-sampling.cpp
Outdated
    probs.emplace_back(candidates->data[i].p);
}
std::discrete_distribution<> dist(probs.begin(), probs.end());
llama_token id = candidates->data[dist(smpl->rng)].id;
If the logits have not been filtered to a relatively small number of candidates, this will be a fairly computationally expensive operation with typical vocabulary sizes.
This block is basically copied from llama_sample_token_with_rng_impl, minus push_back vs emplace_back. Is emplace_back much slower than push_back, or did I miss something here?
It is OK to merge it like this. But having done quite a bit of Monte Carlo in a previous life, I couldn't help myself but comment.
It is not the emplace_back() that is slow, but the overall implementation (and yes, I know, mainline's implementation also inherited here is far from ideal). We are basically going 3 times over the whole array of token probabilities, to then construct a std::discrete_distribution object, to get just a single random sample from that. If the candidates have been reduced to a relatively small number via top_k or min_p or similar, this is fine. But if we are going over the entire vocabulary of ~200k tokens, this is going to add a noticeable extra time relative to say, 100 t/s generation speed. My guess is that the best thing to do would be to just compute the cumulative probability distribution on-the-fly, and then use binary search to find the candidate given a random number between 0 and 1 multiplied with the last element of the cumulative distribution.
I wouldn't mind updating the mainline implementation as well, as long as the distribution modification doesn't affect the result
The mainline implementation inherited a lot of inefficiency due to my own personal choice in models + hardware rarely exceeding ~5t/s. At those speeds, any optimization is a micro-optimization.
I'm having a difficult time visualizing your suggestion though.
// first cum_prob is spacer
const size_t count = candidates->size + 1;
adapt_p_ctx->probs.reserve(count);
// cumulative distribution
const float max_logit = adapt_p_ctx->max_logit;
float cum_prob = 0.0f;
for (size_t i = 0; i < count; ++i) {
adapt_p_ctx->probs.emplace_back(cum_prob);
cum_prob += expf(candidates->data[i].logit - max_logit);
}
const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();
// my binary search
bool done = false;
size_t idx = (count >> 1) + 1;
size_t stride = (count >> 1) + 1;
while (!done) {
stride = (stride >> 1) + 1;
const float cprob = adapt_p_ctx->probs[idx];
if (target_cprob > cprob) {
idx += stride;
}
else if (target_cprob < cprob - adapt_p_ctx->probs[idx - 1]) {
idx -= stride;
}
else {
done = true;
}
}
// ai slop
auto it = std::lower_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
size_t idx = std::distance(adapt_p_ctx->probs.begin(), it) - 2;
llama_token id = candidates->data[idx].id;
It does not work yet, but is this what you had in mind?
Yes, something along these lines.
I think this should work:
adapt_p_ctx->probs.clear();
adapt_p_ctx->probs.reserve(candidates->size);
// cumulative distribution
const float max_logit = adapt_p_ctx->max_logit;
float cum_prob = 0.0f;
for (size_t i = 0; i < candidates->size; ++i) {
    cum_prob += expf(candidates->data[i].logit - max_logit);
    adapt_p_ctx->probs.emplace_back(cum_prob); // note: we emplace **after** adding the current probability
}
// add a safety margin to the last element just to be sure we avoid numerical issues
// when the random number is (nearly) at maximum.
adapt_p_ctx->probs.back() += 1.0f;
const float target_cprob = cum_prob * (float)adapt_p_ctx->rng() / (float)adapt_p_ctx->rng.max();
auto it = std::upper_bound(adapt_p_ctx->probs.begin(), adapt_p_ctx->probs.end(), target_cprob);
GGML_ASSERT(it != adapt_p_ctx->probs.end());
llama_token id = candidates->data[std::distance(adapt_p_ctx->probs.begin(), it)].id; |
I merged the latest stuff from main and the updates to the PR. My generation is way less likely to be unstable or run away. |
|
If someone wants to ping me pre-merge I can run the full suite of tests against the implementation, just to make sure it's still behaving as expected. |
|
My last push incorporates all the feedback as much as I could.
@MrJackSpade Yes, please. Many thanks. |
|
All right. Will run tonight and report back with confirmation ASAP
|
I just ported things over. Tested with a SillyTavern fork. Links below.
Acknowledgements:
@MrJackSpade - for the original implementation of the sampler: https://github.com/MrJackSpade/llama.cpp/
@ddh0 - for the mainline PR: ggml-org/llama.cpp#17927
@AesSedai - for the frontend: https://github.com/AesSedai/SillyTavern/tree/power-law-sampler