
A hopefully more efficient adaptive_p sampling #1161

Merged
ikawrakow merged 4 commits into main from ik/adaptive_p
Jan 19, 2026

Conversation

@ikawrakow
Owner

Ref. #1158

@Ph0rk0z Does this work better?

@Ph0rk0z

Ph0rk0z commented Jan 18, 2026

I get about 27-28 t/s with this version.

@ikawrakow
Owner Author

I get about 27-28 t/s with this version.

That's better than 23. But does it work the same?

@Ph0rk0z

Ph0rk0z commented Jan 18, 2026

The outputs are shorter. It's hard to tell. This is the only place I have tried adaptive_P. Attempted to get it going in exllama to compare but either his implementation is broken or I'm fucking something up.

@NarpasSword

Chiming in to say I also noticed a regression in TG performance after #1155

Just tested, and this PR does restore some of the performance, but not quite to the level before #1155

@Geechan

Geechan commented Jan 18, 2026

The outputs are shorter. It's hard to tell. This is the only place I have tried adaptive_P. Attempted to get it going in exllama to compare but either his implementation is broken or I'm fucking something up.

@Ph0rk0z There is no better comparison point than with mainline llama.cpp, which has a correct, working implementation and no performance delta. It would be prudent to start there.

@dungquixote42
Contributor

One big difference from the mainline implementation is that it uses the actual original probabilities, which often means many more tokens than the sampler receives. Decreasing kDelta prunes more tokens. Perhaps it could be made into its own server argument.

@Ph0rk0z

Ph0rk0z commented Jan 18, 2026

I got exllama working and there's no performance difference there either.

```cpp
struct llama_sampler_adaptive_p * adapt_p_ctx) {
    constexpr float kDelta = 16.6f;
```
Contributor


I see good results with kDelta=11.5. High hit rate, minimal performance drop.

@Geechan

Geechan commented Jan 19, 2026

On my hardware, I see a very minimal performance drop between the current implementation and pre-bugfix: about 0.5 t/s, so we're talking 11.5 t/s vs. 12 t/s. This PR resolves that slight performance delta for me, so it's back to around 12 t/s.

As for generation quality with Adaptive P active, I see basically no difference, which is a good sign. I test with same-seeded generations to try to eliminate RNG where possible, and any minor differences can be accounted for by RNG differentials during context ingestion. The important thing is that the overall quality remains the same in my eyes. I think it is safe to experiment with changing kDelta to see if it further improves performance, or to implement as-is.

@ikawrakow
Owner Author

There is no better comparison point than with mainline llama.cpp, which has a correct, working implementation and no performance delta. It would be prudent to start there.

What is the "correct" implementation?

The implementation in mainline only uses the tokens left after all other samplers have been applied. As people typically have min_p and/or top_k in the sampler chain, the number of tokens processed by adaptive_p will be small (@Ph0rk0z mentioned a top_k of 200; even 200 is a really small number of tokens). The probability statistics used for the power law distribution are computed from the remaining tokens only.

The implementation here is different. It uses all tokens above a very low threshold (currently e^-16.6; in the comment above @dungquixote42 says e^-11.5 would be OK too) to update the power law statistics. That means it needs to go over a huge number of tokens, which is why it is slower than mainline.

So, what is it that people want?

If mainline's implementation, I can change it on the spot to do that, and it will be at least as fast as in llama.cpp.

If this implementation is preferable, I can merge this PR. It removes one of the major bottlenecks (sorting the tokens in the entire vocabulary).

There is a second bottleneck left: the construction of a map containing all original token probabilities above the e^-16.6 threshold. Maps are good when used many times to search for things. The map here is used exactly once (to look up the original probability after a token has been sampled), so the high cost of constructing it is not amortized. My guess is that even a linear search would be faster than the std::unordered_map being used. I can do that in a follow-up PR.

@MrJackSpade

The ideal implementation uses the original probabilities as calculated by the model.

However, in practice, it should not make a difference which method is used. The only real world scenario in which the result is going to differ by any meaningful value, is going to be scenarios where the user is performing some kind of insane sampling before the candidates reach the sampler. The tiny drift that happens when applying top-k/min-p and then (maybe?) applying softmax again should be smaller than the effects of RNG on the selection.

I don't think it ultimately matters whether the approach taken is pragmatism or mathematical purity.

I do agree about the map though and made the same comment when we were developing the llama.cpp version. A standard linear search is substantially faster than using the map, it requires a max of one iteration over the array.

@Geechan

Geechan commented Jan 19, 2026

The ideal implementation uses the original probabilities as calculated by the model.

However, in practice, it should not make a difference which method is used. The only real world scenario in which the result is going to differ by any meaningful value, is going to be scenarios where the user is performing some kind of insane sampling before the candidates reach the sampler. The tiny drift that happens when applying top-k/min-p and then (maybe?) applying softmax again should be smaller than the effects of RNG on the selection.

I don't think it ultimately matters whether the approach taken is pragmatism or mathematical purity.

I do agree about the map though and made the same comment when we were developing the llama.cpp version. A standard linear search is substantially faster than using the map, it requires a max of one iteration over the array.

For reference, @MrJackSpade is the original creator of this sampler. I'll also second that either approach can work well and it's ultimately a matter of preference; the original llama.cpp implementation can be considered as correct as this one, as long as the core logic is intact.

In my own further personal testing between llama.cpp and this implementation (using a similar methodology to above), the actual output quality is within margin of error, accounting for prior sampling truncation, which is a further good sign.

@ikawrakow
Owner Author

The ideal implementation uses the original probabilities as calculated by the model.

OK, then this implementation is correct, and mainline's is not.

However, in practice, it should not make a difference which method is used.

I have at least some doubts about this. I observe that recent open-weight models have significantly higher Wikitext2 perplexity than the early open-weight models. This is not because the new models don't know Wikipedia by heart, but because of their much larger vocabulary. GLM-4.7 has a vocabulary of 202k tokens compared to 32k tokens for LLaMA-1/2. Because of the large vocabulary, a non-negligible fraction of the probability is carried by the tail (i.e., low-probability tokens). Hence the most likely tokens tend to have a lower probability, and hence the PPL is higher even when the top token predicts the test corpus correctly 100% of the time. Given this, my expectation is that the original probabilities will be quite different from probabilities computed over only the top-N tokens. Even more so when the model is diverted from the beaten path (which is the main purpose of the adaptive_p sampler), where the original distribution is even flatter (the tail carries a larger portion than normal).

Perhaps this does not make a noticeable difference in practice. But if we can make the implementation of the original model fast enough, wouldn't that be preferable?

@Ph0rk0z

Ph0rk0z commented Jan 19, 2026

The quality on the slow implementation was really good. So basically it was throwing away any of my sampling and still running on the whole vocab?

For some models, the baked in top token probability is really high. 235b qwen-vl was really bad about this, to the point of it being unusable for me. Devstral is already a decent and creative model sans spamming variations of "oh" at the beginning of replies.

So what point am I trying to make? That whether you want to do original model or post sampling is going to vary. I guess you can make the target really low?

@ikawrakow
Owner Author

So basically it was throwing away any of my sampling and still running on the whole vocab?

No, it wasn't. When you turn on adaptive_p, in this implementation a function is called before any other sampler; it operates on the full vocabulary but does not modify the logits. This function computes probabilities using the original logits as they came out of the model. Later, after all other samplers have run, adaptive_p updates the statistics used for the power law distribution based on the original probabilities computed in the first step, but samples only from whatever logits were left by the other samplers, just like mainline.

In contrast, mainline's implementation does not use the logits as they came out of the model. It first runs all other samplers, then computes the probabilities needed to update the power law distribution. Hence the computationally expensive probability calculation is only done on whatever is left after the other samplers (200 tokens worst case with your sampler settings). This is not equivalent to the original adaptive_p design (see @MrJackSpade's comment above).

You keep saying that the original implementation was better. This implementation is mathematically 100% equivalent to the original implementation.

@Ph0rk0z

Ph0rk0z commented Jan 19, 2026

Ok that clears it up. So I may as well turn the top_k off. Only have it on to speed up other samplers.

I said #1155 had good outputs compared to what was before, despite being slow. Prior implementation was faster but way less effective. #1165 brought back almost all the speed.

@ikawrakow
Owner Author

Ok that clears it up. So I may as well turn the top_k off. Only have it on to speed up other samplers.

Haha, who knows what bottlenecks in the other samplers this will trigger.

@ikawrakow
Owner Author

@Ph0rk0z

I just turned off top_k, but otherwise using the default sampling chain. We get 10 ms/token sampling time, up from 0.07 ms.

ikawrakow merged commit fa58c20 into main Jan 19, 2026
@Ph0rk0z

Ph0rk0z commented Jan 19, 2026

It seems pretty fast now. I had set topK mainly to help with DRY. Still a little mixed on this sampler. Got a little incoherent in longer chats and started making odd word choices. At 0.4 there isn't as much effect. At 0.3 had the mentioned problem. Currently tweaking by re-rolling outputs until at least some of them don't start with the same words. That might be the sweet spot.

@Geechan

Geechan commented Jan 19, 2026

It seems pretty fast now. I had set topK mainly to help with DRY. Still a little mixed on this sampler. Got a little incoherent in longer chats and started making odd word choices. At 0.4 there isn't as much effect. At 0.3 had the mentioned problem. Currently tweaking by re-rolling outputs until at least some of them don't start with the same words. That might be the sweet spot.

I find it helpful to increase your Min P if you notice such coherency issues. Changing the decay value can also help dial in a particular target; lower decay will result in a shorter history, allowing the sampler to hit the target more frequently. This can help a lot with repetition and slop while not needing to decrease the target value as much. Try decay values around 0.75-0.85.

The sampler with all the new PRs is now significantly faster while feeling potent. Great job on all the fast improvements, @ikawrakow.

@saood06
Collaborator

saood06 commented Jan 19, 2026

It seems pretty fast now. I had set topK mainly to help with DRY. Still a little mixed on this sampler. Got a little incoherent in longer chats and started making odd word choices. At 0.4 there isn't as much effect. At 0.3 had the mentioned problem. Currently tweaking by re-rolling outputs until at least some of them don't start with the same words. That might be the sweet spot.

I'd be curious to hear your opinion on how it compares to the top-n σ sampler. They both attempt to make more "creative" tokens more likely at points where there is no blatantly obvious best token and thus there are good options for "creative" tokens.

@Geechan

Geechan commented Jan 19, 2026

I'd be curious to hear your opinion on how it compares to the top-n σ sampler. They both attempt to make more "creative" tokens more likely at points where there is no blatantly obvious best token and thus there are good options for "creative" tokens.

@saood06 They're both fundamentally different samplers. Top-n σ is a truncation sampler, focused on selecting tokens that are clustered together in a sigmoid curve. This makes it effective at curtailing higher temperature settings, but all it's doing is truncating the tail end of the probabilities. Adaptive P is a probability adjusting sampler, placing more emphasis for token probabilities to favour a chosen target value. Truncation samplers complement Adaptive P.

There's some more information here, albeit not specifically about top-n σ.

@saood06
Collaborator

saood06 commented Jan 19, 2026

I'd be curious to hear your opinion on how it compares to the top-n σ sampler. They both attempt to make more "creative" tokens more likely at points where there is no blatantly obvious best token and thus there are good options for "creative" tokens.

@saood06 They're both fundamentally different samplers. Top-n σ is a truncation sampler, focused on selecting tokens that are clustered together in a sigmoid curve. This makes it effective at curtailing higher temperature settings, but all it's doing is truncating the tail end of the probabilities. Adaptive P is a probability adjusting sampler, placing more emphasis for token probabilities to favour a chosen target value. Truncation samplers complement Adaptive P.

I understand the mechanisms behind both. Top-n σ attempts to "maintain a stable sampling space regardless of temperature scaling". In order to do that, it has to do something that effectively masks the effect of temperature for some token distributions while amplifying it for others, which sounds a lot like adaptive_p. In those same moments where top-n σ masks temperature, adaptive_p would swing its weighted average toward amplifying temperature for later tokens (and the same applies in the other case).

@Ph0rk0z

Ph0rk0z commented Jan 19, 2026

top-n σ sampler

IMO, this sampler isn't creative at all and is more of a "right answers" sampler. A side effect is that slop tends to bubble to the top. When you raise it past 1.0 it relaxes a little, but it's still very deterministic in terms of outputs. After using sigma a bunch, I ended up not being a big fan for RP things.

I read messing with the decay too much would cause oscillation so I stayed away from it. Haven't had as much slop or repetition, enough that I could turn DRY off. Still shows up in some outputs but doesn't keep coming back.

The way I understand adaptive_p is that the target is separate from the most probable tokens, which would be the opposite of sigma. In practice the model is more likely to swear and say natural things depending on how low you go, until you go too far and suddenly your drinks "melt" or you/me get switched, etc.

