A hopefully more efficient adaptive_p sampling #1161
Conversation
I get about 27-28 t/s with this version.
That's better than 23. But does it work the same?
The outputs are shorter. It's hard to tell. This is the only place I have tried adaptive_p. I attempted to get it going in exllama to compare, but either his implementation is broken or I'm fucking something up.
@Ph0rk0z There is no better comparison point than with mainline llama.cpp, which has a correct, working implementation and no performance delta. It would be prudent to start there.
One big difference from the mainline implementation is that it uses the actual original probabilities, which is often many more tokens than the sampler receives. Decreasing …
I got exllama working and there's no performance difference there either.
```diff
-    struct llama_sampler_adaptive_p * adapt_p_ctx)
-{
+    struct llama_sampler_adaptive_p * adapt_p_ctx) {
     constexpr float kDelta = 16.6f;
```
I see good results with kDelta=11.5. High hit rate, minimal performance drop.
With my hardware, I see a very minimal performance drop with the current implementation and pre-bugfix: about 0.5 t/s, so we're talking 11.5 t/s vs. 12 t/s. This PR resolves that slight performance delta for me, so it's back to around 12 t/s. As for generation quality with Adaptive P active, I see basically no difference, which is a good sign. I test with same-seeded generations to try and eliminate RNG where possible, and any minor differences can be accounted for by RNG differentials with context ingestion. The important thing is that the overall quality remains the same in my eyes. I think it is safe to experiment with changing kDelta to see if it further improves performance, or to implement as-is.
What is the "correct" implementation?

The implementation in mainline only uses tokens that have been left after all other samplers have been applied. As people typically have …

The implementation here is different. It uses all tokens above a very low threshold (currently …).

So, what is it that people want? If mainline's implementation, I can change it on the spot to do that, and it will be at least as fast as in mainline. If this implementation is preferable, I can merge this PR. It removes one of the major bottlenecks (sorting the tokens in the entire vocabulary). There is a second bottleneck left, which is the construction of a map containing all original token probabilities above the threshold.
Force-pushed from 72abdf2 to 61eccfc
The ideal implementation uses the original probabilities as calculated by the model. In practice, however, it should not make a difference which method is used. The only real-world scenario where the result differs by a meaningful amount is when the user performs some kind of insane sampling before the candidates reach the sampler. The tiny drift that happens when applying top-k/min-p and then (maybe?) applying softmax again should be smaller than the effects of RNG on the selection. I don't think it ultimately matters whether the approach taken is pragmatism or mathematical purity. I do agree about the map, though, and made the same comment when we were developing the llama.cpp version. A standard linear search is substantially faster than using the map; it requires at most one iteration over the array.
For reference, @MrJackSpade is the original creator of the idea for this sampler. I'll also second that either approach can work well and it's ultimately a matter of preference; the original llama.cpp implementation can be considered as correct as this implementation, as long as the core logic is intact. In my own further testing between llama.cpp and this implementation (using a similar methodology to the above), the actual output quality is within margin of error, accounting for prior sampling truncation, which is a further good sign.
OK, then this implementation is correct, and mainline's is not.
I have at least some doubts about this. I observe that recent open weight models have significantly higher Wikitext2 perplexity than the early open weight models. This is not because the new models don't know Wikipedia by heart, but because of their much larger vocabulary. GLM-4.7 has a vocabulary of 202k tokens compared to 32k tokens for LLaMA-1/2. Because of the large vocabulary, a non-negligible fraction of the probability is carried by the probability tail (i.e., low-probability tokens). Hence, the most likely tokens tend to have a lower probability, and hence the PPL is higher even when the top token predicts the test corpus correctly 100% of the time.

Given this, my expectation is that the original probabilities will be quite different from the probabilities of only the top-N tokens. Even more so when the model is diverted from the beaten path (which is the main purpose of the sampler).

Perhaps this does not make a noticeable difference in practice. But if we can make the implementation using the original probabilities fast enough, wouldn't that be preferable?
The quality on the slow implementation was really good. So basically it was throwing away any of my sampling and still running on the whole vocab? For some models, the baked-in top token probability is really high. 235B Qwen-VL was really bad about this, to the point of being unusable for me. Devstral is already a decent and creative model, sans spamming variations of "oh" at the beginning of replies. So what point am I trying to make? That whether you want the original model probabilities or post-sampling ones is going to vary. I guess you can make the target really low?
No, it wasn't. When you turn on …

In contrast, mainline's implementation does not use the logits as they came out of the model. It first runs all other samplers, then computes the probabilities needed to update the power-law distribution. Hence, the computationally expensive probability calculation is only done on whatever is left after the other samplers (200 tokens worst case with your sampler settings). This is not equivalent to the original implementation.

You keep saying that the original implementation was better. This implementation is mathematically 100% equivalent to the original implementation.
Haha, who knows what bottlenecks in the other samplers this will trigger.
I just turned off …
It seems pretty fast now. I had set topK mainly to help with DRY. Still a little mixed on this sampler: it got a little incoherent in longer chats and started making odd word choices. At 0.4 there isn't as much effect; at 0.3 it had the mentioned problem. Currently I'm tweaking by re-rolling outputs until at least some of them don't start with the same words. That might be the sweet spot.
I find it helpful to increase your Min P if you notice such coherency issues. Changing the decay value can also help dial in a particular target; lower decay will result in a shorter history, allowing the sampler to hit the target more frequently. This can help a lot with repetition and slop while not needing to decrease the target value as much. Try decay values around 0.75-0.85. The sampler with all the new PRs is now significantly faster while feeling potent. Great job on all the fast improvements, @ikawrakow.
I'd be curious to hear your opinion on how it compares to the top-n σ sampler. They both attempt to make more "creative" tokens more likely at points where there is no blatantly obvious best token and thus good "creative" alternatives exist.
@saood06 They're fundamentally different samplers. Top-n σ is a truncation sampler, focused on selecting tokens that are clustered together near the top of the distribution. This makes it effective at curtailing higher temperature settings, but all it's doing is truncating the tail end of the probabilities. Adaptive P is a probability-adjusting sampler, pushing token probabilities toward a chosen target value. Truncation samplers complement Adaptive P. There's some more information here, albeit not specifically about top-n σ.
I understand the mechanisms behind both. Top-n σ attempts to "maintain a stable sampling space regardless of temperature scaling". In order to do that, it has to do something that effectively masks the effect of temperature for select token distributions while amplifying it for others, which sounds a lot like adaptive_p. At the same moments where top-n σ masks temperature, adaptive_p would swing its weighted average toward amplifying temperature for later tokens (and the same applies in the other case).
IMO, this sampler isn't creative at all and is more of a "right answers" sampler. A side effect is that slop tends to bubble to the top. When you raise it past 1.0, it relaxes a little, but it's still very deterministic in terms of outputs. After using sigma a bunch I ended up not being a big fan for RP things. I read that messing with the decay too much would cause oscillation, so I stayed away from it. I haven't had as much slop or repetition, enough that I could turn DRY off. It still shows up in some outputs but doesn't keep coming back. The way I understand adaptive_p is that the target is separate from the most probable tokens, which would be the opposite of sigma. In practice the model is more likely to swear and say natural things depending on how low you go, until you end up too far and suddenly your drinks "melt" or you/me get switched, etc.
Ref. #1158
@Ph0rk0z Does this work better?