Conversation
Is there a real case where all three filters (top-p/top-k/min-p) would be applied together? If so, what should the order be?
Generally, there's little reason to use all three at the same time, but users may still choose to do that, and, either way, during batching there can be requests using min-p processed together with others that use top-p/top-k. Regarding the order, I copied the order used in vLLM and HF Transformers (top-k -> top-p -> min-p).
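The filter order described above can be sketched in plain Python. This is an illustrative sketch only, not the actual SGLang/vLLM/HF Transformers code: the function name `filter_logits` and its parameters are hypothetical, and real implementations operate on batched GPU tensors rather than Python lists.

```python
def filter_logits(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Return the token ids that survive all three filters, applied
    in the order top-k -> top-p -> min-p (hypothetical helper)."""
    # Sort token ids by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # 1. Top-k: keep only the k most probable tokens (0 disables).
    if top_k > 0:
        order = order[:top_k]

    # 2. Top-p (nucleus): keep the smallest prefix whose cumulative
    #    probability reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # 3. Min-p: drop tokens whose probability falls below
    #    min_p * max(probs) (0.0 disables).
    threshold = min_p * max(probs)
    return [i for i in kept if probs[i] >= threshold]
```

For example, with `probs = [0.5, 0.3, 0.15, 0.05]`, `top_k=3` first keeps tokens 0-2, `top_p=0.8` then trims that to tokens 0-1, and `min_p=0.2` (cutoff 0.2 * 0.5 = 0.1) keeps both survivors.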
I'm also interested: in what use-case scenarios would it be used? Are there any specific examples?
Min-p, coupled with a higher temperature, is generally used for creative writing (which is a significant chunk of LLM usage), since it allows for more varied and creative responses while still remaining coherent. But it is also a good replacement for top-k/top-p in general LLM usage. You can read the explanation and benchmarks in the paper.
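A small sketch of why min-p pairs well with a higher temperature: because the cutoff scales with the top token's probability, it adapts as temperature flattens the distribution. This is illustrative only; `min_p_survivors` is a hypothetical helper, not part of the SGLang API.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_survivors(logits, temperature, min_p):
    """Token ids whose probability is at least min_p * max(probs)."""
    probs = softmax(logits, temperature)
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]
```

With `logits = [5.0, 4.0, 1.0, -3.0]` and `min_p=0.1`, only the top two tokens survive at temperature 1.0, while temperature 2.0 flattens the distribution enough to admit a third token. Tokens far below the leader are still cut at either temperature, which is the "varied but coherent" behavior described above.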
Awesome, I've been looking forward to this for a long time |
@intervitens |
Motivation
#1071
Modifications
Implemented the min-p sampling algorithm using both flashinfer kernels and the native PyTorch sampling implementation. There is a slight slowdown when using min-p due to the current lack of a fused min-p/top-p/top-k kernel in flashinfer. To avoid this slowdown when min-p is not used, I implemented a fallback to the top_k_top_p_sampling_from_probs kernel.
Checklist