
Support min-p sampling #1167

Merged
hnyls2002 merged 3 commits into sgl-project:main from intervitens:min_p
Aug 21, 2024

Conversation

@intervitens (Contributor) commented Aug 20, 2024

Motivation

#1071

Modifications

Implemented the min-p sampling algorithm using both FlashInfer kernels and the native PyTorch sampling implementation. There is a slight slowdown when using min-p due to the current lack of a fused min-p/top-p/top-k kernel in FlashInfer. To avoid this slowdown when min-p is not used, I implemented a fallback to the top_k_top_p_sampling_from_probs kernel.
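For reference, the min-p filter keeps only tokens whose probability is at least min_p times the highest token probability, then renormalizes and samples. This is not the PR's kernel code, just a pure-PyTorch sketch of that idea (the function name is illustrative):

```python
import torch

def min_p_sample(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    """Sample one token per row, keeping only tokens whose probability
    is >= min_p * (max probability in that row)."""
    probs = torch.softmax(logits, dim=-1)
    max_probs = probs.max(dim=-1, keepdim=True).values
    # Zero out tokens below the scaled threshold, then renormalize.
    kept = torch.where(probs >= min_p * max_probs, probs, torch.zeros_like(probs))
    kept = kept / kept.sum(dim=-1, keepdim=True)
    return torch.multinomial(kept, num_samples=1).squeeze(-1)
```

With a very high min_p this degenerates to greedy sampling, since only the top token can pass the threshold.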

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@yzh119 (Collaborator) commented Aug 21, 2024

Is there a real case where the three filters (top-p/top-k/min-p) will be applied altogether? If so, what should be the order?

@intervitens (Contributor, Author)

> Is there a real case where the three filters (top-p/top-k/min-p) will be applied altogether? If so, what should be the order?

Generally, there's little reason to use all three at the same time, but users may still choose to do that. Either way, during batching, requests using min-p can be processed together with requests using top-p/top-k. Regarding the order, I copied the order used in vLLM and HF Transformers (top-k -> top-p -> min-p).

@zhyncs (Collaborator) commented Aug 21, 2024

I'm also interested: in what use case scenarios would it be used? Are there any specific examples?

@intervitens (Contributor, Author)

> I'm also interested: in what use case scenarios would it be used? Are there any specific examples?

Min-p, coupled with a higher temperature, is generally used for creative writing (a significant chunk of LLM usage), since it allows more varied and creative responses while still remaining coherent. But it is also a good replacement for top-k/top-p in general LLM usage. You can read the explanation and benchmarks in the paper.
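The reason min-p pairs well with high temperature is that its threshold scales with the maximum probability: when the distribution is peaked it keeps few tokens, and when temperature flattens the distribution it automatically keeps more. A small sketch of that behavior (example values are assumptions):

```python
import torch

def surviving_tokens(logits: torch.Tensor, min_p: float) -> int:
    """Count how many tokens pass the min-p filter for one distribution."""
    probs = torch.softmax(logits, dim=-1)
    return int((probs >= min_p * probs.max()).sum())

logits = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])
confident = surviving_tokens(logits / 0.5, min_p=0.1)  # low temperature: peaked, 2 tokens survive
creative = surviving_tokens(logits / 1.5, min_p=0.1)   # high temperature: flatter, 4 tokens survive
```

A fixed top-p cutoff does not adapt this way: at high temperature it can admit a long tail of low-quality tokens, which is the failure mode min-p is designed to avoid.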

@hnyls2002 hnyls2002 enabled auto-merge (squash) August 21, 2024 21:41
@hnyls2002 hnyls2002 merged commit 068e9ea into sgl-project:main Aug 21, 2024
@intervitens intervitens deleted the min_p branch August 22, 2024 02:52
@81549361 commented:
Awesome, I've been looking forward to this for a long time

@81549361 commented:

@intervitens
oobabooga/text-generation-webui#5677
Are you interested in implementing this sampler? It can solve the problem of some models easily repeating themselves in long chats, such as Nemo 12B.
