Conversation
@81549361 Hi, thanks for the contribution. However, DRY sampling is not currently a mainstream method. To keep our current code minimal, we may be unable to accept this PR. One option would be to expose a custom logit processor interface, so that users can easily register their own external custom logit processors without changing the sglang code.
@merrymercy Yes, it would be helpful, at the very least, to have a way to register our own logit processors.
Not sure what your definition of mainstream is, but DRY is already supported by:
... and probably other projects that I'm not aware of. DRY is also commonly recommended by model authors on their HF model cards. See the original DRY pull request for more information.
Hi @p-e-w, @81549361, @supa-thibaud: custom logit processors are supported as of #2396. DRY sampling can be implemented with a custom logit processor. We can still accept new sampling methods, and we can merge some built-in custom logit processors into the sglang source code. Contributions are welcome!
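To illustrate the idea of a pluggable logit processor interface, here is a minimal sketch. Note that this is not sglang's actual API from #2396; the names `register_logit_processor` and `apply_processors` are hypothetical and exist only to show how external processors could be registered without touching core code.

```python
from typing import Callable, Dict, List

# A logit processor takes (logits, generated_token_ids) and returns modified logits.
LogitProcessor = Callable[[List[float], List[int]], List[float]]

# Hypothetical registry; not sglang's real interface.
_PROCESSORS: Dict[str, LogitProcessor] = {}

def register_logit_processor(name: str, fn: LogitProcessor) -> None:
    """Register a user-supplied processor under a name."""
    _PROCESSORS[name] = fn

def apply_processors(logits: List[float], token_ids: List[int],
                     names: List[str]) -> List[float]:
    """Run the named processors over the logits in order."""
    for name in names:
        logits = _PROCESSORS[name](logits, token_ids)
    return logits

# Example: a user-defined processor that forbids token id 2.
register_logit_processor(
    "ban_token_2",
    lambda logits, ids: [x if i != 2 else float("-inf")
                         for i, x in enumerate(logits)],
)
```

A sampler like DRY would then be just another registered callable, selected per request by name.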
Motivation
Looping is an undesirable behavior where the model repeats phrases verbatim that have previously occurred in the input. It affects most models, and is exacerbated by the use of truncation samplers. Chat formats are particularly susceptible due to their regular structure, which models appear to interpret as an invitation to repeat previous messages in whole or in part. Prompting the model to avoid looping has little or no effect.
The traditional weapons against looping are the three flavors of repetition penalty built into most loaders (multiplicative, additive, and frequency penalty). But those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce. I have previously attempted to fix this problem with ggml-org/llama.cpp#5561, which protects the basic structure of language from being penalized, but that is a hacky solution that fails to do the right thing in many cases. Even in their raw form, classical repetition penalties don't actually prevent looping reliably.
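For reference, the three classic penalties can be sketched as follows. Parameter names and defaults here are illustrative, not tied to any particular loader; the multiplicative convention (divide positive logits, multiply negative ones) follows common practice.

```python
from typing import Dict, List

def apply_repetition_penalties(logits: List[float], context: List[int],
                               rep_penalty: float = 1.1,       # multiplicative
                               presence_penalty: float = 0.1,  # additive, per distinct token
                               freq_penalty: float = 0.05      # scales with occurrence count
                               ) -> List[float]:
    """Apply the three classic repetition penalties to every token seen in context."""
    counts: Dict[int, int] = {}
    for tok in context:
        counts[tok] = counts.get(tok, 0) + 1
    for tok, cnt in counts.items():
        # Multiplicative penalty: shrink positive logits, push negative ones lower.
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty
        # Additive (presence) and count-scaled (frequency) penalties.
        logits[tok] -= presence_penalty + freq_penalty * cnt
    return logits
```

Note the bluntness: every occurrence of a token is penalized identically, whether it is part of a verbatim loop or ordinary grammar like "the" or a quotation mark.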
In the past weeks, I have rethought the looping problem from the ground up, and in this PR present the DRY repetition penalty, a mechanism that is able to detect textual looping and steer against it. It is far superior to the existing samplers at preventing verbatim repetition, while having essentially none of their negative effects on language structure. The result is less repetitive and higher quality output.
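The core idea can be sketched in a few lines: find tokens that would extend a repeat of the context's current suffix, and penalize them exponentially in the length of the match. This is an illustrative O(n²) sketch under assumed parameter names (`multiplier`, `base`, `allowed_length`), not the PR's actual implementation.

```python
from typing import Dict, List

def dry_penalty(logits: List[float], context: List[int],
                multiplier: float = 0.8, base: float = 1.75,
                allowed_length: int = 2) -> List[float]:
    """Penalize tokens that would continue a verbatim repeat of the current suffix."""
    n = len(context)
    match_len: Dict[int, int] = {}  # candidate token -> longest repeat it would extend
    for i in range(n - 1):
        # Length of the match between the context ending at i-1
        # and the context's current suffix (ending at n-1).
        length = 0
        while length < i and context[i - 1 - length] == context[n - 1 - length]:
            length += 1
        if length >= allowed_length:
            # Emitting context[i] next would extend this repeat by one token.
            tok = context[i]
            match_len[tok] = max(match_len.get(tok, 0), length)
    for tok, length in match_len.items():
        # Penalty grows exponentially with match length beyond the allowed threshold.
        logits[tok] -= multiplier * base ** (length - allowed_length)
    return logits
```

Because only tokens that would *continue* an already-long repeat are touched, ordinary grammatical reuse of common words is left alone, which is precisely what the classic penalties fail to do.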
I have tested this sampler for about 20 hours in chat scenarios so far, and the resulting chats have without question been the highest quality I have ever experienced. Looping in the traditional sense simply does not happen with DRY, and the positive effects of being able to drop the standard repetition penalty are very noticeable.
Modifications
Checklist