Fix max_length in chat.py to be just truncation length #5382
bartowski1182 wants to merge 1 commit into oobabooga:dev
Conversation
If we want to maintain the idea that, when new tokens would exceed your capacity, the prompt should be truncated, I propose:

max_length = min(truncate_prompt_length, max_seq_len - min_length)

which would by default limit the prompt to the truncate length, and in the worst case limit the prompt to whatever the model can support as configured, minus the minimum number of new tokens you want to generate.

Ideally, though, it seems it would be most appropriate to introduce new logic so that truncate_prompt_length + max_new_tokens is always equal to max_seq_len. That way, if you have a max sequence length of 4000, truncate_prompt_length would be set to something below that, like 3000 (3/4 seems like a decent starting place?), and max_new_tokens would be set to the remainder, 1000 (or 1/4). Then at least in the UI it's obvious what's going on.

Or, at minimum, truncate_prompt_length shouldn't be set to the total context of the model, because then it doesn't leave any room for generating new tokens; it should be set to some fraction of the total available context (but these are design decisions outside the scope of this fix).
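The two ideas above can be sketched in a few lines. This is only an illustration of the proposal, not the actual code in chat.py; the function and parameter names (`get_max_prompt_length`, `default_split`, `prompt_fraction`) are assumptions made for the sketch.

```python
def get_max_prompt_length(truncate_prompt_length, max_seq_len, min_length):
    """Limit the prompt to the truncate length, but never squeeze out the
    minimum number of new tokens the user asked for (min_length)."""
    return min(truncate_prompt_length, max_seq_len - min_length)


def default_split(max_seq_len, prompt_fraction=0.75):
    """Derive defaults so truncate_prompt_length + max_new_tokens
    always equals max_seq_len (3/4 prompt, 1/4 generation here)."""
    truncate_prompt_length = int(max_seq_len * prompt_fraction)
    max_new_tokens = max_seq_len - truncate_prompt_length
    return truncate_prompt_length, max_new_tokens
```

With a 4000-token context, `default_split(4000)` yields a 3000-token prompt budget and 1000 tokens for generation, matching the 3/4 split suggested above.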
It is true that chat completion inputs are being completely removed when they don't fit into the available space, which is an issue. The input should be left-truncated instead. #5066 aims to solve this problem, but in a way that I find too complicated. There is probably a simple (few lines) way to modify this loop: text-generation-webui/modules/chat.py, line 173 in e7a760e
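The left-truncation being suggested could look something like the following. This is a hedged sketch, not the actual patch from #5066 or the real loop in chat.py; it just shows the idea of keeping the most recent tokens rather than dropping whole messages that don't fit.

```python
def left_truncate(tokens, max_prompt_length):
    """Keep only the last max_prompt_length tokens, discarding the
    oldest first, so recent context always survives."""
    if len(tokens) <= max_prompt_length:
        return tokens
    return tokens[-max_prompt_length:]
```

The key property is that the prompt is never reduced to nothing: even when the input is too long, the newest portion of the conversation is still sent to the model.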
If you want to maintain this functionality, then by all means, but from my point of view it's misleading to have a value that explicitly specifies the max prompt length but then adjusts it dynamically based on theoretical newly generated tokens, especially since it can lead to situations like mine where nothing gets sent to the backend even though there's tons of available context. I understand how it works, and now it makes sense from a logical perspective, but from a UX perspective it's terrible. Even on Discord I saw people dismissing the new Deepseek model entirely because all its responses in the chat interface were garbage: most people have max new tokens at 4000, so it was removing their prompt completely.
The HF text generation pipeline will still fail if the prompt length + max_new_tokens is greater than the model's context length, so that constraint must still be enforced. I am closing this in favor of #5439. If it doesn't work as expected and you still experience issues, please let me know.
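One way to enforce that invariant is to clamp max_new_tokens to whatever room the prompt leaves. This is a minimal sketch of the check being described, with illustrative names; it is not the code from #5439.

```python
def clamp_max_new_tokens(prompt_length, max_new_tokens, max_seq_len):
    """Ensure prompt_length + max_new_tokens never exceeds the model's
    context length by shrinking max_new_tokens when needed."""
    available = max_seq_len - prompt_length
    if available <= 0:
        # The prompt alone already fills (or overflows) the context window.
        raise ValueError("prompt fills the entire context; nothing can be generated")
    return min(max_new_tokens, available)
```

Clamping the generation budget (rather than silently deleting the prompt) keeps the user's input intact, which is the behavior this thread argues for.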
Seems reasonable. The only other change I would consider making is not defaulting the prompt truncation to the full model context; it doesn't seem logical to ever have it at that level, since you'll never want your whole context to be prompt anyway, you'd want room to generate. For my liking, I'd default the prompt truncation to some reasonable fraction of the full context (4/5? 3/4?). If that's agreeable to you, I can make a PR.
Checklist:
- get_max_prompt_length is returning truncate_prompt_length - max_new_tokens, and I'm not sure where the value for that subtraction comes from
- If you have a model with max_new_tokens of 4096, and the truncate length is also 4096, the max prompt length returned is 0
- This causes all messages to be removed from the prompt, resulting in a blank prompt sent to the model
This fixes issue #5374
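The failure mode described in the checklist can be reproduced in miniature. The two functions below contrast the pre-fix formula with what this PR proposes (using truncation length alone as the prompt cap); the names are illustrative, mirroring get_max_prompt_length rather than quoting chat.py verbatim.

```python
def buggy_max_prompt_length(truncation_length, max_new_tokens):
    # Pre-fix behavior: reserving max_new_tokens can collapse the
    # prompt budget to zero when the two settings are equal.
    return truncation_length - max_new_tokens


def fixed_max_prompt_length(truncation_length, max_new_tokens):
    # This PR: the truncation length alone caps the prompt.
    return truncation_length
```

With both settings at 4096, the old formula yields a prompt budget of 0 (hence the blank prompt in issue #5374), while the fixed version still allows the full 4096 tokens of prompt.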