
Fix max_length in chat.py to be just truncation length#5382

Closed
bartowski1182 wants to merge 1 commit into oobabooga:dev from bartowski1182:main

Conversation

@bartowski1182
Contributor


get_max_prompt_length returns truncate_prompt_length - max_new_tokens, and I'm not sure where the value being subtracted comes from

If you have a model with a max_new_tokens of 4096, and the truncate length is also 4096, then the max prompt length returned is 0

This causes all messages to be removed from the prompt, resulting in a blank prompt sent to the model

This fixes issue #5374

@bartowski1182
Contributor Author

bartowski1182 commented Jan 27, 2024

If wanting to maintain the idea that, if you're trying to exceed your capacity with new tokens you should truncate the prompt, I propose:

max_length = min(truncate_prompt_length, max_seq_len - min_length)

which would by default limit the prompt to the truncate length as appropriate, and in a worst case limit the prompt to whatever the model can support as configured minus the minimum amount of new tokens you want to generate
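The proposed formula can be sketched as a standalone function (the parameter names are taken from this discussion; the actual variables and call sites in chat.py may differ):

```python
def get_max_prompt_length(truncate_prompt_length, max_seq_len, min_length):
    """Sketch of the proposed fix, not the repository's actual function.

    Cap the prompt at the configured truncation length, but in the worst
    case leave room for at least min_length newly generated tokens.
    """
    return min(truncate_prompt_length, max_seq_len - min_length)
```

With min_length at its usual default of 0, this makes the max prompt length simply the truncation length (4096 in → 4096 out), instead of the current behavior where it collapses to 0.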

Ideally, though, it would be most appropriate to introduce new logic so that truncate_prompt_length + max_new_tokens always equals max_seq_len

That way, if you have a max sequence length of 4000, truncate_prompt_length would be set to something below that, like 3000 (3/4 seems like a decent starting place?), and max_new_tokens would be set to the remainder, 1000 (or 1/4)

Then at least in the UI it's obvious what's going on

Or, at minimum, truncate_prompt_length shouldn't default to the model's total context, because that leaves no room for generating new tokens; it should be set to some fraction of the total available context (but these are design decisions outside the scope of this fix)
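The fractional-split idea could look something like this (a hypothetical helper illustrating the proposal, with the 3/4 fraction from the comment above as the default):

```python
def default_generation_split(max_seq_len, prompt_fraction=0.75):
    """Hypothetical default split of the context window.

    Reserves prompt_fraction of the context for the (truncated) prompt
    and the remainder for newly generated tokens, so the two always sum
    to the model's max sequence length.
    """
    truncate_prompt_length = int(max_seq_len * prompt_fraction)
    max_new_tokens = max_seq_len - truncate_prompt_length
    return truncate_prompt_length, max_new_tokens
```

For a 4000-token context this yields the 3000/1000 split described above, and the invariant truncate_prompt_length + max_new_tokens == max_seq_len holds by construction.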

@bartowski1182 bartowski1182 changed the base branch from main to dev January 27, 2024 21:07
@oobabooga
Owner

truncate_prompt_length is automatically obtained from the model metadata and is set to the model's sequence length -- 4096 for llama-2, 32768 for mixtral, etc.

If the user wants to generate max_new_tokens tokens, then the prompt length can't exceed truncate_prompt_length - max_new_tokens, so that there is room for the new tokens.

It is true that chat completion inputs are being completely removed when they don't fit into the available space, which is an issue. The input should be left-truncated instead. #5066 aims to solve this problem, but in a way that I find too complicated. There is probably a simple (few lines) way to modify this loop

while len(messages) > 0 and get_encoded_length(prompt) > max_length:
to obtain this result.
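One possible few-line shape for that modification, sketched with hypothetical helpers (build_prompt, tokenizer, and a token-count function stand in for the repository's actual code): instead of popping the oldest message when the prompt is too long, trim only the overflowing tokens from its left edge.

```python
def fit_prompt(messages, build_prompt, get_encoded_length, max_length, tokenizer):
    """Sketch of left-truncation, not the repository's actual code.

    While the prompt is too long, shave the overflow off the left of the
    oldest message; drop the message only if it is smaller than the
    overflow itself.
    """
    prompt = build_prompt(messages)
    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
        overflow = get_encoded_length(prompt) - max_length
        tokens = tokenizer.encode(messages[0]['content'])
        if len(tokens) > overflow:
            # Keep the tail of the oldest message rather than losing it all.
            messages[0]['content'] = tokenizer.decode(tokens[overflow:])
        else:
            messages.pop(0)  # too small to absorb the overflow; drop it
        prompt = build_prompt(messages)
    return prompt
```

This keeps the existing while-loop structure and only changes what happens inside its body.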

@bartowski1182
Contributor Author

If you want to maintain this functionality, then by all means, but from my point of view it's misleading to have a value that explicitly specifies the max prompt length and then adjust it dynamically based on tokens that haven't been generated yet. It can lead to situations like mine, where nothing gets sent to the backend even though there's tons of available context

I understand how it works, and now it makes sense from a logical perspective, but from a UX perspective it's terrible. Even on Discord I saw people dismissing the new Deepseek model entirely because all of its responses in the chat interface were garbage: most people have max new tokens set to 4000, so their prompt was being removed completely

@oobabooga
Owner

The HF text generation pipeline will still fail if the prompt length + max_new_tokens is greater than the model's context length. This must be ensured.
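That constraint could be enforced with a small pre-generation guard (a hypothetical helper for illustration, not code from the repository):

```python
def clamp_max_new_tokens(prompt_tokens, max_new_tokens, context_length):
    """Sketch of a guard, assuming the check described above.

    HF generation fails when prompt length + max_new_tokens exceeds the
    model's context length, so shrink the request to whatever room is
    actually left, and reject prompts that already fill the window.
    """
    available = context_length - prompt_tokens
    if available <= 0:
        raise ValueError("prompt already fills the context window")
    return min(max_new_tokens, available)
```

The key point is that the clamp happens at generation time, on the actual prompt length, rather than by pre-shrinking the prompt for a worst-case max_new_tokens.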

I am closing this in favor of #5439. If it doesn't work as expected and you still experience issues please let me know.

@oobabooga oobabooga closed this Feb 5, 2024
@bartowski1182
Contributor Author

Seems reasonable. The only other change I would consider making is not defaulting the prompt truncation to the full model context; it doesn't seem logical to ever have it at that level, since you'll never want your whole context to be prompt anyway, you'd want room to generate. For my liking, I'd default the prompt truncation to some reasonable fraction of the full context (4/5? 3/4?). If that's agreeable to you, I can make a PR
