
Fix max_length in chat.py to be just truncation length#5382

Closed
bartowski1182 wants to merge 1 commit into oobabooga:dev from bartowski1182:main

Conversation

@bartowski1182
Contributor


get_max_prompt_length returns truncate_prompt_length - max_new_tokens, and I'm not sure where the value being subtracted comes from

If you have a model with a max_new_tokens of 4096, and the truncate length is also 4096, then the max prompt length returned is 0

This causes all messages to be removed from the prompt, resulting in a blank prompt sent to the model

This fixes issue #5374

@bartowski1182
Contributor Author

bartowski1182 commented Jan 27, 2024

If wanting to maintain the idea that, if you're trying to exceed your capacity with new tokens you should truncate the prompt, I propose:

max_length = min(truncate_prompt_length, max_seq_len - min_length)

which would by default limit the prompt to the truncate length as appropriate, and in a worst case limit the prompt to whatever the model can support as configured minus the minimum amount of new tokens you want to generate
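The proposed formula can be sketched as a standalone function (the parameter names are taken from this discussion; the actual variables and call sites in chat.py may differ):

```python
def get_max_prompt_length(truncate_prompt_length, max_seq_len, min_length):
    """Sketch of the proposed fix, not the repository's actual function.

    Cap the prompt at the configured truncation length, but in the worst
    case leave room for at least min_length newly generated tokens.
    """
    return min(truncate_prompt_length, max_seq_len - min_length)
```

With min_length at its usual default of 0, this makes the max prompt length simply the truncation length (4096 in → 4096 out), instead of the current behavior where it collapses to 0.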

Ideally, though, it would be most appropriate to introduce new logic so that truncate_prompt_length + max_new_tokens always equals max_seq_len

That way, if you have a max sequence length of 4000, truncate_prompt_length would be set to something below that, like 3000 (3/4 seems like a decent starting place?), and max_new_tokens would be set to the remainder, 1000 (or 1/4)

Then at least in the UI it's obvious what's going on

Or, at minimum, truncate_prompt_length shouldn't default to the model's total context, because that leaves no room for generating new tokens; it should be set to some fraction of the total available context (but these are design decisions outside the scope of this fix)
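The fractional-split idea could look something like this (a hypothetical helper illustrating the proposal, with the 3/4 fraction from the comment above as the default):

```python
def default_generation_split(max_seq_len, prompt_fraction=0.75):
    """Hypothetical default split of the context window.

    Reserves prompt_fraction of the context for the (truncated) prompt
    and the remainder for newly generated tokens, so the two always sum
    to the model's max sequence length.
    """
    truncate_prompt_length = int(max_seq_len * prompt_fraction)
    max_new_tokens = max_seq_len - truncate_prompt_length
    return truncate_prompt_length, max_new_tokens
```

For a 4000-token context this yields the 3000/1000 split described above, and the invariant truncate_prompt_length + max_new_tokens == max_seq_len holds by construction.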

@bartowski1182 bartowski1182 changed the base branch from main to dev January 27, 2024 21:07
@oobabooga
Owner

truncate_prompt_length is automatically obtained from the model metadata and is set to the model's sequence length -- 4096 for llama-2, 32768 for mixtral, etc.

If the user wants to generate max_new_tokens tokens, then the prompt length can't exceed truncate_prompt_length - max_new_tokens, so that there is room for the new tokens.

It is true that chat completion inputs are being completely removed when they don't fit into the available space, which is an issue. The input should be left-truncated instead. #5066 aims to solve this problem, but in a way that I find too complicated. There is probably a simple (few lines) way to modify this loop

while len(messages) > 0 and get_encoded_length(prompt) > max_length:
to obtain this result.
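One possible few-line shape for that modification, sketched with hypothetical helpers (build_prompt, tokenizer, and a token-count function stand in for the repository's actual code): instead of popping the oldest message when the prompt is too long, trim only the overflowing tokens from its left edge.

```python
def fit_prompt(messages, build_prompt, get_encoded_length, max_length, tokenizer):
    """Sketch of left-truncation, not the repository's actual code.

    While the prompt is too long, shave the overflow off the left of the
    oldest message; drop the message only if it is smaller than the
    overflow itself.
    """
    prompt = build_prompt(messages)
    while len(messages) > 0 and get_encoded_length(prompt) > max_length:
        overflow = get_encoded_length(prompt) - max_length
        tokens = tokenizer.encode(messages[0]['content'])
        if len(tokens) > overflow:
            # Keep the tail of the oldest message rather than losing it all.
            messages[0]['content'] = tokenizer.decode(tokens[overflow:])
        else:
            messages.pop(0)  # too small to absorb the overflow; drop it
        prompt = build_prompt(messages)
    return prompt
```

This keeps the existing while-loop structure and only changes what happens inside its body.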

@bartowski1182
Contributor Author

If you want to maintain this functionality, then by all means, but from my point of view it's misleading to have a value that explicitly specifies the max prompt length and then adjust it dynamically based on tokens that haven't been generated yet. It can lead to situations like mine, where nothing gets sent to the backend even though there's tons of available context

I understand how it works, and now it makes sense from a logical perspective, but from a UX perspective it's terrible. Even on Discord I saw people dismissing the new Deepseek model entirely because all of its responses in the chat interface were garbage: most people have max new tokens set to 4000, so their prompt was being removed completely

@oobabooga
Owner

The HF text generation pipeline will still fail if the prompt length + max_new_tokens is greater than the model's context length. This must be ensured.
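That constraint could be enforced with a small pre-generation guard (a hypothetical helper for illustration, not code from the repository):

```python
def clamp_max_new_tokens(prompt_tokens, max_new_tokens, context_length):
    """Sketch of a guard, assuming the check described above.

    HF generation fails when prompt length + max_new_tokens exceeds the
    model's context length, so shrink the request to whatever room is
    actually left, and reject prompts that already fill the window.
    """
    available = context_length - prompt_tokens
    if available <= 0:
        raise ValueError("prompt already fills the context window")
    return min(max_new_tokens, available)
```

The key point is that the clamp happens at generation time, on the actual prompt length, rather than by pre-shrinking the prompt for a worst-case max_new_tokens.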

I am closing this in favor of #5439. If it doesn't work as expected and you still experience issues please let me know.

@oobabooga oobabooga closed this Feb 5, 2024
@bartowski1182
Contributor Author

Seems reasonable. The only other change I would consider making is not defaulting the prompt truncation to the full model context; it doesn't seem logical to ever have it at that level, since you'll never want your whole context to be prompt anyway, you'd want room to generate. For my liking, I'd default the prompt truncation to some reasonable fraction of the full context (4/5? 3/4?). If that's agreeable to you, I can make a PR
