
Conversation


@zelinms zelinms commented Aug 1, 2024

Issue

Before this fix: if generation starts from a prompt shorter than 4K but the total sequence length exceeds 4K after generation, the tokens generated past the 4K boundary are garbage.

Root Cause

LongRoPE uses different scaling factors for sequences <=4K and >4K. When generation crosses the 4K switch point, the KV cache of the prefill tokens has already been computed with the short factors, but the new tokens are generated with the long factors. This inconsistency breaks generation, producing the garbage output described above.
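
For illustration, here is a minimal sketch of how a LongRoPE-style rotary embedding might pick its rescale factors per forward pass; the names are hypothetical, not the actual code touched by this PR. If `seq_len` is evaluated at each step, prefill below 4K and decode above 4K end up on different factor sets:

```python
import torch

ORIGINAL_MAX_POSITION = 4096  # the 4K switch point

def longrope_inv_freq(seq_len: int, head_dim: int, base: float = 10000.0,
                      short_factor=None, long_factor=None) -> torch.Tensor:
    # Factor choice depends on the *current* sequence length. Prefilling a
    # 3K prompt selects short_factor; once decoding pushes seq_len past 4K,
    # new positions select long_factor. The cached keys (short-scaled) then
    # disagree with the new queries (long-scaled) -- the bug described above.
    factors = long_factor if seq_len > ORIGINAL_MAX_POSITION else short_factor
    factors = torch.tensor(factors, dtype=torch.float32)
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (factors * base ** exponents)
```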

Solution

Pass prompt_length + max_generation_tokens to the model.
If this total is larger than 4K, use the long factors for all calculations on this sequence, prefill included.
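
A minimal sketch of the fix, under the assumption that the serving layer knows the generation budget up front; `select_rope_factors` and its parameters are illustrative names, not the PR's actual API:

```python
def select_rope_factors(prompt_length: int, max_generation_tokens: int,
                        short_factor, long_factor,
                        original_max_position: int = 4096):
    # Decide once per request from the *final* sequence length rather than
    # the current one. A request that will cross 4K uses long_factor from
    # token zero, so prefill KV entries and decoded tokens share one scaling.
    if prompt_length + max_generation_tokens > original_max_position:
        return long_factor
    return short_factor

# Example: a 3K prompt allowed to generate 2K tokens will cross the 4K
# boundary, so the long factors apply to every position, prefill included.
factors = select_rope_factors(prompt_length=3072, max_generation_tokens=2048,
                              short_factor=[1.0] * 48, long_factor=[4.0] * 48)
```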

@zelinms zelinms changed the title Fit cluster tests fp8 longrope Fix LongRoPE KV Cache Aug 1, 2024
@zelinms zelinms marked this pull request as ready for review August 1, 2024 05:05
@zelinms zelinms marked this pull request as draft August 1, 2024 10:45
@zelinms zelinms closed this Aug 1, 2024
@zelinms zelinms deleted the fit-cluster-tests-fp8-longrope branch August 1, 2024 14:24