Deduce the number of KV heads and head_size from the model #24400
Conversation
(Three resolved review threads on ...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp, now outdated.)
@CuriousPanCake, please provide the status for the same set of models that was mentioned in #24336.

Blocks ilya-lavrenov/vllm#33.
Deduce the number of KV heads and head_size from the model without relying on the HF config, and set the deduced values as the KV cache input dimensions. Applied HW-specific layout rearrangement based on the current expectations from CPU and GPU, preserving those deduced dimensions.
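As a rough illustration of the idea only (not the actual pattern code in state_management_pattern.cpp): the helper name and the assumed [batch, num_kv_heads, seq_len, head_size] Key layout below are hypothetical, and the sketch uses the OpenVINO Python API rather than the C++ transformation API.

```python
from openvino.runtime import Node

def deduce_kv_dims(key_input: Node):
    """Hypothetical helper: read num_kv_heads and head_size directly from the
    partial shape of the Key tensor feeding SDPA, instead of the HF config.
    Assumes the conventional [batch, num_kv_heads, seq_len, head_size] layout."""
    pshape = key_input.get_output_partial_shape(0)
    rank = pshape.rank
    assert rank.is_static and rank.get_length() == 4, "expected a 4D Key input"
    # Dimensions may still be dynamic; the transformation would propagate them
    # into the KV cache input shape as-is.
    return pshape[1], pshape[3]  # (num_kv_heads, head_size)
```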
Yep, I've run the tests for all the models available in the testing script and that's what I've got. Some models have given a gibberish answer to the prompt.
I'll attach the logs in a second.
RuntimeError: Check '(axis_range_min <= axis) && (axis <= axis_range_max)' failed at src/core/src/validation_util.cpp:386:
Concat Parameter axis 2 out of the tensor rank range [0, 0].
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8192). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

ValueError: Loading baichuan-inc/Baichuan2-7B-Base requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.

(Both are vLLM engine-init issues; see the sketch after the status list below.)

The same config issue
The same config issue
There's a good answer, but then some other config issue appears.
The same config issue
The same config issue
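For reference, both ValueErrors above are addressed by options passed when constructing the vLLM engine. A minimal sketch; the model name is taken from the log, and the numeric values are assumptions to tune per GPU:

```python
from vllm import LLM

llm = LLM(
    model="baichuan-inc/Baichuan2-7B-Base",
    trust_remote_code=True,       # required by the second ValueError
    max_model_len=8192,           # stay within what the KV cache can hold
    gpu_memory_utilization=0.95,  # or raise this instead of lowering max_model_len
)
```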
Tickets: