Port SDPA to PagedAttention transformation #24336
Conversation
Review comment on `src/bindings/python/src/pyopenvino/core/offline_transformations.cpp` (outdated, resolved)
@ilya-lavrenov @slyalin please take a look

@itikhono, have you compared the IRs produced by the Python and C++ paths for all models from the list?

We agreed to run this model list and compare the response generated for the dedicated prompt. Comparing IRs is a separate task.

As I can see, the Jenkins and ARM jobs also fail in other PRs and in the post-commit.
Deduce the number of KV heads and `head_size` from the model without relying on the HF config, and set the deduced values as the KV cache input dimensions. Applied HW-specific layout rearrangement based on the current expectations from CPU and GPU, preserving those deduced dimensions.

> @CuriousPanCake, please provide the status for the same set of models that was mentioned in #24336.

Yep, I've run the tests for all the models available in the testing script, and that's what I've got. Some models have given a gibberish answer to the prompt. I'll attach the logs in a second.

- [x] hf-internal-testing/tiny-random-BloomForCausalLM
- [x] hf-internal-testing/tiny-random-FalconForCausalLM
- [x] hf-internal-testing/tiny-random-Starcoder2ForCausalLM
- [x] hf-internal-testing/tiny-random-GPTJForCausalLM
- [x] hf-internal-testing/tiny-random-StableLmForCausalLM
- [x] hf-internal-testing/tiny-random-LlamaForCausalLM
- [x] hf-internal-testing/tiny-random-MistralForCausalLM
- [ ] hf-internal-testing/tiny-random-MptForCausalLM
  _RuntimeError: Check '(axis_range_min <= axis) && (axis <= axis_range_max)' failed at src/core/src/validation_util.cpp:386: Concat Parameter axis 2 out of the tensor rank range [0, 0]._
- [x] hf-internal-testing/tiny-random-OPTForCausalLM
- [x] hf-internal-testing/tiny-random-PhiForCausalLM
- [x] hf-internal-testing/tiny-random-StableLmForCausalLM
- [x] facebook/opt-125m (not a gibberish answer)
- [ ] Qwen/Qwen1.5-7B
  _ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8192). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine._
- [x] bigcode/starcoder2-7b (not a gibberish answer)
- [ ] baichuan-inc/Baichuan2-7B-Base
  _ValueError: Loading baichuan-inc/Baichuan2-7B-Base requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error._
- [ ] allenai/OLMo-7B
  _The same config issue_
- [ ] internlm/internlm2-7b
  _The same config issue_
- [x] stabilityai/stablelm-tuned-alpha-7b
- [x] EleutherAI/gpt-j-6b (not a gibberish answer)
- [x] openai-community/gpt2 (not a gibberish answer)
- [ ] google/gemma-7b
  _There's a good answer, but then some other config issue appears._
- [ ] Deci/DeciLM-7B
  _The same config issue_
- [ ] THUDM/chatglm3-6b
  _The same config issue_

### Tickets:
- CVS-140707
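The deduction described above can be sketched in pure Python. This sketch assumes the common 4-D layout of the value input feeding ScaledDotProductAttention, `[batch, num_kv_heads, seq_len, head_size]`; both the layout and the helper name are illustrative assumptions, not the actual C++ transformation code:

```python
def deduce_kv_dims(value_shape):
    """Deduce (num_kv_heads, head_size) from the static shape of the
    value input to SDPA, assuming [batch, num_kv_heads, seq_len, head_size].

    The real transformation reads these dimensions from the model graph
    instead of relying on the HF config.
    """
    if len(value_shape) != 4:
        raise ValueError(f"expected a 4-D value shape, got rank {len(value_shape)}")
    _, num_kv_heads, _, head_size = value_shape
    return num_kv_heads, head_size

# Example: a tiny Llama-like model with 4 KV heads of size 8.
num_kv_heads, head_size = deduce_kv_dims([1, 4, 1024, 8])
print(num_kv_heads, head_size)  # 4 8
```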
Details:
Ported the SDPA to PagedAttention transformation from Python to C++.
Related PRs:
#24127
#24177
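At a high level, the pass walks the model and replaces each ScaledDotProductAttention node with a PagedAttention node whose KV inputs come from per-layer cache parameters. A toy sketch of that substitution over a minimal graph representation (the node/dict structure here is invented for illustration; the real pass operates on `ov::Model` in C++):

```python
def sdpa_to_paged_attention(nodes):
    """Toy graph rewrite: replace every SDPA node with a PagedAttention
    node that additionally reads per-layer key/value cache inputs.

    `nodes` is a list of {"op": ..., "inputs": [...]} dicts; this mirrors
    only the shape of the real transformation, not its API.
    """
    rewritten = []
    layer = 0
    for node in nodes:
        if node["op"] == "ScaledDotProductAttention":
            rewritten.append({
                "op": "PagedAttentionExtension",
                # original Q, K, V inputs plus the new per-layer caches
                "inputs": node["inputs"] + [f"key_cache.{layer}",
                                            f"value_cache.{layer}"],
            })
            layer += 1
        else:
            rewritten.append(node)
    return rewritten

model = [
    {"op": "MatMul", "inputs": ["x", "w"]},
    {"op": "ScaledDotProductAttention", "inputs": ["q", "k", "v"]},
]
print(sdpa_to_paged_attention(model)[1]["op"])  # PagedAttentionExtension
```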
Tested model scope:
Issue: `RuntimeError: Check '(axis_range_min <= axis) && (axis <= axis_range_max)' failed at src/core/src/validation_util.cpp:386: Concat Parameter axis 2 out of the tensor rank range [0, 0].`
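The failing check corresponds to the usual axis-normalization rule: for a tensor of rank r, a (possibly negative) axis must lie in [-r, r-1], and for a rank-0 input the reported range degenerates to [0, 0], which is what the message above shows. A small Python re-implementation of that check, for illustration only:

```python
def validate_axis(axis, rank):
    """Mirror of the axis range check that produced the error above:
    for rank r > 0 the axis must satisfy -r <= axis <= r - 1; for a
    rank-0 tensor the range degenerates to [0, 0]."""
    axis_min, axis_max = (-rank, rank - 1) if rank > 0 else (0, 0)
    if not (axis_min <= axis <= axis_max):
        raise RuntimeError(
            f"Concat Parameter axis {axis} out of the tensor rank range "
            f"[{axis_min}, {axis_max}]."
        )
    # normalize a negative axis to its positive equivalent
    return axis if axis >= 0 else axis + rank

validate_axis(2, 4)    # valid: returns 2
# validate_axis(2, 0)  # raises, reproducing the error for a rank-0 input
```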
Tickets:
- CVS-140707