Port SDPA to PagedAttention transformation #24336
Conversation
Review comment on `src/bindings/python/src/pyopenvino/core/offline_transformations.cpp` (outdated, resolved)
@ilya-lavrenov @slyalin please take a look

@itikhono, have you compared the IRs produced by the Python and C++ paths for all models from the list?

We agreed to run this model list and compare the response generated for the dedicated prompt. Comparing IRs is a separate task.

As I can see, the Jenkins and ARM jobs also fail in other PRs and in the post-commit.
Deduce the number of KV heads and `head_size` from the model without relying on the HF config, and set the deduced values as the KV cache input dimensions. Applied HW-specific layout rearrangement based on the current expectations from CPU and GPU, preserving those deduced dimensions.

> @CuriousPanCake, please provide the status for the same set of models that was mentioned in #24336.

Yep, I've run the tests for all the models available in the testing script, and that's what I've got. Some models have given a gibberish answer to the prompt. I'll attach the logs in a second.

- [x] hf-internal-testing/tiny-random-BloomForCausalLM
- [x] hf-internal-testing/tiny-random-FalconForCausalLM
- [x] hf-internal-testing/tiny-random-Starcoder2ForCausalLM
- [x] hf-internal-testing/tiny-random-GPTJForCausalLM
- [x] hf-internal-testing/tiny-random-StableLmForCausalLM
- [x] hf-internal-testing/tiny-random-LlamaForCausalLM
- [x] hf-internal-testing/tiny-random-MistralForCausalLM
- [ ] hf-internal-testing/tiny-random-MptForCausalLM
  _RuntimeError: Check '(axis_range_min <= axis) && (axis <= axis_range_max)' failed at src/core/src/validation_util.cpp:386: Concat Parameter axis 2 out of the tensor rank range [0, 0]._
- [x] hf-internal-testing/tiny-random-OPTForCausalLM
- [x] hf-internal-testing/tiny-random-PhiForCausalLM
- [x] hf-internal-testing/tiny-random-StableLmForCausalLM
- [x] facebook/opt-125m (not a gibberish answer)
- [ ] Qwen/Qwen1.5-7B
  _ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (8192). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine._
- [x] bigcode/starcoder2-7b (not a gibberish answer)
- [ ] baichuan-inc/Baichuan2-7B-Base
  _ValueError: Loading baichuan-inc/Baichuan2-7B-Base requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error._
- [ ] allenai/OLMo-7B
  _The same config issue_
- [ ] internlm/internlm2-7b
  _The same config issue_
- [x] stabilityai/stablelm-tuned-alpha-7b
- [x] EleutherAI/gpt-j-6b (not a gibberish answer)
- [x] openai-community/gpt2 (not a gibberish answer)
- [ ] google/gemma-7b
  _There's a good answer, but then some other config issue appears._
- [ ] Deci/DeciLM-7B
  _The same config issue_
- [ ] THUDM/chatglm3-6b
  _The same config issue_

### Tickets:
- CVS-140707
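The deduction described above can be sketched in pure Python. This sketch assumes the common 4-D layout of the value input feeding ScaledDotProductAttention, `[batch, num_kv_heads, seq_len, head_size]`; both the layout and the helper name are illustrative assumptions, not the actual C++ transformation code:

```python
def deduce_kv_dims(value_shape):
    """Deduce (num_kv_heads, head_size) from the static shape of the
    value input to SDPA, assuming [batch, num_kv_heads, seq_len, head_size].

    The real transformation reads these dimensions from the model graph
    instead of relying on the HF config.
    """
    if len(value_shape) != 4:
        raise ValueError(f"expected a 4-D value shape, got rank {len(value_shape)}")
    _, num_kv_heads, _, head_size = value_shape
    return num_kv_heads, head_size

# Example: a tiny Llama-like model with 4 KV heads of size 8.
num_kv_heads, head_size = deduce_kv_dims([1, 4, 1024, 8])
print(num_kv_heads, head_size)  # 4 8
```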
Details:
Ported the SDPA to PagedAttention transformation from Python to C++.
Related PRs:
#24127
#24177
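At a high level, the pass walks the model and replaces each ScaledDotProductAttention node with a PagedAttention node whose KV inputs come from per-layer cache parameters. A toy sketch of that substitution over a minimal graph representation (the node/dict structure here is invented for illustration; the real pass operates on `ov::Model` in C++):

```python
def sdpa_to_paged_attention(nodes):
    """Toy graph rewrite: replace every SDPA node with a PagedAttention
    node that additionally reads per-layer key/value cache inputs.

    `nodes` is a list of {"op": ..., "inputs": [...]} dicts; this mirrors
    only the shape of the real transformation, not its API.
    """
    rewritten = []
    layer = 0
    for node in nodes:
        if node["op"] == "ScaledDotProductAttention":
            rewritten.append({
                "op": "PagedAttentionExtension",
                # original Q, K, V inputs plus the new per-layer caches
                "inputs": node["inputs"] + [f"key_cache.{layer}",
                                            f"value_cache.{layer}"],
            })
            layer += 1
        else:
            rewritten.append(node)
    return rewritten

model = [
    {"op": "MatMul", "inputs": ["x", "w"]},
    {"op": "ScaledDotProductAttention", "inputs": ["q", "k", "v"]},
]
print(sdpa_to_paged_attention(model)[1]["op"])  # PagedAttentionExtension
```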
Tested model scope:
Issue: `RuntimeError: Check '(axis_range_min <= axis) && (axis <= axis_range_max)' failed at src/core/src/validation_util.cpp:386: Concat Parameter axis 2 out of the tensor rank range [0, 0].`
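The failing check corresponds to the usual axis-normalization rule: for a tensor of rank r, a (possibly negative) axis must lie in [-r, r-1], and for a rank-0 input the reported range degenerates to [0, 0], which is what the message above shows. A small Python re-implementation of that check, for illustration only:

```python
def validate_axis(axis, rank):
    """Mirror of the axis range check that produced the error above:
    for rank r > 0 the axis must satisfy -r <= axis <= r - 1; for a
    rank-0 tensor the range degenerates to [0, 0]."""
    axis_min, axis_max = (-rank, rank - 1) if rank > 0 else (0, 0)
    if not (axis_min <= axis <= axis_max):
        raise RuntimeError(
            f"Concat Parameter axis {axis} out of the tensor rank range "
            f"[{axis_min}, {axis_max}]."
        )
    # normalize a negative axis to its positive equivalent
    return axis if axis >= 0 else axis + rank

validate_axis(2, 4)    # valid: returns 2
# validate_axis(2, 0)  # raises, reproducing the error for a rank-0 input
```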
Tickets:
- CVS-140707