🔴 VLM: compile compatibility #35724
Conversation
    if past_key_value is not None:
        if not isinstance(past_key_value, EncoderDecoderCache):
            curr_past_key_value = past_key_value
        else:
I don't know why, but the OPT model works as decoder-only while its attention is written as cross-attention (not used anywhere in the codebase). So we need to somehow keep BC while using the new DynamicCache.
As a workaround I simply added a check on the cache instance. Another possibility is to accept and return only the correct cache (self- or cross-attention), but that would mean all encoder-decoder models need a change, thus breaking BC.
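A minimal sketch of that instance-check workaround (not the exact code from the PR), assuming the usual EncoderDecoderCache layout with `self_attention_cache` / `cross_attention_cache` attributes and a hypothetical `is_cross_attention` flag:

```python
from transformers.cache_utils import EncoderDecoderCache

def select_cache(past_key_value, is_cross_attention: bool):
    """Sketch only: pick the cache the attention layer should update."""
    if past_key_value is None:
        return None
    if not isinstance(past_key_value, EncoderDecoderCache):
        # Decoder-only callers pass a plain DynamicCache; use it directly.
        return past_key_value
    # Encoder-decoder callers wrap a self-attention and a cross-attention cache.
    if is_cross_attention:
        return past_key_value.cross_attention_cache
    return past_key_value.self_attention_cache
```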
Very much copy-pasted from somewhere else, see this comment.
IMO, we can make our maintenance easier and assume no encoder-decoder stuff :) But don't spend more time here; eventually this will be rewritten with modular.
Ready for review. The failing test is flaky; otherwise everything is passing on my end, including the slow tests for compile/StaticCache.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
In the PR header we can read:
The failing test seems related to this PR :D
ArthurZucker left a comment
Very nice, but 🔴 there are a few breaking changes, so let's be careful!
And do you have some benchmarks / perf improvements to share (making sure reduce-overhead is working, etc.)?
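A rough sanity check along those lines could look like the sketch below (this is not the benchmark script used for the PR; the checkpoint, token count, and warm-up loop are illustrative assumptions):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint, not from this PR
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

def timed_generate():
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64, do_sample=False, cache_implementation="static")
    torch.cuda.synchronize()
    return time.perf_counter() - start

eager_time = timed_generate()

# reduce-overhead uses CUDA graphs to cut per-step launch overhead during decoding.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
for _ in range(3):  # the first compiled calls pay the compilation cost
    compiled_time = timed_generate()

print(f"eager: {eager_time:.2f}s  compiled: {compiled_time:.2f}s")
```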
    query = query.reshape(batch_size * num_attention_heads, query_length, attn_head_size)
    key = key.reshape(batch_size * num_attention_heads, key_length, attn_head_size)
yeah, though calling .contiguous() works as well
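For context, a minimal sketch of the equivalence being discussed (the shapes are made up; the transpose just produces a non-contiguous tensor like the one in the attention code):

```python
import torch

batch_size, num_attention_heads, query_length, attn_head_size = 2, 4, 8, 16

# Non-contiguous tensor, like the output of the transpose in the attention code.
query = torch.randn(batch_size, query_length, num_attention_heads, attn_head_size).transpose(1, 2)

# .reshape handles non-contiguous inputs, copying only when it has to ...
q_reshape = query.reshape(batch_size * num_attention_heads, query_length, attn_head_size)

# ... which here gives the same result as forcing contiguity and calling .view.
q_view = query.contiguous().view(batch_size * num_attention_heads, query_length, attn_head_size)

assert torch.equal(q_reshape, q_view)
```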
Soooo, here is the correct eval with llava-ov-7b. One thing to note is that VLMs will not benefit from torch.compile when the context length is very high, as in the case of videos or high-resolution images. That's the reason vanilla LLaVA got a speedup from the first try, while llava-ov needed a few runs before I noticed how many input tokens we had. I will run make fixup and merge this, because VLMs with fewer tokens per image do get speedups.
Since the latest transformers release of v4.49.0, X-LoRA tests are broken. The PR that caused it was: huggingface/transformers#35724. For the time being, let's skip the X-LoRA tests if this transformers version is detected and also advise users against using X-LoRA with this transformers version.
X-LoRA tests started failing after this transformers PR: huggingface/transformers#35724 The solution appears to be to disable caching completely when calling generate on the X-LoRA model. This also makes some previously xfail-ing tests pass. I tested this locally with transformers checked out before and after the mentioned PR and the tests pass in both circumstances. I also tested changing the base model from "facebook/opt-125m" to "trl-internal-testing/tiny-random-LlamaForCausalLM" and the tests passed with both. Also, mark X-LoRA save_load_function test as flaky. It was marked as xfail beforehand, but it is in fact just flaky.
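A minimal sketch of that workaround (illustrative only; the X-LoRA wrapping itself is omitted and the tiny test checkpoint mentioned above is used as a stand-in): disable the KV cache via the standard `use_cache` flag when generating.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "trl-internal-testing/tiny-random-LlamaForCausalLM"  # tiny test model mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt")
# use_cache=False skips the KV-cache path entirely, which is the workaround
# applied to the X-LoRA tests after this PR changed the cache format.
out = model.generate(**inputs, max_new_tokens=5, use_cache=False)
print(tokenizer.decode(out[0]))
```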
* llavas
* add more models
* fix `compile_forward` test for all models
* fix copies
* make style
* also doesn't support cache class
* fix some tests
* not copied from
* ci green?
* fix tests
* fix copies
* fix tests
* check with `numel` and remove `item`
* fix copies
* fix copies
* Update src/transformers/models/cohere2/modeling_cohere2.py (Co-authored-by: Arthur <[email protected]>)
* opt remove cross attn
* gemma2
* fixup
* fixup
* fix newly added test
* maybe fixed?
* green please?

Co-authored-by: Arthur <[email protected]>

What does this PR do?
As per the title, this PR adds flags in VLMs where needed, removes test skips, and makes sure VLMs are compile compatible. For BLIP models it also adds the new cache format in OPT, which is one of the backbones. Now all official BLIP models can support StaticCache and thus compile.
NOTE: `-k compile_forward` and `-k static_` were run for all models and are passing.

How to run compile and export for VLMs:
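The original snippet from the PR description isn't reproduced here; a minimal sketch of the compile part, assuming an example LLaVA checkpoint and a dummy image, could look like:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

image = Image.new("RGB", (336, 336))  # dummy image, as in the benchmark below
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Compile the forward pass and generate with a static cache so shapes stay fixed.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
out = model.generate(**inputs, max_new_tokens=30, cache_implementation="static")
print(processor.decode(out[0], skip_special_tokens=True))
```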
Benchmark on "llava-hf/llava-onevision-qwen2-7b-ov-hf" using the same script we use for LLMs, plus a dummy image in the inputs.
Fixes #29891