
Conversation

@zucchini-nlp (Member) commented Jan 16, 2025

What does this PR do?

As per title, this PR adds flags in VLMs where needed, removes test skips, and makes sure VLMs are compile-compatible. For BLIP models it also adds the new cache format in OPT, which is one of the backbones. Now all official BLIP models can support static cache and thus compile.

NOTE:

  • Tests with -k compile_forward and -k static_ were run for all models and are passing
  • Regarding executorch, which I also checked: the model can be exported and run a forward pass, but generation won't work and probably needs something similar to what we do when exporting VLMs to ONNX. I still need to dig more into that in later PRs
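
For the BLIP family mentioned in the description, static cache (and therefore compile) now also works directly through generate. A minimal sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint (any official BLIP-2 checkpoint should behave the same):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

blip_id = "Salesforce/blip2-opt-2.7b"
blip_processor = AutoProcessor.from_pretrained(blip_id)
blip_model = Blip2ForConditionalGeneration.from_pretrained(
    blip_id,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
blip_inputs = blip_processor(
    images=image, text="Question: what is in the image? Answer:", return_tensors="pt"
).to("cuda:0", torch.float16)

# Static cache compiles the forward for the decoding phase, as in the LLaVA example below
out = blip_model.generate(**blip_inputs, max_new_tokens=20, cache_implementation="static")
print(blip_processor.decode(out[0], skip_special_tokens=True))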

How to run compile and export for VLMs:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers.generation import GenerationConfig
from transformers.cache_utils import StaticCache
from transformers.integrations.executorch import (
    TorchExportableModuleWithStaticCache,
    convert_and_export_with_cache,
)

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype="float16", 
    device_map="cuda:0",
)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Run with static cache which compiles the forward in decoding phase for you
output = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(processor.decode(output[0][2:], skip_special_tokens=True))



# Try to export with `torch.export`. NOTE: TorchExportableModuleWithStaticCache is not ready for VLMs,
# and as mentioned above, VLMs might need to export 3 different modules as in ONNX: one for the text embedding,
# one for the vision backbone, and one for the LM backbone with simple token-by-token decoding
max_generation_length = 1000
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda:0",
    torch_dtype="float16",
    attn_implementation="sdpa",
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        max_length=max_generation_length,
        cache_config={
            "batch_size": 1,
            "max_cache_len": max_generation_length,
        },
    ),
)

# Adapted from `TorchExportableModuleWithStaticCache` with minor changes
class TorchExportableModuleWithStaticCacheForVLM(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.static_cache = StaticCache(
            config=self.model.config.get_text_config(),
            batch_size=self.model.generation_config.cache_config.batch_size,
            max_cache_len=self.model.generation_config.cache_config.max_cache_len,
            dtype=self.model.dtype,
            device=model.device,
        )
        self.is_causal = any(("CausalLM" in arch or "ConditionalGeneration" in arch) for arch in self.model.config.architectures)
        if self.is_causal:
            causal_mask = torch.triu(
                torch.full(
                    (
                        self.model.generation_config.cache_config.batch_size,
                        1,
                        self.static_cache.max_cache_len,
                        self.static_cache.max_cache_len
                    ),
                    fill_value=torch.finfo(self.model.dtype).min,
                    dtype=self.model.dtype,
                    device=model.device,
                ),
                diagonal=1,  # mask only strictly-future positions so each token can attend to itself
            )
            self.register_buffer("mask", causal_mask, persistent=False)

    def forward(
        self,
        input_ids: torch.Tensor,
        cache_position: torch.Tensor,
        pixel_values: torch.Tensor,
    ):
        _, seqlen = input_ids.shape
        attn_mask = self.mask[:, :, cache_position, :] if self.is_causal else None
        outs = self.model(
            input_ids=input_ids,
            attention_mask=attn_mask,
            position_ids=cache_position.unsqueeze(0),
            pixel_values=pixel_values,
            cache_position=cache_position,
            past_key_values=self.static_cache,
            use_cache=True,
        )
        return outs.logits

cache_position = torch.arange(inputs.input_ids.shape[1], dtype=torch.long, device=model.device)
export_inputs = {"input_ids": inputs.input_ids, "cache_position": cache_position, "pixel_values": inputs.pixel_values}

with torch.no_grad():
    exported_program = torch.export.export(
        TorchExportableModuleWithStaticCacheForVLM(model),
        args=(),
        kwargs=export_inputs,
        strict=True,
    )

torch.export.save(exported_program, "exported_llava.pt2")
exported_program = torch.export.load("exported_llava.pt2")
out = exported_program.module().forward(
    input_ids=inputs.input_ids,
    pixel_values=inputs.pixel_values,
    cache_position=cache_position,
)

Benchmark on "llava-hf/llava-onevision-qwen2-7b-ov-hf" using the same script we use for LLMs, plus a dummy image in the inputs:

[benchmark results image]
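
The benchmark script itself is not included here; below is a minimal sketch of the kind of timing comparison it performs, reusing model_id, processor and inputs from the example above (the warm-up loop, token counts and printed format are assumptions):

import time
import torch
from transformers import LlavaForConditionalGeneration

# Reload the plain model, since the export example above overrides its generation config
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="float16", device_map="cuda:0"
)

def timed_generate(**kwargs):
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False, **kwargs)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

_, dynamic_time = timed_generate()  # dynamic cache, eager forward
for _ in range(3):  # repeat so compilation cost is amortized
    _, static_time = timed_generate(cache_implementation="static")  # static cache, compiled decode
print(f"dynamic: {dynamic_time:.2f}s | static (compiled decode): {static_time:.2f}s")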

Fixes #29891

Comment on lines 162 to 165
if past_key_value is not None:
    if not isinstance(past_key_value, EncoderDecoderCache):
        curr_past_key_value = past_key_value
    else:
Member Author

I don't know why, but the OPT model works as decoder-only while its attention is written as cross-attention (not used anywhere in the codebase). So we need to somehow keep BC while using the new DynamicCache.

As a workaround I simply added a check on the cache instance. Another possibility is to accept and return only the correct cache (self- or cross-attention), but that means all encoder-decoder models would need a change, thus breaking BC.
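
For illustration, a sketch of how the cache-instance check described above could look (the is_cross_attention flag and the EncoderDecoderCache sub-cache attributes follow existing transformers naming, but this is not a verbatim copy of the final diff):

if past_key_value is not None:
    if not isinstance(past_key_value, EncoderDecoderCache):
        # Plain DynamicCache/StaticCache passed in: use it directly
        curr_past_key_value = past_key_value
    else:
        # Legacy encoder-decoder usage: pick the matching sub-cache
        if is_cross_attention:
            curr_past_key_value = past_key_value.cross_attention_cache
        else:
            curr_past_key_value = past_key_value.self_attention_cache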

Contributor

Very much copy-paste from somewhere else, see this comment

IMO, we can make our maintenance easier and assume no encoder-decoder stuff :) But don't spend more time here, eventually this will be rewritten with modular

@zucchini-nlp zucchini-nlp changed the title [WIP] VLM: compile compatibility VLM: compile compatibility Jan 16, 2025
@zucchini-nlp zucchini-nlp changed the title VLM: compile compatibility [WIP] VLM: compile compatibility Jan 16, 2025
@zucchini-nlp zucchini-nlp changed the title [WIP] VLM: compile compatibility VLM: compile compatibility Jan 17, 2025
@zucchini-nlp (Member Author)

Ready for review. The failing test is flaky; otherwise everything is passing on my end, including slow tests for compile/StaticCache.

@zucchini-nlp zucchini-nlp requested review from gante and removed request for Rocketknight1 and molbap January 30, 2025 14:22
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Contributor) commented Jan 30, 2025

@zucchini-nlp

In the PR header we can read

all VLMs have dynamic control in prepare_inputs_for_generation and thus skip test_compile_forward which compiles the model for pre-fill phase. But the test for decoding stage compile is green therefore I'm leaving the flag as True

test_compile_forward was the old name for test_generate_compile_model_forward if I'm not mistaken, back when it also did end-to-end compilation tests. We no longer have the end-to-end compilation tests, so this part of the PR header is no longer accurate, correct?

@gante (Contributor) commented Jan 30, 2025

The failing test seems related to this PR :D

@ArthurZucker (Collaborator) left a comment

Very nice but 🔴 there are a few breaking changes so let's be careful!
And do you have some benches / perf improvements to share (making sure reduce-overhead is working, etc.)?

Comment on lines 190 to 191
query = query.reshape(batch_size * num_attention_heads, query_length, attn_head_size)
key = key.reshape(batch_size * num_attention_heads, key_length, attn_head_size)
Collaborator

yeah, though calling .contiguous() works as well
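
For reference, a rough equivalent of the .contiguous() variant mentioned here (same result as the .reshape calls in the diff above):

query = query.contiguous().view(batch_size * num_attention_heads, query_length, attn_head_size)
key = key.contiguous().view(batch_size * num_attention_heads, key_length, attn_head_size)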

@zucchini-nlp zucchini-nlp changed the title VLM: compile compatibility 🔴 VLM: compile compatibility Feb 10, 2025
@zucchini-nlp (Member Author)

[benchmark results image]

Soooo, here is the correct eval with llava-ov-7b. One thing to note is that VLMs will not benefit from torch.compile when the context length is very long, as in the case of videos or high-resolution images. That's the reason vanilla LLaVA got a speedup on the first try, while llava-ov needed a few runs before I noticed how many input tokens we had.

I will run make fixup and merge this, because VLMs with fewer tokens per image do get speedups.

@zucchini-nlp zucchini-nlp merged commit 0c78ef6 into huggingface:main Feb 14, 2025
25 checks passed
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Feb 18, 2025
Since the latest transformers release of v4.49.0, X-LoRA tests are
broken. The PR that caused it was:

huggingface/transformers#35724

For the time being, let's skip the X-LoRA tests if this transformers
version is detected and also advise users against using X-LoRA with this
transformers version.
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Feb 18, 2025
X-LoRA tests started failing after this transformers PR:

huggingface/transformers#35724

The solution appears to be to disable caching completely when calling
generate on the X-LoRA model. This also makes some previously xfail-ing
tests pass.

I tested this locally with transformers checked out before and after the
mentioned PR and the tests pass in both circumstances. I also tested
changing the base model from "facebook/opt-125m" to
"trl-internal-testing/tiny-random-LlamaForCausalLM" and the tests passed
with both.
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Feb 18, 2025
X-LoRA tests started failing after this transformers PR:

huggingface/transformers#35724

The solution appears to be to disable caching completely when calling
generate on the X-LoRA model. This also makes some previously xfail-ing
tests pass.

I tested this locally with transformers checked out before and after the
mentioned PR and the tests pass in both circumstances. I also tested
changing the base model from "facebook/opt-125m" to
"trl-internal-testing/tiny-random-LlamaForCausalLM" and the tests passed
with both.

Also, mark X-LoRA save_load_function test as flaky.
It was marked as xfail beforehand, but it is in fact just flaky.
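
A minimal sketch of the workaround the commit describes (the xlora_model and inputs names are hypothetical):

# Disable the KV cache entirely when calling generate on an X-LoRA model
output = xlora_model.generate(**inputs, max_new_tokens=20, use_cache=False)
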
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request Feb 21, 2025
* llavas

* add more models

* fix `compile_forward` test for all models

* fix copies

* make style

* also doesn't support cache class

* fix some tests

* not copied from

* ci green?

* fix tests

* fix copies

* fix tests

* check with `numel` and remove `item`

* fix copies

* fix copies

* Update src/transformers/models/cohere2/modeling_cohere2.py

Co-authored-by: Arthur <[email protected]>

* opt remove cross attn

* gemma2

* fixup

* fixup

* fix newly added test

* maybe fixed?

* green please?

---------

Co-authored-by: Arthur <[email protected]>
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025
X-LoRA tests started failing after this transformers PR:

huggingface/transformers#35724

The solution appears to be to disable caching completely when calling
generate on the X-LoRA model. This also makes some previously xfail-ing
tests pass.

I tested this locally with transformers checked out before and after the
mentioned PR and the tests pass in both circumstances. I also tested
changing the base model from "facebook/opt-125m" to
"trl-internal-testing/tiny-random-LlamaForCausalLM" and the tests passed
with both.

Also, mark X-LoRA save_load_function test as flaky.
It was marked as xfail beforehand, but it is in fact just flaky.
Successfully merging this pull request may close these issues:

LLaVA torch.compile implementation