Conversation

@ugolowic
Collaborator
This commit fixes errors in the video-comprehension example involving the VideoLLaVA and CLIP models, caused by the latest transformers upgrade.

  • Adapt modeling_video_llava to the transformers upgrade by adding GaudiVideoLlavaModel.forward
  • Fix mismatched matrix sizes in CLIP attention.
  • Fix access to non-existing attribute in GaudiGenerationMixin.

This fixes the failing README example here: https://github.com/huggingface/optimum-habana/tree/main/examples/video-comprehension

@github-actions

The code quality check failed; please run make style.

@ugolowic ugolowic force-pushed the video-comprehension-fix branch from 5572c8f to f7cd918 Compare October 20, 2025 10:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@astachowiczhabana
Collaborator

LGTM

@karol-brejna-i karol-brejna-i self-assigned this Oct 21, 2025
elif cache_position is None:
    past_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
    cache_position = torch.arange(past_length, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
model_inputs["cache_position"] = cache_position
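The quoted snippet can be illustrated in isolation. The sketch below (with a hypothetical helper name, not the actual optimum-habana code) shows how cache_position is derived: past_length is the number of tokens already held in the legacy-format KV cache, and the resulting range covers only the positions of the not-yet-cached tokens.

```python
import torch

def compute_cache_position(past_key_values, input_ids):
    # Legacy cache format: past_key_values[layer][0] is the key tensor of
    # shape (batch, heads, cached_seq_len, head_dim), so dim 2 is the
    # number of tokens already cached.
    past_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
    # Positions of the tokens that still need to be processed.
    return torch.arange(past_length, input_ids.shape[1], dtype=torch.long, device=input_ids.device)

# Prefill: no cache yet, 5 input tokens -> positions 0..4
print(compute_cache_position(None, torch.zeros((1, 5), dtype=torch.long)).tolist())
# [0, 1, 2, 3, 4]

# 3 tokens already cached, 5 tokens total -> positions 3..4
pkv = ((torch.zeros(1, 1, 3, 4), torch.zeros(1, 1, 3, 4)),)
print(compute_cache_position(pkv, torch.zeros((1, 5), dtype=torch.long)).tolist())
# [3, 4]
```

The fallback exists because newer transformers releases expect cache_position to always be present in model_inputs; the reviewer's concern below is whether synthesizing it this way is safe for all models.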
Collaborator

Are we sure this doesn't break generation for other models?

Collaborator Author

Good point.
This change is in line with the transformers upgrade, but I'm running the text generation slow tests to confirm. They look fine so far; I'll update when they're done.

Collaborator

Hi @ugolowic, can you submit the results?

Collaborator Author

I ran the text generation slow tests on the main branch and on this change rebased onto main; the results are exactly the same.

This commit fixes errors in the video-comprehension example
involving the VideoLLaVA and CLIP models, caused by the latest
transformers upgrade.
* Adapt modeling_video_llava to the transformers upgrade
  by adding GaudiVideoLlavaModel.forward
* Fix mismatched matrix sizes in CLIP attention.
* Fix access to non-existing attribute in GaudiGenerationMixin.

Signed-off-by: Urszula <[email protected]>
@ugolowic ugolowic force-pushed the video-comprehension-fix branch from f7cd918 to 316601d Compare November 5, 2025 08:55

@regisss regisss left a comment


LGTM!

@regisss regisss merged commit 33536f1 into huggingface:main Nov 6, 2025
3 of 5 checks passed
5 participants