
Conversation


@ReinforcedKnowledge ReinforcedKnowledge commented Oct 30, 2025

TL;DR

When loading Qwen2VL image processor with from_pretrained(max_pixels=X), the parameter is stored as an attribute but not synchronized with the size dict that controls actual image resizing behavior. This causes images to be resized using default max_pixels=16,777,216 instead of the user's specified value, resulting in excessive token counts.

This fix implements max_pixels and min_pixels as properties that automatically synchronize with the size dict when set via setattr() during from_pretrained() loading. The properties ensure size['longest_edge'] and size['shortest_edge'] stay in sync with max_pixels and min_pixels attributes respectively.

The property-based approach is safe and localized to Qwen2VL, avoiding potential impacts on other image processors, but this issue might concern other processors as well, and I'm not totally a fan of the property lookup.

We could argue there is no need for two sources of truth for the same thing (size['longest_edge'] and max_pixels), but we could also argue that the base class should, or might, be smarter about user kwargs that don't synchronize with processor configs through its __init__ logic.

Since it's my first PR / issue to transformers, I thought I'd give a more detailed explanation:

Issue

When loading Qwen2VL/Qwen3VL image processor with:

processor = Qwen2VLImageProcessorFast.from_pretrained(
    'Qwen/Qwen3-VL-2B-Instruct',
    max_pixels=200_000
)

The expected behavior:

  • processor.max_pixels = 200,000
  • processor.size['longest_edge'] = 200,000

That way, images get resized to ~200k pixels, giving us around ~195 tokens.

Actual behavior (before fix):

  • processor.max_pixels = 200,000 which is correct
  • processor.size['longest_edge'] = 16,777,216 which is incorrect (loaded from config)
  • Images are resized to the default ~16M pixels, giving around ~3,844 tokens

Root cause analysis

The Qwen2VL processor uses dual representation for pixel constraints:

  1. self.max_pixels: an attribute
  2. self.size['longest_edge']: the actual value used during image processing

The __init__ method is designed to keep these synchronized:

def __init__(self, **kwargs):
    max_pixels = kwargs.pop("max_pixels", None)
    # ...
    if max_pixels is not None:
        size["longest_edge"] = max_pixels

However, when loading via from_pretrained() we lose this synchronization because:

  1. ImageProcessingMixin.from_dict() calls [cls(**config_dict)](https://github.com/huggingface/transformers/blob/02c324f43fe0ef5d484e846417e5f3bf4484524c/src/transformers/image_processing_base.py#L374) without max_pixels
  2. Synchronization logic skips (max_pixels=None)
  3. Later, from_dict() sets max_pixels via setattr(), which is too late: it only sets the attribute, without updating size

As a result, max_pixels is stored correctly but size['longest_edge'] is never updated.
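The desynchronization can be reproduced with a minimal, self-contained sketch (this is not the actual transformers code; the class and default value are hypothetical stand-ins mirroring the description above):

```python
# Minimal sketch of the failure mode: __init__ only syncs size when
# max_pixels is passed to it, but from_dict() instead calls __init__
# without max_pixels and sets the attribute afterwards via setattr().
class BuggyProcessorSketch:
    DEFAULT_LONGEST_EDGE = 16_777_216

    def __init__(self, size=None, max_pixels=None):
        self.size = dict(size) if size else {"longest_edge": self.DEFAULT_LONGEST_EDGE}
        if max_pixels is not None:  # skipped: from_dict() passes no max_pixels
            self.size["longest_edge"] = max_pixels
        self.max_pixels = max_pixels


proc = BuggyProcessorSketch()          # cls(**config_dict), without max_pixels
setattr(proc, "max_pixels", 200_000)   # from_dict()'s late setattr()
print(proc.max_pixels)                 # 200000 (looks correct)
print(proc.size["longest_edge"])       # 16777216 (never updated)
```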

Other fixes I considered

The "obvious" fix would be to add special handling in image_processing_base.py:

# In ImageProcessingMixin.from_dict()
if "max_pixels" in kwargs:
    image_processor_dict["max_pixels"] = kwargs.pop("max_pixels")

This would forward max_pixels to __init__() before the synchronization logic runs, and it would spare us from tracking every past and future processor with logic similar to Qwen2VL's image processor. But the issues are that:

  1. Not all image processors may accept max_pixels in __init__(). If a processor's __init__ doesn't have **kwargs, passing max_pixels would presumably raise a TypeError, and we can't guarantee all processors handle this parameter.
  2. Changes to the base class affect all processors, so they would require extensive testing across all image processor implementations and carry a higher risk of unintended side effects
  3. The existing pattern is incomplete: current code only forwards params already in config:
if "size" in kwargs and "size" in image_processor_dict:  # requires the parameter to be both in the user kwargs and the config
    image_processor_dict["size"] = kwargs.pop("size")

But Qwen2VL's config doesn't have max_pixels in the dict, so this pattern wouldn't help.
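A small self-contained simulation of why the existing pattern doesn't help here (hypothetical dict literals, not the actual from_dict() code): the kwarg is only forwarded when the config dict already contains the same key, and Qwen2VL's config has no max_pixels entry.

```python
# Simulated config as loaded from preprocessor_config.json: it has
# "size" but no "max_pixels" key.
image_processor_dict = {"size": {"longest_edge": 16_777_216}}
kwargs = {"max_pixels": 200_000}

# The existing forwarding pattern, generalized to max_pixels: it
# requires the key to exist in BOTH the user kwargs and the config.
for key in ("size", "max_pixels"):
    if key in kwargs and key in image_processor_dict:
        image_processor_dict[key] = kwargs.pop(key)

print("max_pixels" in image_processor_dict)  # False: never forwarded to __init__
```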

That's why I went with properties instead: minimal scope, and max_pixels (or min_pixels) and size['longest_edge'] (or size['shortest_edge']) stay synchronized whether the class is instantiated directly or via from_pretrained(), and whether or not the user later changes the attribute directly.

This PR implements a safer, processor-specific fix using Python properties.
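As a rough sketch of the property-based approach (simplified, with hypothetical names; the real implementation lives in the Qwen2VL image processor), the setter keeps size['longest_edge'] in sync no matter when or how max_pixels is assigned:

```python
class FixedProcessorSketch:
    def __init__(self, size=None, max_pixels=None):
        self.size = dict(size) if size else {"longest_edge": 16_777_216}
        if max_pixels is not None:
            self.max_pixels = max_pixels  # routed through the property setter

    @property
    def max_pixels(self):
        # Single source of truth: read straight from the size dict.
        return self.size["longest_edge"]

    @max_pixels.setter
    def max_pixels(self, value):
        # Any assignment, including from_dict()'s late setattr(),
        # updates the value actually used for resizing.
        self.size["longest_edge"] = value


proc = FixedProcessorSketch()          # cls(**config_dict), without max_pixels
setattr(proc, "max_pixels", 200_000)   # from_dict()'s late setattr()
print(proc.max_pixels)                 # 200000
print(proc.size["longest_edge"])       # 200000: now in sync
```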

Future Considerations

While this PR fixes the issue for Qwen2VL/Qwen3VL, the underlying issue in from_dict() could affect other processors with similar synchronization patterns.

Related issue

#41955

ReinforcedKnowledge and others added 2 commits October 30, 2025 16:35
…retrained()

@ReinforcedKnowledge
Author

ReinforcedKnowledge commented Nov 3, 2025

Hi everyone, if anyone is reviewing this PR or wants to review it, do not hesitate to tell me if there's anything I can do to get the rest of the tests to go through (first time contributing to HF but I made sure that the bug exists and is solved with my PR).

@github-actions
Contributor

github-actions bot commented Nov 3, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_vl

@Rocketknight1
Member

cc @yonigozlan @molbap

Member

@yonigozlan yonigozlan left a comment


Hello @ReinforcedKnowledge ! Thanks a lot for raising this issue, I was able to reproduce it, but the issue actually comes from bad logic in from_pretrained, and can affect other image processors, so I opened another PR to fix it here #41997.
Thanks again for flagging this!

@ReinforcedKnowledge
Author

@yonigozlan Thank you for your response! I wanted to update the from_pretrained as I explained in the body of the PR but doing so impacts a lot of stuff and I didn't want to bear that responsibility as it's my first contribution. I have one quick question but I'll ask it on your PR 😄

@yonigozlan
Member

Sorry I missed that you had already pointed out the issue with from_pretrained!

@ReinforcedKnowledge
Author

It's okay! Don't hesitate to close this PR when you merge yours if needed

@yonigozlan
Member

Closing this as the fix PR is merged :)

@yonigozlan yonigozlan closed this Nov 4, 2025