@BlueDyee BlueDyee commented Nov 5, 2025

Problem

This PR fixes two critical issues that prevent inference at resolutions different from the pretrained model's training resolution (432x432):

1. Dependency Conflict in Environment Setup

Issue: Installing flash-attn overwrites the pinned PyTorch version

  • torch==2.5.1 is installed first
  • installing flash-attn then replaces it with torch==2.9.0, which is incompatible

Root Cause: flash-attn has strict PyTorch version requirements that conflict with the specified version.

Solution: Update requirements.txt with compatible versions and installation order.
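
A minimal sketch of the intended installation order (the exact pinned versions belong in this PR's requirements.txt; `--no-build-isolation` is the flag flash-attn's own install instructions recommend for building against an already-installed torch):

```shell
# Install and pin PyTorch first
pip install torch==2.5.1

# Then build flash-attn against the torch already in the environment,
# so pip's isolated build does not pull in a different torch version
pip install flash-attn --no-build-isolation
```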

2. Position Embedding Dimension Mismatch

Issue: RuntimeError when using resolutions different from pretrained model

RuntimeError: The size of tensor a (1024) must match the size of tensor b (729) at non-singleton dimension 1

Root Cause:

  • Pretrained model config (show-o2-1.5B-HQ/config.json) contains fixed resolution settings:
    "image_latent_height": 27,
    "image_latent_width": 27,
  • The model's __init__ registers a position-ID buffer sized from these hardcoded values (27×27 = 729 tokens):
    self.register_buffer("image_position_ids",
                       torch.arange(image_latent_height * image_latent_width).expand((1, -1)),
                       persistent=False)
    
  • When inferring at 1024×1024 resolution, actual input has 64×64=4096 tokens
  • The image_position_ids buffer (initialized at Line 81) remains at 729 tokens
  • Position embedding fails when trying to add embeddings of different sizes

Solution:
Add dynamic position ID recreation in the forward() method to match actual input shape:

# Check if actual input shape matches buffered position IDs
actual_seq_len = h_ * w_
if actual_seq_len != self.image_position_ids.shape[-1]:
    # Dynamically recreate position IDs to match input
    self.register_buffer(
        "image_position_ids",
        torch.arange(actual_seq_len, device=self.image_position_ids.device).expand((1, -1)),
        persistent=False
    )

This allows the model to handle any input resolution dynamically, utilizing the existing position embedding interpolation logic when needed.
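
The behavior above can be sketched with a self-contained toy module (names like ToyPosIds are hypothetical; only the buffer-recreation pattern mirrors the fix): the buffer is sized from the pretrained config at init, then recreated in forward() whenever the actual token count differs.

```python
import torch
import torch.nn as nn

class ToyPosIds(nn.Module):
    """Toy module illustrating dynamic position-ID buffer recreation."""

    def __init__(self, latent_h: int = 27, latent_w: int = 27):
        super().__init__()
        # Buffer sized from the pretrained config (27 * 27 = 729 tokens)
        self.register_buffer(
            "image_position_ids",
            torch.arange(latent_h * latent_w).expand((1, -1)),
            persistent=False,
        )

    def forward(self, h_: int, w_: int) -> torch.Tensor:
        actual_seq_len = h_ * w_
        # Recreate the buffer if the actual input shape differs;
        # re-registering under the same name replaces the old buffer
        if actual_seq_len != self.image_position_ids.shape[-1]:
            self.register_buffer(
                "image_position_ids",
                torch.arange(
                    actual_seq_len, device=self.image_position_ids.device
                ).expand((1, -1)),
                persistent=False,
            )
        return self.image_position_ids

m = ToyPosIds()
print(m(27, 27).shape)  # torch.Size([1, 729]): original resolution, buffer unchanged
print(m(64, 64).shape)  # torch.Size([1, 4096]): 1024x1024 latents, buffer recreated
```

Because the buffer is registered with persistent=False, it is never saved to or loaded from checkpoints, so resizing it at runtime cannot break state-dict loading.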

Changes

Files Modified

  1. requirements.txt (NEW)

    • Add comprehensive dependency list with compatible versions
    • Specify installation order to avoid conflicts
    • Include notes for flash-attn installation
  2. models/modeling_showo2_qwen2_5.py

    • Add dynamic image_position_ids recreation in forward() method (Line ~310)
    • Ensures position embeddings match actual input dimensions
    • Preserves existing interpolation logic for non-standard resolutions

Testing

Tested configurations:

  • ✅ 432×432 (original resolution)
  • ✅ 1024×1024
