@BlueDyee BlueDyee commented Nov 5, 2025

Problem

This PR fixes two critical issues that prevent inference at resolutions different from the pretrained model's training resolution (432x432):

1. Dependency Conflict in Environment Setup

Issue: Installing flash-attn overwrites the pinned PyTorch version

  • torch==2.5.1 is installed first
  • installing flash-attn then replaces it with torch==2.9.0, which is incompatible

Root Cause: flash-attn has strict PyTorch version requirements that conflict with the specified version.

Solution: Update requirements.txt with compatible versions and installation order.
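
A minimal sketch of the intended installation order (the exact pinned versions belong in this PR's requirements.txt; `--no-build-isolation` is the flag flash-attn's own install instructions recommend for building against an already-installed torch):

```shell
# Install and pin PyTorch first
pip install torch==2.5.1

# Then build flash-attn against the torch already in the environment,
# so pip's isolated build does not pull in a different torch version
pip install flash-attn --no-build-isolation
```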

2. Position Embedding Dimension Mismatch

Issue: RuntimeError when using resolutions different from pretrained model

RuntimeError: The size of tensor a (1024) must match the size of tensor b (729) at non-singleton dimension 1

Root Cause:

  • Pretrained model config (show-o2-1.5B-HQ/config.json) contains fixed resolution settings:
    "image_latent_height": 27,
    "image_latent_width": 27,
  • The model's __init__ registers a position-ID buffer sized from these hardcoded values (27×27 = 729 tokens):
    self.register_buffer("image_position_ids",
                       torch.arange(image_latent_height * image_latent_width).expand((1, -1)),
                       persistent=False)
    
  • When inferring at 1024×1024 resolution, actual input has 64×64=4096 tokens
  • The image_position_ids buffer (initialized at Line 81) remains at 729 tokens
  • Position embedding fails when trying to add embeddings of different sizes

Solution:
Add dynamic position ID recreation in the forward() method to match actual input shape:

# Check if actual input shape matches buffered position IDs
actual_seq_len = h_ * w_
if actual_seq_len != self.image_position_ids.shape[-1]:
    # Dynamically recreate position IDs to match input
    self.register_buffer(
        "image_position_ids",
        torch.arange(actual_seq_len, device=self.image_position_ids.device).expand((1, -1)),
        persistent=False
    )

This allows the model to handle any input resolution dynamically, utilizing the existing position embedding interpolation logic when needed.
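
The behavior above can be sketched with a self-contained toy module (names like ToyPosIds are hypothetical; only the buffer-recreation pattern mirrors the fix): the buffer is sized from the pretrained config at init, then recreated in forward() whenever the actual token count differs.

```python
import torch
import torch.nn as nn

class ToyPosIds(nn.Module):
    """Toy module illustrating dynamic position-ID buffer recreation."""

    def __init__(self, latent_h: int = 27, latent_w: int = 27):
        super().__init__()
        # Buffer sized from the pretrained config (27 * 27 = 729 tokens)
        self.register_buffer(
            "image_position_ids",
            torch.arange(latent_h * latent_w).expand((1, -1)),
            persistent=False,
        )

    def forward(self, h_: int, w_: int) -> torch.Tensor:
        actual_seq_len = h_ * w_
        # Recreate the buffer if the actual input shape differs;
        # re-registering under the same name replaces the old buffer
        if actual_seq_len != self.image_position_ids.shape[-1]:
            self.register_buffer(
                "image_position_ids",
                torch.arange(
                    actual_seq_len, device=self.image_position_ids.device
                ).expand((1, -1)),
                persistent=False,
            )
        return self.image_position_ids

m = ToyPosIds()
print(m(27, 27).shape)  # torch.Size([1, 729]): original resolution, buffer unchanged
print(m(64, 64).shape)  # torch.Size([1, 4096]): 1024x1024 latents, buffer recreated
```

Because the buffer is registered with persistent=False, it is never saved to or loaded from checkpoints, so resizing it at runtime cannot break state-dict loading.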

Changes

Files Modified

  1. requirements.txt (NEW)

    • Add comprehensive dependency list with compatible versions
    • Specify installation order to avoid conflicts
    • Include notes for flash-attn installation
  2. models/modeling_showo2_qwen2_5.py

    • Add dynamic image_position_ids recreation in forward() method (Line ~310)
    • Ensures position embeddings match actual input dimensions
    • Preserves existing interpolation logic for non-standard resolutions

Testing

Tested configurations:

  • ✅ 432×432 (original resolution)
  • ✅ 1024×1024
