[Model] add flux2 klein #809

Merged

hsliuustc0106 merged 12 commits into vllm-project:main from david6666666:flux2_klein on Jan 16, 2026
Conversation

@david6666666 (Collaborator) commented Jan 16, 2026


Purpose

Add support for black-forest-labs/FLUX.2-klein.

Test Plan

vLLM-Omni:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model black-forest-labs/FLUX.2-klein-9B \
  --prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 4 \
  --guidance_scale 1.0 \
  --height 768 \
  --width 1360 \
  --output outputs/output_9b_omni.png

online serving:

vllm serve black-forest-labs/FLUX.2-klein-9B --omni --port 8091

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A beautiful landscape painting"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 4,
      "true_cfg_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > klein.png
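For readers who prefer Python over the shell pipeline, the final `cut -d',' -f2- | base64 -d` stage can be replicated with the standard library. `save_image_from_data_url` is a hypothetical helper sketched here, not part of vLLM-Omni; it assumes the response carries a standard `data:image/png;base64,<payload>` URL as in the curl example above:

```python
import base64


def save_image_from_data_url(data_url: str, path: str) -> bytes:
    """Split a 'data:image/png;base64,<payload>' URL, decode the payload,
    and write the raw image bytes to `path`."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:"):
        raise ValueError("expected a data URL")
    raw = base64.b64decode(payload)
    with open(path, "wb") as f:
        f.write(raw)
    return raw
```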

Test Result

vLLM-Omni:
[generated image: output_9b_omni]

python examples/offline_inference/image_to_image/image_edit.py \
  --model black-forest-labs/FLUX.2-klein-9B \
  --image output_9b_omni.png \
  --prompt "replace the trees in the image with cyberpunk buildings." \
  --output output_image_edit.png \
  --num_inference_steps 4 \
  --cfg_scale 4.0
[generated image: output_image_edit]

diffusers:

import torch
from diffusers import Flux2KleinPipeline

dtype = torch.bfloat16
device = "cuda"

height = 768
width = 1360
prompt = "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"

repo_id = "black-forest-labs/FLUX.2-klein-9B"
pipeline = Flux2KleinPipeline.from_pretrained(repo_id, torch_dtype=dtype)
pipeline.to(device)
num_inference_steps = 4
guidance_scale = 1.0
generator = torch.Generator(device=device).manual_seed(42)

out = pipeline(
    prompt=prompt,
    height=height,
    width=width,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale,
    generator=generator
).images[0]

out.save("output_9b.png")
[generated image: output_9b]

Signed-off-by: David Chen <530634352@qq.com>

@chatgpt-codex-connector (bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44f82a7549


Comment on lines +873 to +877
if self.do_classifier_free_guidance:
    negative_prompt = ""
    if prompt is not None and isinstance(prompt, list):
        negative_prompt = [negative_prompt] * len(prompt)
    negative_prompt_embeds, negative_text_ids = self.encode_prompt(


P2: Honor negative_prompt when building CFG embeddings

In Flux2KleinPipeline.forward, the CFG branch always sets negative_prompt to an empty string and never reads req.negative_prompt, so any user-provided negative prompt is silently ignored. This means CFG outputs cannot reflect requested negative constraints (e.g., prompts to avoid artifacts), which can change or invalidate experiment results. Consider threading negative_prompt from the request/args (as other pipelines do) and using it when encoding the negative prompt embeddings.

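The fix Codex is asking for amounts to normalizing the user-supplied negative prompt before encoding it. A minimal sketch of that normalization, using the hypothetical helper name `resolve_negative_prompt` (the actual plumbing through the request object in the pipeline would differ):

```python
def resolve_negative_prompt(prompt, negative_prompt=None):
    """Fall back to "" only when no negative prompt was supplied, and
    broadcast a single string to match a batched (list) prompt, so the
    CFG branch no longer silently discards the user's negative prompt."""
    neg = negative_prompt if negative_prompt is not None else ""
    if isinstance(prompt, list) and isinstance(neg, str):
        neg = [neg] * len(prompt)
    return neg
```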

@david6666666 david6666666 added the ready label to trigger buildkite CI label Jan 16, 2026
@david6666666 (Collaborator, Author):

@ZJY0516 @SamitHuang @hsliuustc0106 PTAL, thanks.

@ZJY0516 (Collaborator) left a comment:

Thanks for the contribution.

query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1)
key = apply_rotary_emb(key, image_rotary_emb, sequence_dim=1)

attn_output = _scaled_dot_product_attention(query, key, value, attention_mask=attention_mask)
Collaborator: Please use the attention layer in vllm_omni/diffusion.

value = torch.cat([encoder_value, value], dim=1)

if image_rotary_emb is not None:
    query = apply_rotary_emb(query, image_rotary_emb, sequence_dim=1)
Collaborator: We have a RoPE layer in vllm-omni.
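For context, here is a minimal self-contained sketch of the rotary application the snippet above performs. The actual RoPE layer in vllm-omni may have a different API; this `apply_rotary_emb` is written from scratch for illustration and assumes precomputed complex-valued frequencies:

```python
import torch


def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent channel pairs of x by precomputed complex frequencies.

    x:         (batch, seq, heads, head_dim), head_dim even
    freqs_cis: complex64, broadcastable to (seq, heads, head_dim // 2)
    """
    # View each adjacent pair of channels as one complex number.
    x_pairs = x.float().reshape(*x.shape[:-1], -1, 2)
    x_complex = torch.view_as_complex(x_pairs)
    # Complex multiply applies the rotation; flatten back to real pairs.
    rotated = torch.view_as_real(x_complex * freqs_cis).flatten(-2)
    return rotated.type_as(x)
```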


# Additional text-related parameters
max_sequence_length: int | None = None
text_encoder_out_layers: tuple[int, ...] | None = None
Collaborator: Why do we need this?

return refresh_cache_context


def enable_cache_for_flux2_klein(pipeline: Any, cache_config: Any) -> Callable[[int], None]:
Collaborator: If cache-dit already supports it, do we still need this? @SamitHuang

Collaborator: Yes, because it's not a regular DiT, but we need to test it. @david6666666

@ZJY0516 ZJY0516 requested a review from SamitHuang January 16, 2026 05:37


# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
def retrieve_latents(
Collaborator: Why not just import it?



# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
def retrieve_timesteps(
Collaborator: Ditto.

self.dropout = dropout
self.added_kv_proj_dim = added_kv_proj_dim

self.to_q = nn.Linear(query_dim, self.inner_dim, bias=bias)
Collaborator: Why not use QKVParallelLinear or ReplicatedLinear from vllm.model_executor.layers.linear?

self.to_out = nn.ModuleList([nn.Linear(self.inner_dim, self.out_dim, bias=out_bias), nn.Dropout(dropout)])

if added_kv_proj_dim is not None:
    self.norm_added_q = nn.RMSNorm(dim_head, eps=eps)
Collaborator: Why not use RMSNorm from vLLM?

Collaborator: No need to use the vLLM norm op here; a native torch op + torch.compile will be better.

Collaborator: vLLM's RMSNorm doesn't support torch.compile?

Collaborator: It does. But the dispatch logic in vLLM differs slightly from omni diffusion's custom op, so I think decoupling is better. And in most cases vLLM doesn't use a custom RMSNorm kernel anyway.

Collaborator (Author): I tested it; the native torch op + torch.compile is better.
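Concretely, "native torch op + torch.compile" means something like the following. This is a hedged sketch of a plain-PyTorch RMSNorm written for illustration, not the exact code in this PR:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Plain-PyTorch RMSNorm: simple elementwise math that torch.compile
    can fuse on its own, with no custom kernel required."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the RMS in float32 for stability, then cast back.
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        x_normed = x.float() * torch.rsqrt(var + self.eps)
        return (x_normed * self.weight.float()).type_as(x)


# The whole module can then be wrapped, e.g.:
# norm = torch.compile(RMSNorm(3072))
```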

@hsliuustc0106 (Collaborator):

In terms of the seed, I remember @SamitHuang changed the default seed to 0, as it can influence some accuracy results.

@SamitHuang (Collaborator):

> In terms of the seed, I remember @SamitHuang changed the default seed to 0, as it can influence some accuracy results.

Yes, but it doesn't affect this PR.

@hsliuustc0106 (Collaborator):

Any benchmark result for latency?

@david6666666 (Collaborator, Author):

> Any benchmark result for latency?

On an H800: 9B is 0.8 s, 4B is 0.5 s.
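For reference, per-call latency like the figures above could be measured with a harness along these lines. This is a hypothetical sketch, not the project's benchmark tooling:

```python
import time


def mean_latency(fn, warmup: int = 2, iters: int = 10) -> float:
    """Average wall-clock seconds per call after a short warmup.
    (For GPU pipelines, a torch.cuda.synchronize() before each timestamp
    would be needed to get accurate numbers.)"""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```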

@david6666666 (Collaborator, Author):

PR is ready.

@david6666666 (Collaborator, Author):

The vllm-omni-amd-ci failure is not related.

@hsliuustc0106 hsliuustc0106 merged commit a7f9926 into vllm-project:main Jan 16, 2026
6 of 7 checks passed
erfgss pushed a commit to erfgss/vllm-omni that referenced this pull request Jan 19, 2026
erfgss added a commit to erfgss/vllm-omni that referenced this pull request Jan 19, 2026
@david6666666 david6666666 mentioned this pull request Jan 20, 2026
55 tasks
@jannikstdl commented Jan 20, 2026:

Will it someday support image editing in the API?

@ZJY0516 (Collaborator) commented Jan 20, 2026:

> Will it someday support image editing in the API?

I think we already support online serving for image edit; see https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/image_to_image

@jannikstdl:

@ZJY0516 OK, I'll try it in the latest vLLM-Omni Docker image. I could not find anything about that in the docs.

@david6666666 david6666666 deleted the flux2_klein branch January 22, 2026 01:57
@kolmogorov-quyet:

They don't work with parallelism acceleration. Can anyone write code that supports one of the four architectures: Ulysses-SP, Ring-SP, CFG-Parallel, or Tensor-Parallel? I've tried one H2O GPU vs. two H2O GPUs and the speed stays the same. Model FLUX.2-klein-9B, tested with --usp 2, --ring 2 (hybrid), and Tensor-Parallel.

@hsliuustc0106 (Collaborator):

@wtomin @ZJY0516 PTAL

@ZJY0516 (Collaborator) commented Jan 27, 2026:

Could you please try #973?
