Conversation

Contributor

@shen-shanshan shen-shanshan commented Nov 11, 2025

Purpose

Extract ConvLayer as CustomOp for better management and extensibility (especially for plugin devices).

Find more details in the comments below.

Test Plan

I have tested this PR on the Ascend platform and it works well, even showing better performance.

Test Result

See: vllm-project/vllm-ascend#4198.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@shen-shanshan shen-shanshan marked this pull request as draft November 11, 2025 09:08
@mergify mergify bot added the qwen Related to Qwen models label Nov 11, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully extracts the VisionPatchEmbed layer as a CustomOp, improving modularity and extensibility across various vision models. The refactoring in ernie45_vl.py, glm4_1v.py, moonvit.py, qwen2_5_vl.py, qwen2_vl.py, qwen3_omni_moe_thinker.py, and qwen3_vl.py correctly utilizes the new factory function get_vision_patch_embed to instantiate the appropriate patch embedding layers. This change streamlines the codebase by centralizing the logic for creating vision patch embedding layers.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@shen-shanshan shen-shanshan marked this pull request as ready for review November 11, 2025 12:10

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Member

@Isotr0py Isotr0py left a comment


BTW, I think we can finish the convolution layer implementation in this PR and only modify a few typical models' implementations, then migrate the remaining models in follow-up PRs.

Otherwise, every time we refine the layer implementation we would have to modify all models' implementations at once, which is inefficient and error-prone.

Comment on lines 51 to 61
self.proj = nn.Conv2d(
    in_channels=in_channels,
    out_channels=out_channels,
    kernel_size=kernel_size,
    stride=stride,
    padding=padding,
    dilation=dilation,
    groups=groups,
    bias=bias,
    padding_mode=padding_mode,
)
Member


Suggested change
self.proj = nn.Conv2d(
    in_channels=in_channels,
    out_channels=out_channels,
    kernel_size=kernel_size,
    stride=stride,
    padding=padding,
    dilation=dilation,
    groups=groups,
    bias=bias,
    padding_mode=padding_mode,
)
self.weight = nn.Parameter(
    torch.empty(
        (out_channels, in_channels // groups, *kernel_size),
    )
)

This will create an nn.Conv2d as a submodule inside Conv2dLayer; let's initialize the weight and bias in Conv2dLayer directly.
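For readers skimming the thread, here is a minimal sketch (not the PR's final code) of what holding the parameters directly could look like; the Conv2dLayer name and argument handling follow the discussion, and the init simply mirrors nn.Conv2d's defaults:

import math

import torch
import torch.nn as nn

from vllm.model_executor.custom_op import CustomOp


class Conv2dLayer(CustomOp):
    """Sketch: keep the conv weight/bias as plain parameters, no nn.Conv2d submodule."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: tuple[int, int], stride: tuple[int, int],
                 padding: int = 0, dilation: int = 1, groups: int = 1,
                 bias: bool = True) -> None:
        super().__init__()
        self.stride, self.padding = stride, padding
        self.dilation, self.groups = dilation, groups
        # Same layout nn.Conv2d uses internally: (out, in // groups, kH, kW).
        self.weight = nn.Parameter(
            torch.empty(out_channels, in_channels // groups, *kernel_size))
        self.bias = nn.Parameter(torch.empty(out_channels)) if bias else None
        # Mirror nn.Conv2d's default init so numerics match the submodule version.
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in = in_channels // groups * math.prod(kernel_size)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            nn.init.uniform_(self.bias, -bound, bound)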

Comment on lines 64 to 65
x = self.proj(x)
return x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x = self.proj(x)
return x
return F.conv2d(
    x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups
)

Then we can control the dispatch at the ops level (F.conv2d etc.).
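As a rough illustration of that ops-level dispatch (assuming the __init__ sketched above and vLLM's CustomOp dispatch; the registration name here is made up):

import torch
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("conv2d_layer")  # hypothetical op name, for illustration only
class Conv2dLayer(CustomOp):
    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference path: a plain functional conv over the layer's own parameters.
        return F.conv2d(x, self.weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
        # CUDA can reuse the native path; out-of-tree platforms (e.g. Ascend)
        # can instead override forward_oot with their own conv kernel.
        return self.forward_native(x)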

Comment on lines 129 to 144
class LinearConvLayer(ConvLayerBase):
    """Conv layer with linear module."""

    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = True,
        skip_bias_add: bool = False,
        params_dtype: torch.dtype | None = None,
        quant_config: QuantizationConfig | None = None,
        prefix: str = "",
        *,
        return_bias: bool = True,
        disable_tp: bool = False,
    ) -> None:
Member


Suggested change
class LinearConvLayer(ConvLayerBase):
    """Conv layer with linear module."""

    def __init__(
        self,
        input_size: int,
        output_size: int,
        bias: bool = True,
        skip_bias_add: bool = False,
        params_dtype: torch.dtype | None = None,
        quant_config: QuantizationConfig | None = None,
        prefix: str = "",
        *,
        return_bias: bool = True,
        disable_tp: bool = False,
    ) -> None:

class LinearConvLayer(ConvLayerBase):
    """Conv layer with linear module."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int = 1,
        padding: int | tuple | str = 0,
        dilation: int | tuple = 1,
        groups: int = 1,
        bias: bool = True,
        padding_mode: str = "zeros",
    ) -> None:

Hmmm, we only use linear to replace convolution when kernel_size == stride, which is a special case for nn.Conv2d/nn.Conv3d. I think we can implement the linearized convolution inside Conv2dLayer, if not we should at least keep args consistent with ConvLayer. WDYT?

Contributor Author

@shen-shanshan shen-shanshan Nov 13, 2025


Hmmm, we only use linear to replace convolution when kernel_size == stride, which is a special case for nn.Conv2d/nn.Conv3d. I think we can implement the linearized convolution inside Conv2dLayer, if not we should at least keep args consistent with ConvLayer. WDYT?

Oh, that's a good suggestion; maybe we can merge these two types into Conv2dLayer and add some checks to dispatch the forward.
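For context, a tiny standalone PyTorch check (not vLLM code) of why the kernel_size == stride, zero-padding case can be expressed as a single matmul over non-overlapping patches:

import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8)                     # (N, C, H, W)
weight = torch.randn(16, 3, 4, 4)               # kernel == stride == 4, no padding
bias = torch.randn(16)

conv_out = F.conv2d(x, weight, bias, stride=4)  # -> (2, 16, 2, 2)

# Equivalent linear form: cut the image into non-overlapping patches, then one matmul.
patches = F.unfold(x, kernel_size=4, stride=4)  # -> (2, 3*4*4, 4 patches)
lin_out = (weight.view(16, -1) @ patches + bias.view(1, -1, 1)).view(2, 16, 2, 2)

assert torch.allclose(conv_out, lin_out, atol=1e-4)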

Member


We currently only replace conv3d with linear because of a performance regression in torch 2.9 (see #27406).

BTW, I remember linear being faster than convolution for point-wise convolutions in early PyTorch, so other platforms (like CPU) may also benefit from linearized conv2d/conv3d because they have custom optimized GEMM ops.

I think we can implement both conv layers like this to automatically linearize the conv ops:

class Conv2dLayer(CustomOp):
    """Conv layer with Conv2d."""

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int | tuple,
        stride: int | tuple | None,
        padding: int | tuple | str | None,
        dilation: int | tuple | None,
        groups: int | None,
        bias: bool | None,
        padding_mode: str | None,
    ) -> None:
        super().__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        ...

        self.can_linearize = (self.kernel_size == self.stride and not self.padding)

        if self.can_linearize:
            self.weight = nn.Parameter(
                torch.empty(out_channels, in_channels * math.prod(kernel_size))
            )
        else:
            self.weight = nn.Parameter(
                torch.empty(out_channels, in_channels, *kernel_size)
            )
        ...
    
    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        .... # <- do some reshape here
        if self.can_linearize:
            return F.linear(x, self.weight, self.bias)
        else:
            return F.conv2d(
                x,
                self.weight,
                bias=self.bias,
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups,
            )

Contributor Author


@Isotr0py Currently, vLLM only converts Conv3D to Linear with conv3d_to_linear_weight() in vision.py and doesn't apply this optimization to Conv2D. Could we directly implement it as below?

  • Conv2dLayer -> only use F.conv2d(). (because we don't convert the Conv2D weight when loading)
  • Conv3dLayer -> directly use F.linear() when kernel_size == stride for better performance (sketched below).
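A minimal sketch of that split, assuming the Conv3dLayer name from the discussion and eliding the patch reshape (as in the earlier example):

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


class Conv3dLayer(CustomOp):
    """Sketch: linearize only when it is exactly equivalent (kernel_size == stride, no padding)."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: tuple[int, int, int],
                 stride: tuple[int, int, int], bias: bool = True) -> None:
        super().__init__()
        self.stride = stride
        self.can_linearize = tuple(kernel_size) == tuple(stride)
        if self.can_linearize:
            # Store the weight already flattened for F.linear.
            self.weight = nn.Parameter(
                torch.empty(out_channels, in_channels * math.prod(kernel_size)))
        else:
            self.weight = nn.Parameter(
                torch.empty(out_channels, in_channels, *kernel_size))
        self.bias = nn.Parameter(torch.empty(out_channels)) if bias else None

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        if self.can_linearize:
            # x is expected as flattened patches: (..., in_channels * prod(kernel_size)).
            return F.linear(x, self.weight, self.bias)
        # Otherwise x is the usual (N, C, D, H, W) volume.
        return F.conv3d(x, self.weight, self.bias, stride=self.stride)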

Member


(because we don't convert the Conv2D weight when loading)

In fact, conv3d_to_linear_weight is just a workaround for nn.Conv3d introduced while fixing the performance regression. Since we have decided to expose the conv layer as a CustomOp, we can optimize its usage with a better implementation.

We can implement a weight_loader for the conv layer, like the linear layers do, so that the weight is converted automatically when calling weight_loader(param, loaded_weight):

set_weight_attrs(weight, {"weight_loader": self.weight_loader})
def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
    if self.can_linearize:
        out_channels = loaded_weight.shape[0]
        loaded_weight = loaded_weight.reshape(out_channels, -1)
    param.data.copy_(loaded_weight)
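Fleshing that out a little (these are meant as methods on the conv layer; can_linearize and the flattened parameter shape follow the earlier sketches, not the PR's final code):

import torch
from torch.nn import Parameter

from vllm.model_executor.utils import set_weight_attrs


def create_conv_weight(self, out_channels: int, flattened_in_features: int) -> None:
    # Create the (already flattened) parameter and attach the loader so
    # vLLM's model loader calls it for us during checkpoint loading.
    weight = Parameter(torch.empty(out_channels, flattened_in_features))
    set_weight_attrs(weight, {"weight_loader": self.weight_loader})
    self.register_parameter("weight", weight)


def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor) -> None:
    # Checkpoints store conv weights as (out, in, *kernel); flatten on the fly
    # when the layer runs in its linearized form.
    if self.can_linearize:
        loaded_weight = loaded_weight.reshape(loaded_weight.shape[0], -1)
    assert param.data.shape == loaded_weight.shape
    param.data.copy_(loaded_weight)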

Contributor Author


OK, that's a good idea; I will follow this to implement it today.

@shen-shanshan shen-shanshan changed the title from [Model][MM] Extract VisionPatchEmbed as CustomOp to [Model][MM] Extract conv layer as CustomOp Nov 12, 2025
@Isotr0py
Member

BTW, let me update the layer implementation for you tonight. :)

Signed-off-by: shen-shanshan <[email protected]>
@shen-shanshan
Contributor Author

BTW, let me update the layer implementation for you tonight. :)

Oh sorry, I didn't notice that and have just force-pushed a new commit... 😂

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
@Isotr0py
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully extracts ConvLayer as CustomOp, which improves modularity and extensibility. The changes correctly integrate the new Conv2dLayer and Conv3dLayer into existing models, replacing direct nn.Conv2d or ReplicatedLinear calls and removing the conv3d_to_linear_weight utility function. This is a positive step towards better management of convolution operations within the framework. However, there are a few areas in the new conv.py file that require attention to ensure robustness and correctness, particularly regarding the handling of kernel_size and stride in CausalConv2dLayer when they are provided as tuples.
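One common way to make that tuple handling robust (a generic helper for illustration, not necessarily what the PR does) is to normalize the arguments up front:

def to_2tuple(v: int | tuple[int, int]) -> tuple[int, int]:
    """Normalize an int-or-tuple conv argument to a 2-tuple."""
    return (v, v) if isinstance(v, int) else tuple(v)


# All downstream padding/stride arithmetic can then index per dimension safely.
kernel_size = to_2tuple(3)   # -> (3, 3)
stride = to_2tuple((2, 1))   # -> (2, 1)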

@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2025
Member

@Isotr0py Isotr0py left a comment


Let's see if the CI can pass now.

@shen-shanshan
Contributor Author

shen-shanshan commented Nov 14, 2025

I have also tested this PR on the Ascend platform and it works well, even showing better performance.

See: vllm-project/vllm-ascend#4198.

@Isotr0py Isotr0py merged commit 41b92f7 into vllm-project:main Nov 14, 2025
57 checks passed
@Isotr0py
Member

Also cc @jikunshang @bigPYJ1151: the CPU/XPU platforms may need to update forward_cpu/forward_xpu for the convolution layer optimization as well.
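For illustration only, a hypothetical platform hook (assuming the Conv2dLayer sketched earlier in this thread) would look like:

import torch

from vllm.model_executor.custom_op import CustomOp


class Conv2dLayer(CustomOp):
    # __init__ / forward_native as sketched earlier in the thread.

    def forward_cpu(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder: fall back to the reference path until a CPU-specific
        # optimization (e.g. a linearized GEMM path) is measured to help.
        return self.forward_native(x)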

geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: George D. Torres <[email protected]>
@jikunshang
Collaborator

@Isotr0py thanks for the reminder! For XPU we call forward_native by default, and for CPU it calls forward_cuda. Ideally both should work, since these are just torch operator calls here. We will check whether this affects functionality/perf on XPU & CPU.

bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Bram Wasti <[email protected]>
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Nov 26, 2025
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
vllm-project/vllm#26866
2. get_mrope_input_positions is broken by
vllm-project/vllm#28399
3. graph mode is broken by
vllm-project/vllm#25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by
vllm-project/vllm#28534
6. spec decode is broken by
vllm-project/vllm#28771
7. sp feature is broken by
vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by
vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

 
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
vllm-project/vllm#23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
vllm-project/vllm#28733 We'll remove ascend
scheduler later.
4. qwen3-next is broken by
vllm-project/vllm#28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache+ ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Kurumi5210 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Nov 26, 2025
Bump vLLM version to v0.11.2

Signed-off-by: Kurumi5210 <[email protected]>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Nov 29, 2025
Bump vLLM version to v0.11.2

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>