In the attention part of DeepSeek-V3 (dsv3), there is a linear module with din = 7168 and dout = 576. When computing dx in the backward pass, the matrix multiplication has shapes (seq_len, 576) @ (576, 7168), so the contraction (k) dimension is 576. However, DeepGEMM requires the k dimension to be divisible by 128 (reference: DeepGEMM gemm.py#L192), and 576 is not a multiple of 128.
I'd like to ask whether padding is necessary for this matrix multiplication to satisfy the requirement.
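For context, here is a minimal NumPy sketch (not DeepGEMM itself) illustrating the padding idea being asked about: zero-padding both operands along k up to the next multiple of 128 (576 → 640) leaves the product mathematically unchanged, since the padded region only contributes zero terms to each dot product. The function name `pad_k` and the use of NumPy are my own illustration, not part of DeepGEMM's API.

```python
import numpy as np

def pad_k(a, b, multiple=128):
    """Zero-pad the shared k dimension of a (m, k) @ (k, n) product
    up to the next multiple of `multiple`. Hypothetical helper for
    illustration only."""
    k = a.shape[1]
    assert b.shape[0] == k
    pad = (-k) % multiple  # k = 576 -> pad 64 -> padded k = 640
    a_p = np.pad(a, ((0, 0), (0, pad)))  # pad columns of a
    b_p = np.pad(b, ((0, pad), (0, 0)))  # pad rows of b
    return a_p, b_p

seq_len, din, dout = 4, 7168, 576
dy = np.random.rand(seq_len, dout).astype(np.float32)  # upstream grad
w = np.random.rand(dout, din).astype(np.float32)       # weight

dy_p, w_p = pad_k(dy, w)
dx_ref = dy @ w      # original product, k = 576
dx_pad = dy_p @ w_p  # padded product, k = 640; same result
assert dy_p.shape[1] % 128 == 0
assert np.allclose(dx_ref, dx_pad, atol=1e-2)
```

The padding adds roughly 11% extra FLOPs on the k dimension (640/576), which is part of why I am asking whether it is actually required or whether there is a cheaper path.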