In the attention part of DeepSeek-V3 (dsv3), there is a linear module with din = 7168 and dout = 576. When computing dx in the backward pass, the matrix multiplication has shapes (seq_len, 576) @ (576, 7168), so the contraction (k) dimension is 576. However, DeepGEMM requires the k dimension to be divisible by 128 (reference: DeepGEMM gemm.py#L192), and 576 is not a multiple of 128.
I'd like to ask whether padding is necessary for this matrix multiplication to satisfy the requirement.
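For context, here is a minimal NumPy sketch (not DeepGEMM itself) illustrating the padding idea being asked about: zero-padding both operands along k up to the next multiple of 128 (576 → 640) leaves the product mathematically unchanged, since the padded region only contributes zero terms to each dot product. The function name `pad_k` and the use of NumPy are my own illustration, not part of DeepGEMM's API.

```python
import numpy as np

def pad_k(a, b, multiple=128):
    """Zero-pad the shared k dimension of a (m, k) @ (k, n) product
    up to the next multiple of `multiple`. Hypothetical helper for
    illustration only."""
    k = a.shape[1]
    assert b.shape[0] == k
    pad = (-k) % multiple  # k = 576 -> pad 64 -> padded k = 640
    a_p = np.pad(a, ((0, 0), (0, pad)))  # pad columns of a
    b_p = np.pad(b, ((0, pad), (0, 0)))  # pad rows of b
    return a_p, b_p

seq_len, din, dout = 4, 7168, 576
dy = np.random.rand(seq_len, dout).astype(np.float32)  # upstream grad
w = np.random.rand(dout, din).astype(np.float32)       # weight

dy_p, w_p = pad_k(dy, w)
dx_ref = dy @ w      # original product, k = 576
dx_pad = dy_p @ w_p  # padded product, k = 640; same result
assert dy_p.shape[1] % 128 == 0
assert np.allclose(dx_ref, dx_pad, atol=1e-2)
```

The padding adds roughly 11% extra FLOPs on the k dimension (640/576), which is part of why I am asking whether it is actually required or whether there is a cheaper path.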