For the prefix cache and chunked prefill scenarios, the latent KV produced by the earlier tokens of the sequence already resides in the KV cache, so it must be gathered according to the indices. A linear layer (kv_b_proj) then up-projects the latent back to the normal K/V dimension, and the positional-encoding (rope) part is concatenated onto K. The resulting K and V can then participate in the subsequent MHA computation.
Kernel input:
kv_cache [num_block, block_size, 576] bf16/fp8
kv_indptr [batch_size+1] int32
kv_indices [xxx] int32
cu_seqlens_k [batch_size+1] int32
kv_b_proj.weight [2*128/TP * 128, 512] fp8
kv_b_proj.scale [2*128/TP, 4] fp32
kv_cache_scale [1] None/fp32
Kernel output:
K [token_num, 128/TP * 192] bf16
V [token_num, 128/TP * 128] bf16
Performance target: comparable to a normal fp8 block-scale GEMM.
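A minimal NumPy sketch of the data flow above (gather by index, up-project the 512-dim latent through kv_b_proj, broadcast the shared 64-dim rope part to every head, and concatenate). Shapes are shrunk stand-ins for the real ones (576 = 512 latent + 64 rope, 128/TP heads, 128-dim nope/v), fp8 block-scale dequantization is elided, and all names in the sketch are hypothetical:

```python
import numpy as np

# Hypothetical toy dimensions; real kernel uses 576 = 512 latent + 64 rope,
# 128/TP heads, and 128-dim k_nope / v per head.
NUM_BLOCKS, BLOCK_SIZE = 4, 2
LATENT, ROPE = 8, 4
HEADS, QK_NOPE, V_DIM = 2, 6, 6

rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((NUM_BLOCKS, BLOCK_SIZE, LATENT + ROPE)).astype(np.float32)
# kv_b_proj.weight maps the latent to per-head [k_nope; v]: [heads*(qk_nope+v), latent].
w_kv_b = rng.standard_normal((HEADS * (QK_NOPE + V_DIM), LATENT)).astype(np.float32)

def gather_and_upproject(kv_cache, kv_indices, w_kv_b):
    """Gather latent KV by flat slot index, up-project, and split into K and V."""
    nb, bs, d = kv_cache.shape
    flat = kv_cache.reshape(nb * bs, d)
    latent = flat[kv_indices, :LATENT]   # [tokens, latent] - part that goes through kv_b_proj
    k_pe = flat[kv_indices, LATENT:]     # [tokens, rope] - positional part, shared across heads
    proj = latent @ w_kv_b.T             # [tokens, heads*(qk_nope+v)]
    proj = proj.reshape(-1, HEADS, QK_NOPE + V_DIM)
    k_nope, v = proj[:, :, :QK_NOPE], proj[:, :, QK_NOPE:]
    # Broadcast the shared rope part to every head and concatenate onto k_nope.
    k_pe_b = np.broadcast_to(k_pe[:, None, :], (k_pe.shape[0], HEADS, ROPE))
    k = np.concatenate([k_nope, k_pe_b], axis=-1)
    n = len(kv_indices)
    return k.reshape(n, -1), v.reshape(n, -1)   # [tokens, heads*192-equiv], [tokens, heads*128-equiv]

kv_indices = np.array([5, 0, 3], dtype=np.int32)  # example cache slots for one request
K, V = gather_and_upproject(kv_cache, kv_indices, w_kv_b)
```

In the real kernel, the gather and the block-scaled fp8 GEMM would be fused so the latent never round-trips through global memory, and kv_b_proj.scale would dequantize each 128-wide weight block before (or inside) the matmul.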