For the prefix cache and chunked prefill scenarios, the latent KV produced by the earlier tokens of the sequence already resides in the KV cache, so it must be gathered according to the indices. A linear layer (kv_b_proj) then up-projects the latent back to the normal K/V dimension, and the positional-encoding (rope) part is concatenated onto K. The resulting K and V can then participate in the subsequent MHA computation.
Kernel input:
kv_cache [num_block, block_size, 576] bf16/fp8
kv_indptr [batch_size+1] int32
kv_indices [xxx] int32
cu_seqlens_k [batch_size+1] int32
kv_b_proj.weight [2*128/TP * 128, 512] fp8
kv_b_proj.scale [2*128/TP, 4] fp32
kv_cache_scale [1] None/fp32
Kernel output:
K [token_num, 128/TP * 192] bf16
V [token_num, 128/TP * 128] bf16
Performance target: comparable to a normal fp8 block-scale GEMM.
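A minimal NumPy sketch of the data flow above (gather by index, up-project the 512-dim latent through kv_b_proj, broadcast the shared 64-dim rope part to every head, and concatenate). Shapes are shrunk stand-ins for the real ones (576 = 512 latent + 64 rope, 128/TP heads, 128-dim nope/v), fp8 block-scale dequantization is elided, and all names in the sketch are hypothetical:

```python
import numpy as np

# Hypothetical toy dimensions; real kernel uses 576 = 512 latent + 64 rope,
# 128/TP heads, and 128-dim k_nope / v per head.
NUM_BLOCKS, BLOCK_SIZE = 4, 2
LATENT, ROPE = 8, 4
HEADS, QK_NOPE, V_DIM = 2, 6, 6

rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((NUM_BLOCKS, BLOCK_SIZE, LATENT + ROPE)).astype(np.float32)
# kv_b_proj.weight maps the latent to per-head [k_nope; v]: [heads*(qk_nope+v), latent].
w_kv_b = rng.standard_normal((HEADS * (QK_NOPE + V_DIM), LATENT)).astype(np.float32)

def gather_and_upproject(kv_cache, kv_indices, w_kv_b):
    """Gather latent KV by flat slot index, up-project, and split into K and V."""
    nb, bs, d = kv_cache.shape
    flat = kv_cache.reshape(nb * bs, d)
    latent = flat[kv_indices, :LATENT]   # [tokens, latent] - part that goes through kv_b_proj
    k_pe = flat[kv_indices, LATENT:]     # [tokens, rope] - positional part, shared across heads
    proj = latent @ w_kv_b.T             # [tokens, heads*(qk_nope+v)]
    proj = proj.reshape(-1, HEADS, QK_NOPE + V_DIM)
    k_nope, v = proj[:, :, :QK_NOPE], proj[:, :, QK_NOPE:]
    # Broadcast the shared rope part to every head and concatenate onto k_nope.
    k_pe_b = np.broadcast_to(k_pe[:, None, :], (k_pe.shape[0], HEADS, ROPE))
    k = np.concatenate([k_nope, k_pe_b], axis=-1)
    n = len(kv_indices)
    return k.reshape(n, -1), v.reshape(n, -1)   # [tokens, heads*192-equiv], [tokens, heads*128-equiv]

kv_indices = np.array([5, 0, 3], dtype=np.int32)  # example cache slots for one request
K, V = gather_and_upproject(kv_cache, kv_indices, w_kv_b)
```

In the real kernel, the gather and the block-scaled fp8 GEMM would be fused so the latent never round-trips through global memory, and kv_b_proj.scale would dequantize each 128-wide weight block before (or inside) the matmul.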