[BUG] Support 32 head size for SGLang #1048
Open
Recently, SGLang added support for BERT encoder models with a head size of 32, but only for the Torch and Triton backends. When I try to use the FlashInfer backend, I get:

Error: Invalid configuration : NUM_MMA_Q=2 NUM_MMA_D_QK=2 NUM_MMA_D_VO=2 NUM_MMA_KV=4 NUM_WARPS_Q=4 NUM_WARPS_KV=1 please create an issue (https://github.com/flashinfer-ai/flashinfer/issues) and report the issue to the developers.

I'm using JIT mode with CUDA 12.4, torch 2.5, flashinfer 0.2.5, and sglang 0.4.6.