fix: Add glm4_moe_lite to MLA detection #32614
Conversation
GLM-4.7-Flash (glm4_moe_lite) and GLM-4.6 (glm4_moe) use the same Multi-head Latent Attention (MLA) architecture as DeepSeek models but were not included in the is_deepseek_mla() check. This caused vLLM to fall back to standard KV caching instead of efficient MLA caching, resulting in significantly higher memory usage (4x more KV cache than necessary). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
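The memory claim can be sanity-checked with back-of-envelope arithmetic. The sketch below compares per-token KV-cache size with and without MLA; all dimensions are illustrative assumptions in the DeepSeek style (kv_lora_rank=512, rope dim 64), not GLM-4.7-Flash's actual config:

```python
# Back-of-envelope: per-token, per-layer KV-cache elements, standard vs MLA.
# All dimensions below are illustrative assumptions, not GLM-4.7-Flash's config.

def kv_cache_elems_standard(num_kv_heads: int, head_dim: int) -> int:
    """Standard caching stores a full K and V vector for every KV head."""
    return 2 * num_kv_heads * head_dim

def kv_cache_elems_mla(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """MLA caches one compressed latent plus the shared RoPE key component."""
    return kv_lora_rank + qk_rope_head_dim

standard = kv_cache_elems_standard(num_kv_heads=8, head_dim=128)  # 2048 elems
mla = kv_cache_elems_mla(kv_lora_rank=512, qk_rope_head_dim=64)   # 576 elems

print(f"standard: {standard} elems/token/layer")
print(f"MLA:      {mla} elems/token/layer")
print(f"ratio:    {standard / mla:.1f}x")  # ~3.6x with these assumed dims
```

With these assumed dimensions the ratio lands in the same ballpark as the ~4x the PR description reports; the exact factor depends on the model's head count and latent rank.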
Code Review
This pull request aims to enable Multi-head Latent Attention (MLA) for GLM-4 MoE models. The change correctly adds glm4_moe_lite to the MLA detection logic, which is consistent with its implementation that supports MLA. However, adding glm4_moe is problematic as its current implementation in vLLM does not seem to support MLA, which could lead to runtime issues. I've provided a critical comment with a suggestion to only include glm4_moe_lite for now.
GLM-4.7-Flash (glm4_moe_lite) uses the same Multi-head Latent Attention (MLA) architecture as DeepSeek models but was not included in the is_deepseek_mla() check. This caused vLLM to fall back to standard KV caching instead of efficient MLA caching, resulting in significantly higher memory usage (4x more KV cache than necessary). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Force-pushed from 09521b2 to edaf4f7
MatthewBonanni
left a comment
LGTM, thanks for catching this!
@LucasWilkinson @MatthewBonanni When I run with this PR on B200, I get this error. I think we need to change the kernel support registration.
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Head branch was pushed to by a user without write access
Could it be because of the known looping issues that llama.cpp also had? Like 0% because it doesn't arrive at an answer and only generates think tokens, or did you control for that already? I have never contributed to vLLM before, so I'm not completely familiar with the testing procedures.

LGTM! Works well on RTX 6000 Blackwell; the model is unusable without this. Nobody is going to run this small model on B200 anyway, merge it and fix B200 mañana!
Signed-off-by: mgoin <mgoin64@gmail.com>
mgoin
left a comment
Should be good to go now with the Blackwell fixes, thanks for kicking this off @marksverdhei !
Thank you for merging! I can now call myself a proud contributor to vLLM, with those two lines of strings! 😆 Way to kick off my 2026 New Year's resolutions!
Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: mgoin <mgoin64@gmail.com> Signed-off-by: 陈建华 <1647430658@qq.com>
Is there a new version released after this fix, or can it only be installed from the latest source code for now?
@Xiaojinhua You can install the nightly version:
@esmeetu @marksverdhei even with these changes from weeks ago, the glm-4.7-flash model does not start. Tldr: using the nightly docker image.
@gaby yes, we still recommend in the recipe for the model to install transformers from source: https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html
@mgoin Sadly that doesn't work when using vLLM with Docker, since we rely on the image provided by the vLLM team.
@gaby you can extend the existing image like this: https://x.com/thezachmueller/status/2014354173492432942?s=46&t=jLcDgQXDbYe6HgFmTNYgpg
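For Docker users, one way to follow the recipe's advice is to extend the published image. This Dockerfile is only a sketch: the base tag and the idea of installing transformers from source come from the thread above, everything else is an assumption:

```dockerfile
# Sketch: extend the published vLLM image so GLM-4.7-Flash can run
# without a local source build. The base tag here is an assumption;
# use whichever nightly or release tag you actually run.
FROM vllm/vllm-openai:nightly

# The GLM recipe recommends installing transformers from source.
RUN pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
```

Build it with `docker build -t vllm-glm .` and run it with the same arguments as the upstream image.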
@mgoin Will give that a try, thanks!
@gaby You can try this image: vllm/vllm-openai:glm5
@esmeetu thanks!

Summary
Add `glm4_moe_lite` and `glm4_moe_lite_mtp` to the `is_deepseek_mla()` check in `model_arch_config_convertor.py`.

GLM-4.7-Flash (`glm4_moe_lite`) uses Multi-head Latent Attention (MLA) via `Glm4MoeLiteMLAAttention` (which inherits from `DeepseekV2MLAAttention`) but was missing from the MLA detection. Without this fix, vLLM falls back to standard KV caching instead of efficient MLA caching, resulting in ~4x higher KV cache memory usage.
Note: `glm4_moe` is intentionally NOT included, as it uses standard attention (`Glm4MoeAttention` with `vllm.attention.layer.Attention`).

Co-authored with @mgoin
NOTE: SM100 has issues supporting this model with various MLA decode and prefill kernels, so the following changes were made to enable default inference there:
Test Plan
`marksverdhei/GLM-4.7-Flash-fp8` on 2x RTX 3090