
fix: Add glm4_moe_lite to MLA detection#32614

Merged
vllm-bot merged 10 commits into vllm-project:main from marksverdhei:fix/glm4-moe-mla-detection
Jan 23, 2026

Conversation

@marksverdhei
Contributor

@marksverdhei marksverdhei commented Jan 19, 2026

Summary

  • Add glm4_moe_lite and glm4_moe_lite_mtp to is_deepseek_mla() check in model_arch_config_convertor.py

GLM-4.7-Flash (glm4_moe_lite) uses Multi-head Latent Attention (MLA) via Glm4MoeLiteMLAAttention (which inherits from DeepseekV2MLAAttention) but was missing from the MLA detection.

Without this fix, vLLM falls back to standard KV caching instead of efficient MLA caching, resulting in ~4x higher KV cache memory usage.

Note: glm4_moe is intentionally NOT included as it uses standard attention (Glm4MoeAttention with vllm.attention.layer.Attention).
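The detection change described above can be sketched roughly as follows. This is a hedged illustration, not vLLM's actual code: the real helper lives in model_arch_config_convertor.py, and its exact signature and the set of recognized model types may differ.

```python
# Illustrative sketch of the is_deepseek_mla() change; names and the
# exact set of model types are assumptions, not vLLM's real source.
MLA_MODEL_TYPES = {
    "deepseek_v2",
    "deepseek_v3",
    "glm4_moe_lite",      # added by this PR: uses Glm4MoeLiteMLAAttention
    "glm4_moe_lite_mtp",  # added by this PR: MTP variant of the above
    # "glm4_moe" is deliberately absent: it uses standard attention
    # (Glm4MoeAttention with vllm.attention.layer.Attention).
}

def is_deepseek_mla(model_type: str, kv_lora_rank) -> bool:
    """Return True if the model should use MLA-style KV caching."""
    # MLA requires both a recognized model type and a kv_lora_rank in
    # the config (the latent dimension of the compressed KV cache).
    return model_type in MLA_MODEL_TYPES and kv_lora_rank is not None
```

Without the two added entries, a `glm4_moe_lite` config falls through this check and vLLM allocates a full per-head KV cache instead of the compressed latent cache.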

Co-authored with @mgoin

NOTE: SM100 has issues supporting this model with various MLA decode and prefill kernels, so the following changes were made to enable default inference there:

  • Disable TRT-LLM prefill and FlashInfer prefill if DeepSeek-R1-compatible MLA dimensions are not found
  • Disable FlashInfer MLA if DeepSeek-R1-compatible MLA dimensions are not found
  • Explicitly enable CUTLASS MLA so that block_size=128 is enforced
  • As a result, SM100 runs with CUTLASS_MLA decode and FA2 prefill by default for this model. I also tested that TRITON_MLA works.
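The gating logic above can be sketched as a small dispatch function. This is a hedged illustration: the function name and return shape are invented for this example, not vLLM's API. DeepSeek-R1's MLA dimensions are qk_nope_head_dim=128, qk_rope_head_dim=64, kv_lora_rank=512; GLM-4.7-Flash has qk_nope_head_dim=192, which the TRT-LLM and FlashInfer MLA kernels reject (the 192-vs-128 error reported later in this thread).

```python
# Illustrative sketch of the SM100 backend gating described above.
# Function and backend names are assumptions for this example only.
def select_sm100_mla_backends(qk_nope_head_dim: int,
                              qk_rope_head_dim: int,
                              kv_lora_rank: int) -> dict:
    # "DeepSeek-R1-compatible" dims that the FlashInfer/TRT-LLM MLA
    # kernels were written for.
    r1_compatible = (qk_nope_head_dim == 128
                     and qk_rope_head_dim == 64
                     and kv_lora_rank == 512)
    if r1_compatible:
        # R1-shaped models can keep the FlashInfer/TRT-LLM MLA kernels.
        return {"decode": "FLASHINFER_MLA", "prefill": "TRTLLM"}
    # Otherwise fall back to CUTLASS MLA decode (which enforces
    # block_size=128) and FA2 prefill.
    return {"decode": "CUTLASS_MLA", "prefill": "FA2", "block_size": 128}
```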

Test Plan

  • Tested with marksverdhei/GLM-4.7-Flash-fp8 on 2x RTX 3090
  • Verified MLA is detected and efficient KV caching is used
  • Model runs with 14.7 GB VRAM per GPU

GLM-4.7-Flash (glm4_moe_lite) and GLM-4.6 (glm4_moe) use the same
Multi-head Latent Attention (MLA) architecture as DeepSeek models
but were not included in the is_deepseek_mla() check.

This caused vLLM to fall back to standard KV caching instead of
efficient MLA caching, resulting in significantly higher memory
usage (4x more KV cache than necessary).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to enable Multi-head Latent Attention (MLA) for GLM-4 MoE models. The change correctly adds glm4_moe_lite to the MLA detection logic, which is consistent with its implementation that supports MLA. However, adding glm4_moe is problematic as its current implementation in vLLM does not seem to support MLA, which could lead to runtime issues. I've provided a critical comment with a suggestion to only include glm4_moe_lite for now.

GLM-4.7-Flash (glm4_moe_lite) uses the same Multi-head Latent Attention
(MLA) architecture as DeepSeek models but was not included in the
is_deepseek_mla() check.

This caused vLLM to fall back to standard KV caching instead of
efficient MLA caching, resulting in significantly higher memory
usage (4x more KV cache than necessary).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
@marksverdhei marksverdhei force-pushed the fix/glm4-moe-mla-detection branch from 09521b2 to edaf4f7 Compare January 19, 2026 20:04
@marksverdhei marksverdhei changed the title from "fix: Add GLM-4 MoE models to MLA detection" to "fix: Add glm4_moe_lite to MLA detection" Jan 19, 2026
Collaborator

@MatthewBonanni MatthewBonanni left a comment

LGTM, thanks for catching this!

Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) January 20, 2026 15:49
@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jan 20, 2026
@mgoin
Member

mgoin commented Jan 20, 2026

@LucasWilkinson @MatthewBonanni When I run with this PR on B200, I get this error. I think we need to change the kernel support registration.

(EngineCore_DP0 pid=1261690)   File "/home/mgoin/code/vllm/.venv/lib/python3.12/site-packages/flashinfer/decode.py", line 2491, in _check_trtllm_gen_mla_shape
(EngineCore_DP0 pid=1261690)     raise ValueError(f"Expected qk_nope_head_dim == 128, got {qk_nope_head_dim}")
(EngineCore_DP0 pid=1261690) ValueError: Expected qk_nope_head_dim == 128, got 192

Member

@mgoin mgoin left a comment

Blocking merge while we investigate which MLA backends are/can be supported for this model. For instance forcing TRITON_MLA on B200 results in 0% on GSM8k

Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
auto-merge was automatically disabled January 21, 2026 15:17

Head branch was pushed to by a user without write access

@marksverdhei
Contributor Author

Blocking merge while we investigate which MLA backends are/can be supported for this model. For instance forcing TRITON_MLA on B200 results in 0% on GSM8k

Could it be because of the known looping issues that llama.cpp also had? I.e., 0% because the model never arrives at an answer and only generates think tokens, or did you already control for that? I have never contributed to vLLM before, so I'm not completely familiar with the testing procedures.

@mdierolf

LGTM!

Works well on RTX 6000 Blackwell, model is unusable without this.

Nobody is going to run this small model on B200 anyway, merge it and fix B200 mañana!

Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin mgoin mentioned this pull request Jan 23, 2026
Member

@mgoin mgoin left a comment

Should be good to go now with the Blackwell fixes, thanks for kicking this off @marksverdhei !

@github-project-automation github-project-automation bot moved this from In review to Ready in NVIDIA Jan 23, 2026
@vllm-bot vllm-bot merged commit 586a57a into vllm-project:main Jan 23, 2026
55 of 57 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 23, 2026
@marksverdhei
Contributor Author

marksverdhei commented Jan 24, 2026

Thank you for merging! I can now call myself a proud contributor to vLLM, with those two lines of strings! 😆 Way to kick off my 2026 New Year's resolutions!

cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: 陈建华 <1647430658@qq.com>
@Xiaojinhua

Is there a new version released with this fix, or can it only be installed from the latest source code for now?

lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
@esmeetu
Member

esmeetu commented Jan 28, 2026

@Xiaojinhua You can install the nightly version:

uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

@gaby

gaby commented Feb 18, 2026

@esmeetu @marksverdhei even with these changes merged weeks ago, the glm-4.7-flash model does not start because transformers is pinned to <5.0.0.

TL;DR: using the nightly Docker image.

@mgoin
Member

mgoin commented Feb 18, 2026

@gaby yes we still recommend in the recipe for the model to install transformers from source https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html

@gaby

gaby commented Feb 18, 2026

@mgoin Sadly that doesn't work when using vLLM with Docker, since we rely on the image provided by the vLLM team.

@mgoin
Member

mgoin commented Feb 18, 2026

@gaby you can extend the existing image as shown in this example:

https://x.com/thezachmueller/status/2014354173492432942?s=46&t=jLcDgQXDbYe6HgFmTNYgpg

@gaby

gaby commented Feb 18, 2026

@mgoin Will give that a try, thanks!

@esmeetu
Member

esmeetu commented Feb 18, 2026

@gaby You can try this image: vllm/vllm-openai:glm5

@gaby

gaby commented Feb 18, 2026

@esmeetu thanks!

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: Done


10 participants