UPSTREAM PR #17901: [model] add glm-asr support by loci-dev · Pull Request #508 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-10T03:49:04Z

Make sure to read the contributing guidelines before submitting a PR

This PR adds support for the GLM-ASR architecture, specifically validating with the zai-org/GLM-ASR-Nano-2512 model.

Key Changes:

Model Support: Implemented necessary logic to support GLM-ASR models.
Conversion Script: Updated convert_hf_to_gguf.py to handle dynamic configuration keys (glm-asr use "lm_config" instead of text_config). It now correctly identifies the config section by checking:
llm_config_key = "lm_config" if "lm_config" in self.hparams else "text_config"

Result

loci-review · 2025-12-10T04:45:15Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #508 - GLM-ASR Support

Overview

PR #508 adds GLM-ASR (Automatic Speech Recognition) architecture support to llama.cpp, introducing a new audio encoder model type. The changes span 4 files with 123 additions and 7 deletions, primarily affecting the multimodal library (libmtmd.so) and conversion scripts.

Key Findings

Performance-Critical Areas Impact:

The analysis reveals changes concentrated in the multimodal processing layer, not in core inference functions. The 10 functions with highest response time changes are all located in libmtmd.so:

std::vector::end: +145,364 ns response time increase
clip_has_whisper_encoder: +21,332 ns response time increase
ma_dr_wav__on_seek_memory: -45,772 ns response time improvement
ma_default_vfs_close__stdio: -14,701 ns response time improvement
ma_lpf_calculate_sub_lpf_counts: -24,884 ns response time improvement

These functions handle audio preprocessing and multimodal tensor operations, not token generation or language model inference.

Tokens Per Second Impact:

No impact on tokens per second is expected. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. The changes exclusively affect audio encoder initialization and multimodal projection, which occur before text generation begins. The GLM-ASR projector implements 8x sequence length reduction through downsampling, which should improve downstream LLM throughput by reducing input token count.

Power Consumption Analysis:

libmtmd.so shows 0.68% power consumption reduction (884 nJ decrease from 130,976 nJ to 130,091 nJ). All other binaries (libllama.so, llama-run, llama-bench, etc.) show zero change, confirming the changes are isolated to multimodal processing. The reduction stems from optimized audio processing functions in the miniaudio library.

Code Changes:

The PR introduces GLMA projector type with Whisper encoder integration. Key implementations include dynamic configuration key detection (lm_config vs text_config), special token handling for multi-EOS models, and a 2-layer MLP projector with layer normalization. One assertion was removed (post_ln_w validation) to accommodate models without post-layer normalization, which may reduce validation coverage but enables GLMA compatibility.

loci-review · 2025-12-10T05:33:18Z

Explore the complete analysis inside the Version Insights

Performance Review Summary - PR #508: GLM-ASR Model Support

Project: llama.cpp
PR: #508 - Add GLM-ASR support
Binary: build.bin.libmtmd.so

Overview

This PR adds support for the GLM-ASR audio model architecture by introducing a new projector type (PROJECTOR_TYPE_GLMA) and associated conversion logic. The changes are localized to multimodal projection components and do not affect core inference paths.

Key Findings

Code Changes Analysis

Primary Modifications:

New Projector Type Implementation - Added PROJECTOR_TYPE_GLMA enum and registration in clip-impl.h and clip.cpp
Whisper Encoder Integration - Extended clip_has_whisper_encoder() function to include GLMA projector type
Graph Builder Extension - Added GLMA case to clip_image_build_graph() switch statement routing to build_whisper_enc()
Tensor Loading Logic - Implemented GLMA-specific tensor loading in clip_model_loader for conv1d layers, MLP weights, normalization parameters, and BOI/EOI tokens
Token Calculation - Added GLMA case in clip_n_output_tokens() with reshape factor of 4 and BOI/EOI token additions
Projection Embedding - Added GLMA case in clip_n_mmproj_embd() returning mm_2_w dimensions
Python Conversion Script - Updated convert_hf_to_gguf.py with GlmASRWhisperEncoderModel class and dynamic config key handling (lm_config vs text_config)

Structural Impact:

The clip_has_whisper_encoder() function shows a 28% throughput increase (12 ns absolute change from 46 ns to 58 ns) and 30% response time increase (22 ns absolute change from 71 ns to 93 ns). This is caused by adding one additional OR condition to the return statement, expanding from 2 projector type comparisons to 3. The compiler generated additional comparison and branch instructions, resulting in a slightly longer execution path.

Removed Assertion:

Line 1833 removed GGML_ASSERT(model.post_ln_w && model.post_ln_b) from build_whisper_enc(). This relaxes validation requirements, allowing GLMA models without post-layer normalization bias to load successfully. The assertion removal has negligible performance impact but broadens model compatibility.

Inference Performance Impact

Core Inference Functions: No changes detected in llama_decode, llama_encode, or llama_tokenize functions. These functions remain unmodified in this PR.

Tokens Per Second Impact: Zero impact on tokens per second. The modified functions are part of the multimodal projection layer used during audio encoding preprocessing, not the token generation inference path. Audio encoding occurs once per audio input during preprocessing, while token generation happens iteratively during inference.

Affected Component: The changes impact only the CLIP/multimodal encoder initialization and audio feature extraction pipeline. The 12 ns throughput increase in clip_has_whisper_encoder() is negligible given this function executes once during model initialization to determine encoder type.

Power Consumption Analysis

Power consumption analysis is not applicable for this PR as the changes introduce new functionality rather than modifying existing computational paths. The GLMA projector type is an additive feature that does not alter the execution profile of existing model types.

Conclusion

This PR successfully extends multimodal support to GLM-ASR models through well-isolated changes in the projection layer. The 12 ns throughput increase in encoder detection is negligible and does not affect inference performance or tokens per second metrics.

…build_stack for padding and review

[model] add glm-asr support

f432a5c

loci-dev temporarily deployed to PROD__AL_DEMO December 10, 2025 03:49 — with GitHub Actions Inactive

piDack added 2 commits December 10, 2025 03:54

fix format for ci

c382d64

fix convert format for ci

e8a1ec5

loci-dev temporarily deployed to PROD__AL_DEMO December 10, 2025 04:42 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 13 times, most recently from 78ff3d3 to 117bfc3 Compare December 11, 2025 18:11

update glm_asr convert script & use build_ffn for glm_asr clip & use …

103e894

…build_stack for padding and review

loci-dev force-pushed the main branch 9 times, most recently from bf57f2c to af1ee09 Compare December 12, 2025 21:08

loci-dev force-pushed the main branch 30 times, most recently from 02b3f55 to ba4079a Compare December 17, 2025 21:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #17901: [model] add glm-asr support#508

UPSTREAM PR #17901: [model] add glm-asr support#508
loci-dev wants to merge 7 commits intomainfrom
upstream-PR17901-branch_piDack-glm_asr_support

loci-dev commented Dec 10, 2025

Uh oh!

loci-review bot commented Dec 10, 2025

Uh oh!

loci-review bot commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Dec 10, 2025

Uh oh!

loci-review bot commented Dec 10, 2025

Performance Analysis Summary: PR #508 - GLM-ASR Support

Overview

Key Findings

Uh oh!

loci-review bot commented Dec 10, 2025

Performance Review Summary - PR #508: GLM-ASR Model Support

Overview

Key Findings

Code Changes Analysis

Inference Performance Impact

Power Consumption Analysis

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants