Skip to content

UPSTREAM PR #17901: [model] add glm-asr support#508

Open
loci-dev wants to merge 7 commits intomainfrom
upstream-PR17901-branch_piDack-glm_asr_support
Open

UPSTREAM PR #17901: [model] add glm-asr support#508
loci-dev wants to merge 7 commits intomainfrom
upstream-PR17901-branch_piDack-glm_asr_support

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#17901

Make sure to read the contributing guidelines before submitting a PR

This PR adds support for the GLM-ASR architecture, specifically validating with the zai-org/GLM-ASR-Nano-2512 model.

Key Changes:

  • Model Support: Implemented necessary logic to support GLM-ASR models.
  • Conversion Script: Updated convert_hf_to_gguf.py to handle dynamic configuration keys (glm-asr use "lm_config" instead of text_config). It now correctly identifies the config section by checking:
    llm_config_key = "lm_config" if "lm_config" in self.hparams else "text_config"

Result

img_v3_02sr_36e8953d-e10a-4165-b587-5759da7d2deg

@loci-review
Copy link

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #508 - GLM-ASR Support

Overview

PR #508 adds GLM-ASR (Automatic Speech Recognition) architecture support to llama.cpp, introducing a new audio encoder model type. The changes span 4 files with 123 additions and 7 deletions, primarily affecting the multimodal library (libmtmd.so) and conversion scripts.

Key Findings

Performance-Critical Areas Impact:

The analysis reveals changes concentrated in the multimodal processing layer, not in core inference functions. The 10 functions with highest response time changes are all located in libmtmd.so:

  • std::vector::end: +145,364 ns response time increase
  • clip_has_whisper_encoder: +21,332 ns response time increase
  • ma_dr_wav__on_seek_memory: -45,772 ns response time improvement
  • ma_default_vfs_close__stdio: -14,701 ns response time improvement
  • ma_lpf_calculate_sub_lpf_counts: -24,884 ns response time improvement

These functions handle audio preprocessing and multimodal tensor operations, not token generation or language model inference.

Tokens Per Second Impact:

No impact on tokens per second is expected. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. The changes exclusively affect audio encoder initialization and multimodal projection, which occur before text generation begins. The GLM-ASR projector implements 8x sequence length reduction through downsampling, which should improve downstream LLM throughput by reducing input token count.

Power Consumption Analysis:

libmtmd.so shows 0.68% power consumption reduction (884 nJ decrease from 130,976 nJ to 130,091 nJ). All other binaries (libllama.so, llama-run, llama-bench, etc.) show zero change, confirming the changes are isolated to multimodal processing. The reduction stems from optimized audio processing functions in the miniaudio library.

Code Changes:

The PR introduces GLMA projector type with Whisper encoder integration. Key implementations include dynamic configuration key detection (lm_config vs text_config), special token handling for multi-EOS models, and a 2-layer MLP projector with layer normalization. One assertion was removed (post_ln_w validation) to accommodate models without post-layer normalization, which may reduce validation coverage but enables GLMA compatibility.

@loci-review
Copy link

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary - PR #508: GLM-ASR Model Support

Project: llama.cpp
PR: #508 - Add GLM-ASR support
Binary: build.bin.libmtmd.so


Overview

This PR adds support for the GLM-ASR audio model architecture by introducing a new projector type (PROJECTOR_TYPE_GLMA) and associated conversion logic. The changes are localized to multimodal projection components and do not affect core inference paths.


Key Findings

Code Changes Analysis

Primary Modifications:

  1. New Projector Type Implementation - Added PROJECTOR_TYPE_GLMA enum and registration in clip-impl.h and clip.cpp
  2. Whisper Encoder Integration - Extended clip_has_whisper_encoder() function to include GLMA projector type
  3. Graph Builder Extension - Added GLMA case to clip_image_build_graph() switch statement routing to build_whisper_enc()
  4. Tensor Loading Logic - Implemented GLMA-specific tensor loading in clip_model_loader for conv1d layers, MLP weights, normalization parameters, and BOI/EOI tokens
  5. Token Calculation - Added GLMA case in clip_n_output_tokens() with reshape factor of 4 and BOI/EOI token additions
  6. Projection Embedding - Added GLMA case in clip_n_mmproj_embd() returning mm_2_w dimensions
  7. Python Conversion Script - Updated convert_hf_to_gguf.py with GlmASRWhisperEncoderModel class and dynamic config key handling (lm_config vs text_config)

Structural Impact:

The clip_has_whisper_encoder() function shows a 28% throughput increase (12 ns absolute change from 46 ns to 58 ns) and 30% response time increase (22 ns absolute change from 71 ns to 93 ns). This is caused by adding one additional OR condition to the return statement, expanding from 2 projector type comparisons to 3. The compiler generated additional comparison and branch instructions, resulting in a slightly longer execution path.

Removed Assertion:

Line 1833 removed GGML_ASSERT(model.post_ln_w && model.post_ln_b) from build_whisper_enc(). This relaxes validation requirements, allowing GLMA models without post-layer normalization bias to load successfully. The assertion removal has negligible performance impact but broadens model compatibility.

Inference Performance Impact

Core Inference Functions: No changes detected in llama_decode, llama_encode, or llama_tokenize functions. These functions remain unmodified in this PR.

Tokens Per Second Impact: Zero impact on tokens per second. The modified functions are part of the multimodal projection layer used during audio encoding preprocessing, not the token generation inference path. Audio encoding occurs once per audio input during preprocessing, while token generation happens iteratively during inference.

Affected Component: The changes impact only the CLIP/multimodal encoder initialization and audio feature extraction pipeline. The 12 ns throughput increase in clip_has_whisper_encoder() is negligible given this function executes once during model initialization to determine encoder type.

Power Consumption Analysis

Power consumption analysis is not applicable for this PR as the changes introduce new functionality rather than modifying existing computational paths. The GLMA projector type is an additive feature that does not alter the execution profile of existing model types.


Conclusion

This PR successfully extends multimodal support to GLM-ASR models through well-isolated changes in the projection layer. The 12 ns throughput increase in encoder detection is negligible and does not affect inference performance or tokens per second metrics.

@loci-dev loci-dev force-pushed the main branch 13 times, most recently from 78ff3d3 to 117bfc3 Compare December 11, 2025 18:11
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from bf57f2c to af1ee09 Compare December 12, 2025 21:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 02b3f55 to ba4079a Compare December 17, 2025 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants