UPSTREAM PR #17901: [model] add glm-asr support (#508)
Conversation
**Performance Analysis Summary: PR #508 - GLM-ASR Support**

**Overview**

PR #508 adds GLM-ASR (Automatic Speech Recognition) architecture support to llama.cpp, introducing a new audio encoder model type. The changes span 4 files with 123 additions and 7 deletions, primarily affecting the multimodal library (libmtmd.so) and conversion scripts.

**Key Findings**

**Performance-Critical Areas Impact:** The changes are concentrated in the multimodal processing layer, not in core inference functions. The 10 functions with the highest response-time changes are all located in libmtmd.so; they handle audio preprocessing and multimodal tensor operations, not token generation or language-model inference.

**Tokens Per Second Impact:** No impact on tokens per second is expected. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. The changes exclusively affect audio encoder initialization and multimodal projection, which occur before text generation begins. The GLM-ASR projector implements an 8x sequence-length reduction through downsampling, which should improve downstream LLM throughput by reducing the input token count.

**Power Consumption Analysis:** libmtmd.so shows a 0.68% power consumption reduction (an 884 nJ decrease, from 130,976 nJ to 130,091 nJ). All other binaries (libllama.so, llama-run, llama-bench, etc.) show zero change, confirming the changes are isolated to multimodal processing. The reduction stems from optimized audio-processing functions in the miniaudio library.

**Code Changes:** The PR introduces the GLMA projector type with Whisper encoder integration. Key implementations include dynamic configuration-key detection (lm_config vs. text_config), special-token handling for multi-EOS models, and a 2-layer MLP projector with layer normalization. One assertion (the post_ln_w validation) was removed to accommodate models without post-layer normalization, which may reduce validation coverage but enables GLMA compatibility.
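The 8x sequence-length reduction mentioned above can be illustrated with a minimal sketch: concatenate each group of 8 consecutive encoder frames along the feature axis, shrinking the sequence dimension by the downsampling factor. This is an illustrative reconstruction, not the PR's actual tensor code; the function name `downsample_frames` and the handling of trailing frames are assumptions.

```python
# Illustrative sketch (not the PR's code) of how an audio projector can cut
# sequence length 8x: stack each group of 8 encoder frames into one wider
# feature vector, which a projector MLP would then map into the LLM's
# embedding space.
def downsample_frames(frames, factor=8):
    """frames: list of per-frame feature vectors (lists of floats)."""
    # Drop any trailing remainder that doesn't fill a complete group
    # (boundary handling here is an assumption).
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]  # concat features
        for i in range(0, usable, factor)
    ]

frames = [[float(i)] * 4 for i in range(32)]   # 32 frames, 4 features each
out = downsample_frames(frames)
print(len(frames), "->", len(out))             # 32 -> 4
print(len(out[0]))                             # 8 * 4 = 32 features per step
```

Fewer, wider positions mean fewer audio tokens entering the language model, which is why the downsampling can improve downstream decode throughput without touching the inference path itself.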
**Performance Review Summary - PR #508: GLM-ASR Model Support**

**Project:** llama.cpp

**Overview**

This PR adds support for the GLM-ASR audio model architecture by introducing a new projector type (PROJECTOR_TYPE_GLMA) and associated conversion logic. The changes are localized to multimodal projection components and do not affect core inference paths.

**Key Findings**

**Code Changes Analysis**

**Structural Impact:** The modifications are confined to the multimodal projection components and the conversion script; core model code is untouched.

**Removed Assertion:** The assertion at line 1833 (the post_ln_w validation) was removed to support models without post-layer normalization.

**Inference Performance Impact**

**Core Inference Functions:** No changes detected in llama_decode, llama_encode, or llama_tokenize; these functions remain unmodified in this PR.

**Tokens Per Second Impact:** Zero impact on tokens per second. The modified functions are part of the multimodal projection layer used during audio-encoding preprocessing, not the token-generation inference path. Audio encoding occurs once per audio input during preprocessing, while token generation happens iteratively during inference.

**Affected Component:** The changes impact only the CLIP/multimodal encoder initialization and audio feature extraction pipeline. The 12 ns throughput increase in encoder detection is negligible.

**Power Consumption Analysis**

Power consumption analysis is not applicable for this PR, as the changes introduce new functionality rather than modifying existing computational paths. The GLMA projector type is an additive feature that does not alter the execution profile of existing model types.

**Conclusion**

This PR successfully extends multimodal support to GLM-ASR models through well-isolated changes in the projection layer. The 12 ns throughput increase in encoder detection is negligible and does not affect inference performance or tokens-per-second metrics.
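Both summaries describe the GLMA projector as a 2-layer MLP with layer normalization. The sketch below shows one plausible shape for such a projector; it is a hand-written illustration, not the PR's implementation, and the normalization placement, GELU activation, and the names `layer_norm`, `linear`, and `glma_project` are all assumptions.

```python
import math

# Hypothetical sketch of a 2-layer MLP projector with layer normalization,
# mapping one audio feature vector into the LLM embedding space.
def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, w, b):
    # w is out_dim x in_dim; plain matrix-vector product plus bias.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def glma_project(x, w1, b1, w2, b2):
    # norm -> linear -> GELU -> linear; the activation choice and the
    # position of the normalization are assumptions for illustration.
    h = linear(layer_norm(x), w1, b1)
    h = [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in h]  # GELU
    return linear(h, w2, b2)
```

Because this projector runs once per audio input during preprocessing, its cost does not appear in the per-token decode path, consistent with the zero tokens-per-second impact reported above.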
Mirrored from ggml-org/llama.cpp#17901
This PR adds support for the GLM-ASR architecture, validated with the zai-org/GLM-ASR-Nano-2512 model.
Key Changes:
- Updated `convert_hf_to_gguf.py` to handle dynamic configuration keys (GLM-ASR uses `"lm_config"` instead of `"text_config"`). The script now identifies the config section by checking:

  ```python
  llm_config_key = "lm_config" if "lm_config" in self.hparams else "text_config"
  ```

Result
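The key-detection pattern above can be sketched in isolation. In this sketch, `hparams` stands in for the parsed Hugging Face `config.json`, and the helper name `get_text_config` is illustrative, not from the PR; falling back to the top-level dict when neither key exists is also an assumption.

```python
# Minimal sketch of the dynamic config-key lookup used by the conversion
# script. GLM-ASR nests its language-model settings under "lm_config",
# while most multimodal configs use "text_config".
def get_text_config(hparams: dict) -> dict:
    llm_config_key = "lm_config" if "lm_config" in hparams else "text_config"
    # Fall back to the whole config if neither nested section exists
    # (fallback behavior is an assumption for this sketch).
    return hparams.get(llm_config_key, hparams)

glm_asr_like = {"lm_config": {"hidden_size": 1024}}
generic = {"text_config": {"hidden_size": 4096}}
print(get_text_config(glm_asr_like)["hidden_size"])  # 1024
print(get_text_config(generic)["hidden_size"])       # 4096
```

Checking for the model-specific key first keeps the conversion path for existing `text_config`-style models unchanged, which matches the additive nature of the PR.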