
UPSTREAM PR #17694: model : add ASR support for LFM2-Audio-1.5B#578

Open
loci-dev wants to merge 5 commits into main from
upstream-PR17694-branch_Liquid4All-tarek/feat/lfm2-asr-upstream

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17694

LFM2-Audio-1.5B supports audio input and audio output.

This PR adds ASR support only. To perform ASR, invoke the CLI with:

bin/llama-mtmd-cli -m LFM2-Audio-1.5B-F32.gguf --mmproj mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio input.wav -sys "Perform ASR." -p "<__media__>"

Changes to existing code:

  • the model requires a system prompt, so -sys is now enabled for llama-mtmd-cli
  • mel bin generation is reworked: the filter bank is now generated dynamically and supports different n_fft values
  • OP_SSM_CONV for the CUDA backend is extended to support kernel size 9
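The dynamic mel filter-bank generation mentioned above can be sketched as follows. This is a hedged illustration, not the actual mtmd-audio code: the function names, the HTK-style mel formula, and the triangular-filter construction are assumptions; the point is that once `n_fft` is a parameter, the `n_fft/2 + 1` frequency bins are derived from it rather than hard-coded.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative HTK-style mel conversions (not the actual upstream code).
static float hz_to_mel(float hz)  { return 2595.0f * std::log10(1.0f + hz / 700.0f); }
static float mel_to_hz(float mel) { return 700.0f * (std::pow(10.0f, mel / 2595.0f) - 1.0f); }

// Build n_mel triangular filters, each with n_fft/2 + 1 weights,
// for any n_fft and sample rate sr.
static std::vector<std::vector<float>> make_mel_filters(int n_fft, int n_mel, float sr) {
    const int   n_freq  = n_fft / 2 + 1;          // FFT bins up to Nyquist
    const float mel_max = hz_to_mel(sr / 2.0f);
    std::vector<float> pts(n_mel + 2);            // filter edge frequencies in Hz
    for (int i = 0; i < n_mel + 2; ++i) {
        pts[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }
    std::vector<std::vector<float>> filters(n_mel, std::vector<float>(n_freq, 0.0f));
    for (int m = 0; m < n_mel; ++m) {
        for (int k = 0; k < n_freq; ++k) {
            const float f = k * sr / n_fft;       // center frequency of FFT bin k
            if (f > pts[m] && f < pts[m + 1]) {
                filters[m][k] = (f - pts[m]) / (pts[m + 1] - pts[m]);          // rising edge
            } else if (f >= pts[m + 1] && f < pts[m + 2]) {
                filters[m][k] = (pts[m + 2] - f) / (pts[m + 2] - pts[m + 1]);  // falling edge
            }
        }
    }
    return filters;
}
```

With `n_fft = 512` each filter covers 257 frequency bins; with Whisper's `n_fft = 400` it would cover 201, which is exactly the flexibility the rework provides.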

cc: @ngxson

@loci-review

loci-review bot commented Dec 15, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #578 - LFM2-Audio ASR Support

Overview

PR #578 introduces ASR support for the LFM2-Audio-1.5B multimodal model through 678 additions across 16 files. The changes implement a new conformer-based audio encoder architecture, extend CUDA SSM convolution kernels to support kernel size 9, and add dynamic mel spectrogram generation with configurable FFT parameters.

Key Findings

Performance-Critical Function Impact

The analysis reveals no modifications to core inference functions (llama_decode, llama_encode, llama_tokenize) that drive token generation throughput. All changes are isolated to the multimodal audio processing pipeline, specifically:

CUDA SSM Convolution Kernel (ggml-cuda/ssm-conv.cu):

  • Refactored dispatch logic to support kernel size 9 alongside existing sizes 3 and 4
  • Lambda-based template instantiation eliminates code duplication while maintaining compile-time optimization
  • Register allocation increases from 32 bytes to 72 bytes per thread for size 9 kernels (9 floats × 2 arrays)
  • No impact on text-only inference paths
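The lambda-based dispatch described above can be sketched like this. All names here are illustrative stand-ins, not the actual ggml-cuda symbols: a single generic lambda is instantiated once per supported kernel size via `std::integral_constant`, so the per-size launch body is written only once while the size stays a compile-time constant.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder for a size-templated kernel launch; in the real backend this
// would configure and launch a CUDA kernel specialized on K.
template <int K>
static int run_conv_kernel() {
    return K; // return the compile-time width just to demonstrate dispatch
}

// Hypothetical dispatcher mirroring the pattern: one generic lambda,
// three compile-time instantiations (sizes 3, 4, and the new 9).
static int dispatch_ssm_conv(int kernel_size) {
    auto launch = [](auto k) {
        // decltype(k)::value is the kernel size as a compile-time constant.
        return run_conv_kernel<decltype(k)::value>();
    };
    switch (kernel_size) {
        case 3: return launch(std::integral_constant<int, 3>{});
        case 4: return launch(std::integral_constant<int, 4>{});
        case 9: return launch(std::integral_constant<int, 9>{});
        default: assert(false && "unsupported SSM conv kernel size");
    }
    return -1;
}
```

Adding a new size is then a one-line `case`, which is why the refactor avoids duplicating the launch body.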

Audio Preprocessing (mtmd-audio.cpp):

  • New mtmd_audio_preprocessor_lfm2 class implements 512-point FFT (vs 400-point in Whisper)
  • Per-feature normalization and natural log computation added
  • Processing overhead estimated at 15-20 ns per audio frame for FFT increase, 5-10 ns for normalization
  • Affects only audio input processing, not token generation
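The two post-processing steps called out above (natural log, then per-feature normalization) can be sketched as below. This is an assumed illustration, not the actual `mtmd_audio_preprocessor_lfm2` code: the epsilon constants and the zero-mean/unit-variance form are placeholders, but the structure shows why the cost is a linear pass over frames per mel bin.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// mel[t][m]: energy of mel bin m at frame t.
// Step 1: natural log of each energy; step 2: normalize each mel bin
// (feature) to zero mean and unit variance across all frames.
static void log_and_normalize(std::vector<std::vector<float>> & mel) {
    if (mel.empty()) return;
    const size_t n_frames = mel.size();
    const size_t n_mel    = mel[0].size();

    for (auto & frame : mel) {
        for (auto & v : frame) {
            v = std::log(v + 1e-10f); // natural log with a small floor
        }
    }
    for (size_t m = 0; m < n_mel; ++m) {
        double mean = 0.0, var = 0.0;
        for (size_t t = 0; t < n_frames; ++t) mean += mel[t][m];
        mean /= n_frames;
        for (size_t t = 0; t < n_frames; ++t) {
            const double d = mel[t][m] - mean;
            var += d * d;
        }
        const double sd = std::sqrt(var / n_frames) + 1e-5;
        for (size_t t = 0; t < n_frames; ++t) {
            mel[t][m] = (float)((mel[t][m] - mean) / sd);
        }
    }
}
```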

LFM2 Encoder Graph (lfm2-audio-enc.cpp):

  • 267-line conformer architecture with 7-layer convolutional subsampling, multi-head attention with relative positional encoding, and dual feed-forward networks
  • Graph contains 200+ operations including 8-10 matrix multiplications per conformer layer
  • Self-attention includes 6+ permute operations requiring memory copies
  • Depthwise convolution uses kernel size 9 via ggml_ssm_conv
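For orientation, the residual wiring of a standard "Macaron" conformer block (Gulati et al.), which the description above suggests the encoder follows, looks like this. The sub-modules are scalar placeholders rather than real attention or convolution; only the ordering and half-step feed-forward residuals are the point, and whether lfm2-audio-enc.cpp matches this exact form is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Placeholder sub-modules (not real layers) so the wiring is executable.
static float ffn   (float x) { return 2.0f * x; }  // feed-forward network
static float mhsa  (float x) { return x + 1.0f; }  // self-attention (rel. pos. enc.)
static float dwconv(float x) { return 0.5f * x; }  // depthwise conv, kernel size 9

// Conformer block: FFN/2 -> MHSA -> conv -> FFN/2, each as a residual.
static float conformer_block(float x) {
    x = x + 0.5f * ffn(x);   // first half-step feed-forward residual
    x = x + mhsa(x);         // self-attention residual
    x = x + dwconv(x);       // convolution-module residual
    x = x + 0.5f * ffn(x);   // second half-step feed-forward residual
    return x;                // (final layer norm omitted in this sketch)
}
```

Each block thus contains two feed-forward stacks plus attention projections, which is consistent with the 8-10 matrix multiplications per layer counted above.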

Tokens Per Second Impact

No degradation expected for text inference. The reference metric (7% TPS reduction for 2000 ns slower llama_decode on smollm:135m/i7-1255U/CPU) does not apply because:

  • llama_decode, llama_encode, and llama_tokenize remain unmodified
  • All changes target audio encoder path executed only during multimodal ASR tasks
  • Text-only workloads bypass the new audio processing pipeline entirely

For audio-to-text inference, the new LFM2 encoder introduces expected computational overhead inherent to the conformer architecture, but this represents new functionality rather than regression of existing capabilities.

Power Consumption Analysis

Binary-level analysis shows no changes to core llama inference binaries. The PR adds new code paths within the mtmd (multimodal) tooling:

  • llama-mtmd-cli binary: Includes new LFM2 audio encoder graph and preprocessing logic
  • ggml CUDA library: Extended SSM convolution kernel instantiation increases binary size by approximately 10-20 KB for kernel size 9 templates

Power consumption impact is limited to audio processing workloads. Text inference power draw remains unchanged as the execution path does not invoke audio encoder operations.

Code Implementation Assessment

The changes implement well-structured additions:

  • CUDA kernel refactoring uses modern C++14 patterns (std::integral_constant) for type-safe template dispatch
  • Audio preprocessing follows existing polymorphic interface design
  • Conformer graph construction uses established ggml operation primitives
  • System prompt handling integrates cleanly with existing message evaluation infrastructure

The implementation represents a feature addition rather than modification of existing inference paths, explaining the absence of performance impact on core text generation metrics.
