
UPSTREAM PR #17694: model : add ASR support for LFM2-Audio-1.5B#578

Open
loci-dev wants to merge 5 commits into main from
upstream-PR17694-branch_Liquid4All-tarek/feat/lfm2-asr-upstream

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17694

LFM2-Audio-1.5B supports audio input and audio output.

This PR adds ASR support only. To perform ASR, invoke the CLI with:

bin/llama-mtmd-cli -m LFM2-Audio-1.5B-F32.gguf --mmproj mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio input.wav -sys "Perform ASR." -p "<__media__>"

Changes to existing code:

  • the model requires a system prompt, so -sys is now enabled for llama-mtmd-cli
  • mel bin generation is reworked: the filter bank is now generated dynamically and supports different n_fft values
  • OP_SSM_CONV for the CUDA backend is extended to support kernel size 9
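The dynamic mel filter-bank generation mentioned above can be sketched as follows. This is a hedged illustration, not the actual mtmd-audio code: the function names, the HTK-style mel formula, and the triangular-filter construction are assumptions; the point is that once `n_fft` is a parameter, the `n_fft/2 + 1` frequency bins are derived from it rather than hard-coded.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative HTK-style mel conversions (not the actual upstream code).
static float hz_to_mel(float hz)  { return 2595.0f * std::log10(1.0f + hz / 700.0f); }
static float mel_to_hz(float mel) { return 700.0f * (std::pow(10.0f, mel / 2595.0f) - 1.0f); }

// Build n_mel triangular filters, each with n_fft/2 + 1 weights,
// for any n_fft and sample rate sr.
static std::vector<std::vector<float>> make_mel_filters(int n_fft, int n_mel, float sr) {
    const int   n_freq  = n_fft / 2 + 1;          // FFT bins up to Nyquist
    const float mel_max = hz_to_mel(sr / 2.0f);
    std::vector<float> pts(n_mel + 2);            // filter edge frequencies in Hz
    for (int i = 0; i < n_mel + 2; ++i) {
        pts[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }
    std::vector<std::vector<float>> filters(n_mel, std::vector<float>(n_freq, 0.0f));
    for (int m = 0; m < n_mel; ++m) {
        for (int k = 0; k < n_freq; ++k) {
            const float f = k * sr / n_fft;       // center frequency of FFT bin k
            if (f > pts[m] && f < pts[m + 1]) {
                filters[m][k] = (f - pts[m]) / (pts[m + 1] - pts[m]);          // rising edge
            } else if (f >= pts[m + 1] && f < pts[m + 2]) {
                filters[m][k] = (pts[m + 2] - f) / (pts[m + 2] - pts[m + 1]);  // falling edge
            }
        }
    }
    return filters;
}
```

With `n_fft = 512` each filter covers 257 frequency bins; with Whisper's `n_fft = 400` it would cover 201, which is exactly the flexibility the rework provides.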

cc: @ngxson

@loci-review

loci-review bot commented Dec 15, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #578 - LFM2-Audio ASR Support

Overview

PR #578 introduces ASR support for the LFM2-Audio-1.5B multimodal model through 678 additions across 16 files. The changes implement a new conformer-based audio encoder architecture, extend CUDA SSM convolution kernels to support kernel size 9, and add dynamic mel spectrogram generation with configurable FFT parameters.

Key Findings

Performance-Critical Function Impact

The analysis reveals no modifications to core inference functions (llama_decode, llama_encode, llama_tokenize) that drive token generation throughput. All changes are isolated to the multimodal audio processing pipeline, specifically:

CUDA SSM Convolution Kernel (ggml-cuda/ssm-conv.cu):

  • Refactored dispatch logic to support kernel size 9 alongside existing sizes 3 and 4
  • Lambda-based template instantiation eliminates code duplication while maintaining compile-time optimization
  • Register allocation increases from 32 bytes to 72 bytes per thread for size 9 kernels (9 floats × 2 arrays)
  • No impact on text-only inference paths
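The lambda-based dispatch described above can be sketched like this. All names here are illustrative stand-ins, not the actual ggml-cuda symbols: a single generic lambda is instantiated once per supported kernel size via `std::integral_constant`, so the per-size launch body is written only once while the size stays a compile-time constant.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder for a size-templated kernel launch; in the real backend this
// would configure and launch a CUDA kernel specialized on K.
template <int K>
static int run_conv_kernel() {
    return K; // return the compile-time width just to demonstrate dispatch
}

// Hypothetical dispatcher mirroring the pattern: one generic lambda,
// three compile-time instantiations (sizes 3, 4, and the new 9).
static int dispatch_ssm_conv(int kernel_size) {
    auto launch = [](auto k) {
        // decltype(k)::value is the kernel size as a compile-time constant.
        return run_conv_kernel<decltype(k)::value>();
    };
    switch (kernel_size) {
        case 3: return launch(std::integral_constant<int, 3>{});
        case 4: return launch(std::integral_constant<int, 4>{});
        case 9: return launch(std::integral_constant<int, 9>{});
        default: assert(false && "unsupported SSM conv kernel size");
    }
    return -1;
}
```

Adding a new size is then a one-line `case`, which is why the refactor avoids duplicating the launch body.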

Audio Preprocessing (mtmd-audio.cpp):

  • New mtmd_audio_preprocessor_lfm2 class implements 512-point FFT (vs 400-point in Whisper)
  • Per-feature normalization and natural log computation added
  • Processing overhead estimated at 15-20 ns per audio frame for FFT increase, 5-10 ns for normalization
  • Affects only audio input processing, not token generation
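The two post-processing steps called out above (natural log, then per-feature normalization) can be sketched as below. This is an assumed illustration, not the actual `mtmd_audio_preprocessor_lfm2` code: the epsilon constants and the zero-mean/unit-variance form are placeholders, but the structure shows why the cost is a linear pass over frames per mel bin.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// mel[t][m]: energy of mel bin m at frame t.
// Step 1: natural log of each energy; step 2: normalize each mel bin
// (feature) to zero mean and unit variance across all frames.
static void log_and_normalize(std::vector<std::vector<float>> & mel) {
    if (mel.empty()) return;
    const size_t n_frames = mel.size();
    const size_t n_mel    = mel[0].size();

    for (auto & frame : mel) {
        for (auto & v : frame) {
            v = std::log(v + 1e-10f); // natural log with a small floor
        }
    }
    for (size_t m = 0; m < n_mel; ++m) {
        double mean = 0.0, var = 0.0;
        for (size_t t = 0; t < n_frames; ++t) mean += mel[t][m];
        mean /= n_frames;
        for (size_t t = 0; t < n_frames; ++t) {
            const double d = mel[t][m] - mean;
            var += d * d;
        }
        const double sd = std::sqrt(var / n_frames) + 1e-5;
        for (size_t t = 0; t < n_frames; ++t) {
            mel[t][m] = (float)((mel[t][m] - mean) / sd);
        }
    }
}
```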

LFM2 Encoder Graph (lfm2-audio-enc.cpp):

  • 267-line conformer architecture with 7-layer convolutional subsampling, multi-head attention with relative positional encoding, and dual feed-forward networks
  • Graph contains 200+ operations including 8-10 matrix multiplications per conformer layer
  • Self-attention includes 6+ permute operations requiring memory copies
  • Depthwise convolution uses kernel size 9 via ggml_ssm_conv
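For orientation, the residual wiring of a standard "Macaron" conformer block (Gulati et al.), which the description above suggests the encoder follows, looks like this. The sub-modules are scalar placeholders rather than real attention or convolution; only the ordering and half-step feed-forward residuals are the point, and whether lfm2-audio-enc.cpp matches this exact form is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Placeholder sub-modules (not real layers) so the wiring is executable.
static float ffn   (float x) { return 2.0f * x; }  // feed-forward network
static float mhsa  (float x) { return x + 1.0f; }  // self-attention (rel. pos. enc.)
static float dwconv(float x) { return 0.5f * x; }  // depthwise conv, kernel size 9

// Conformer block: FFN/2 -> MHSA -> conv -> FFN/2, each as a residual.
static float conformer_block(float x) {
    x = x + 0.5f * ffn(x);   // first half-step feed-forward residual
    x = x + mhsa(x);         // self-attention residual
    x = x + dwconv(x);       // convolution-module residual
    x = x + 0.5f * ffn(x);   // second half-step feed-forward residual
    return x;                // (final layer norm omitted in this sketch)
}
```

Each block thus contains two feed-forward stacks plus attention projections, which is consistent with the 8-10 matrix multiplications per layer counted above.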

Tokens Per Second Impact

No degradation expected for text inference. The reference metric (7% TPS reduction for 2000 ns slower llama_decode on smollm:135m/i7-1255U/CPU) does not apply because:

  • llama_decode, llama_encode, and llama_tokenize remain unmodified
  • All changes target audio encoder path executed only during multimodal ASR tasks
  • Text-only workloads bypass the new audio processing pipeline entirely

For audio-to-text inference, the new LFM2 encoder introduces expected computational overhead inherent to the conformer architecture, but this represents new functionality rather than regression of existing capabilities.

Power Consumption Analysis

Binary-level analysis shows no changes to core llama inference binaries. The PR adds new code paths within the mtmd (multimodal) tooling:

  • llama-mtmd-cli binary: Includes new LFM2 audio encoder graph and preprocessing logic
  • ggml CUDA library: Extended SSM convolution kernel instantiation increases binary size by approximately 10-20 KB for kernel size 9 templates

Power consumption impact is limited to audio processing workloads. Text inference power draw remains unchanged as the execution path does not invoke audio encoder operations.

Code Implementation Assessment

The changes implement well-structured additions:

  • CUDA kernel refactoring uses modern C++14 patterns (std::integral_constant) for type-safe template dispatch
  • Audio preprocessing follows existing polymorphic interface design
  • Conformer graph construction uses established ggml operation primitives
  • System prompt handling integrates cleanly with existing message evaluation infrastructure

The implementation represents a feature addition rather than modification of existing inference paths, explaining the absence of performance impact on core text generation metrics.
