model : add ASR support for LFM2-Audio-1.5B #17694
tdakhran wants to merge 5 commits into ggml-org:master
Conversation
tested with

```python
[
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(pathlib.Path("/data/playground/issue_400/10.wav").read_bytes()).decode("utf-8"),
                },
            },
        ],
    },
]
```
The code is tested. I will wait for #17978 to be merged, then rebase and mark this as "ready for review".
tdakhran force-pushed from 50597aa to 5044ab6
The code is ready for review and is tested; it produces valid results for the attached file.
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
Q_bias_u = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3));
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);

auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);

Q_bias_v = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3));
cb(Q_bias_v, "conformer.layers.{}.self_attn.id0", il);
p = ggml_cont(ctx0, ggml_permute(ctx0, p, 1, 2, 0, 3));
```
do you think we could replace this with build_attn?
the advantage of build_attn is that it supports flash attention, which can significantly improve performance, but I'm not sure whether anything is currently missing to make it work in this case
I saw some extra stuff like the biases, matrix_ac, and matrix_bd; it scared me, so I followed the Python implementation as is. I'll give it a second look.
Looked into it: build_attn won't fit, there are too many customizations to the attention.
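For reference, the matrix_ac / matrix_bd split in the hunk above looks like Transformer-XL-style relative attention as used in Conformer: a content term with a learned bias u, plus a position term with a learned bias v. A minimal NumPy sketch (names and shapes are illustrative, not the PR's API):

```python
import numpy as np

def rel_attn_terms(q, k, p, bias_u, bias_v):
    """Transformer-XL-style score terms. q, k: (q_len, d); p: (pos_len, d)
    relative-position embeddings; bias_u, bias_v: (d,) learned biases.
    matrix_ac is the content term, matrix_bd the position term; matrix_bd
    is rel-shifted to align with matrix_ac before summing and softmax."""
    matrix_ac = (q + bias_u) @ k.T  # (q_len, q_len)
    matrix_bd = (q + bias_v) @ p.T  # (q_len, pos_len)
    return matrix_ac, matrix_bd

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
p = rng.standard_normal((7, 8))   # pos_len = 2 * q_len - 1
u = rng.standard_normal(8)
v = rng.standard_normal(8)
ac, bd = rel_attn_terms(q, k, p, u, v)
```

The per-head biases and the extra position matmul are exactly the customizations that a generic attention helper would need to accommodate.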
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
    matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
    matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
        q_len, pos_len, h,
        matrix_bd->nb[1], matrix_bd->nb[2], matrix_bd->nb[0] * q_len));
    matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, pos_len, q_len, h);
}

matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
```
a bit strange that we have these 4 reshapes / views without any permutation. Can we collapse this into a single ggml_reshape_3d?
If it were a plain view, the reshapes could be simplified, but there is a crop happening inside the ggml_view_3d.
hmm, yeah, interesting. Not very important to optimize this, so I'll have a look later to see if there is another way.
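The pad / reshape / crop / reshape sequence is the standard Transformer-XL "relative shift": the one-element pad plus the swapped-dimension reshape rotates each row by a different amount, which is why the crop in the middle prevents collapsing it into a single reshape. A NumPy sketch of the same trick (ggml lists ne in reverse order, so its axes appear transposed relative to this):

```python
import numpy as np

def rel_shift(x):
    """Transformer-XL relative shift, mirroring the four ggml ops above.
    x: (h, q_len, pos_len) position scores."""
    h, q_len, pos_len = x.shape
    x = np.pad(x, ((0, 0), (0, 0), (1, 0)))  # pad one zero column -> pos_len + 1
    x = x.reshape(h, pos_len + 1, q_len)     # reinterpret: rows and columns mix
    x = x[:, 1:, :]                          # the crop done by ggml_view_3d
    return x.reshape(h, q_len, pos_len)      # back to (h, q_len, pos_len)

x = np.arange(6, dtype=float).reshape(1, 2, 3)  # toy input (h=1, q_len=2, pos_len=3)
y = rel_shift(x)  # each row ends up shifted by a different offset
```

Because the crop sits between the two reshapes, the element movement is data-dependent on q_len and pos_len together, not expressible as one reshape over the original buffer.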
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
```
we may be able to remove the transposes if conv_norm_b is already transposed upon conversion?
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
auto * conv_pw2_w = ggml_reshape_2d(ctx0, layer.conv_pw2_w, layer.conv_pw2_w->ne[1], layer.conv_pw2_w->ne[2]);
x = ggml_mul_mat(ctx0, conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
```
(I'll look into this.) I suspect that these 2 transposes can be removed too (or at worst, one can be a view).
Many transposes here follow the Python code without optimization in mind; the objective was to get numerically close intermediates. I'll take a closer look to understand what can be optimized.
Removed most of the transposes.
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
cur = ggml_mul_mat(ctx0, model.mm_1_w, cur);
cur = ggml_add(ctx0, cur, model.mm_1_b);
cb(cur, "audio_adapter.model.{}", 1);
cur = ggml_gelu_erf(ctx0, cur);
cb(cur, "audio_adapter.model.{}", 2);
cur = ggml_mul_mat(ctx0, model.mm_3_w, cur);
cur = ggml_add(ctx0, cur, model.mm_3_b);
cb(cur, "audio_adapter.model.{}", 3);
```
this can be replaced with build_ffn
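For context, the adapter above is a plain linear, GELU(erf), linear stack, which is the pattern build_ffn expresses. A NumPy sketch of the dataflow (sizes and weights are made up purely for illustration):

```python
import math

import numpy as np

def gelu_erf(x):
    # exact ("erf") GELU, the variant ggml_gelu_erf computes
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def audio_adapter(x, w1, b1, w3, b3):
    """Two-layer MLP: linear -> GELU(erf) -> linear, as in the hunk above."""
    h = x @ w1.T + b1
    h = gelu_erf(h)
    return h @ w3.T + b3

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))            # (tokens, d_enc), made-up sizes
w1 = rng.standard_normal((32, 16))
b1 = rng.standard_normal(32)
w3 = rng.standard_normal((8, 32))
b3 = rng.standard_normal(8)
y = audio_adapter(x, w1, b1, w3, b3)        # (tokens, d_model)
```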
Didn't recognize it; I will replace it.
tdakhran force-pushed from 8ba4562 to ba9e597
Rebased to incorporate #18061; it now works as is.
Thanks @tdakhran! I'll do a final review tomorrow and will push commits directly here if needed. For now, my priority is to make sure that the GGUF is ready for any possible future optimizations. We can then look deeper into these optimizations in a follow-up PR (so users won't have to re-generate the GGUF).
This comment was marked as outdated.
Never mind, I can do a follow-up PR.
Superseded by #18106 |
@ngxson, my bad, I think I forgot to click "allow edits" when I created the PR.
Hello Tarek, please could you share the full command line to run llama-server for ASR with LFM2-Audio? I downloaded the following models:

The correct parameters and which models to run are not clear to me. Thank you so much.
Hi @elfarolab, and thank you. The GGUFs are not yet updated in https://huggingface.co/LiquidAI/LFM2-Audio-1.5B-GGUF; they are used for LEAP and will be updated together with LEAP. Use the latest master commit. Converting the model will create the 2 GGUFs required for ASR. Now launch llama-server, and from another terminal post a request for ASR. Please let me know if this works.
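As a rough illustration of the "post a request" step (not from this PR: the endpoint path, port, and system prompt are assumptions based on llama-server's OpenAI-compatible chat API), an ASR client could look like:

```python
import base64
import json
import pathlib
import urllib.request

def build_payload(wav_bytes: bytes) -> dict:
    """Wrap raw wav bytes into an input_audio chat message."""
    return {
        "messages": [
            {"role": "system", "content": "Perform ASR."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "format": "wav",
                            "data": base64.b64encode(wav_bytes).decode("utf-8"),
                        },
                    },
                ],
            },
        ],
    }

def transcribe(wav_path: str, url: str = "http://localhost:8080/v1/chat/completions") -> str:
    """Post the wav to a locally running llama-server and return the text."""
    payload = build_payload(pathlib.Path(wav_path).read_bytes())
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(transcribe("sample.wav"))  # placeholder path
```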
Wow, wonderful detailed how-to! It would be great to have a tutorial for ASR and TTS with llama-server in the tutorial space. I'll follow your tutorial; in a couple of hours I'll report back here.
Hello, I followed the procedure and ran the Python script above.

**The audio file**: I used a .WAV, 1 channel (mono), 16-bit, 24 kHz (I know 24 kHz looks weird, but that is because it will be used later with libopus).

**Results**: It is almost perfect; it added something extra between the female and male voices.

**llama-server (router mode) debug output**:

**Conclusions**: I am looking into the RAM usage; LFM2-Audio-1.5B uses a lot, ~5.5 GB, maybe because of a bad model configuration? If you need anything else, such as running tests or trying patches, I will be very happy to help. Thank you so much to everybody.

Full llama-server log
Thanks for testing it, @elfarolab. To reduce RAM usage, specify:
It is already set... still checking. Thanks!
Another possibility is the context length; maybe try setting:
To reduce RAM usage, the GGUFs can be quantized using:
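As a back-of-envelope check on why quantization helps (approximation only; this ignores the KV cache, compute buffers, and the mmproj encoder): F16 stores 16 bits per weight, while Q8_0 stores blocks of 32 int8 weights plus one f16 scale, i.e. about 8.5 bits per weight.

```python
def approx_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint: parameters * bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 1.5e9                           # approximate parameter count of a 1.5B model
f16 = approx_weight_gib(N, 16.0)    # plain half precision, ~2.8 GiB
q8 = approx_weight_gib(N, 8.5)      # Q8_0: 32 int8 weights + one f16 scale per block, ~1.5 GiB
```

So Q8_0 roughly halves the weight footprint relative to F16, which is consistent with the RAM drop reported above.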
OK, now I've got ~4.2 GB, which is better, with the GGUF not yet quantized. My model.ini:

What about these:
Text generation for ASR with LFM2-Audio-1.5B is greedy; the params above look good.

LFM2-Audio-1.5B supports audio input and audio output.
This PR adds only ASR support. To perform ASR, invoke the CLI with:
Changes to existing code:

- `-sys` enabled for `llama-mtmd-cli`
- `n_fft` values
- `OP_SSM_CONV` for the CUDA backend is extended to support kernel size 9

cc: @ngxson