
model : add ASR support for LFM2-Audio-1.5B #17694

Closed
tdakhran wants to merge 5 commits into ggml-org:master from Liquid4All:tarek/feat/lfm2-asr-upstream

Conversation

@tdakhran
Contributor

@tdakhran tdakhran commented Dec 2, 2025

LFM2-Audio-1.5B supports audio input and audio output.

This PR adds only ASR support. To perform ASR, invoke the CLI with:

bin/llama-mtmd-cli -m LFM2-Audio-1.5B-F32.gguf --mmproj mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio input.wav -sys "Perform ASR." -p "<__media__>"

Changes to existing code:

  • the model requires a system prompt, so -sys is enabled for llama-mtmd-cli
  • mel bin generation is reworked; it is now generated dynamically and supports different n_fft values
  • OP_SSM_CONV for the CUDA backend is extended to support kernel size 9
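For context, dynamic mel bin generation for an arbitrary n_fft follows the standard triangular-filterbank construction. A plain-Python sketch of the idea (illustrative parameter values, not the exact code from this PR):

```python
import math

def mel_filterbank(n_mel: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Triangular mel filterbank of shape (n_mel, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    n_bins = n_fft // 2 + 1
    # n_mel + 2 equally spaced points on the mel scale, mapped back to Hz
    mel_max = hz_to_mel(sample_rate / 2.0)
    hz_pts = [mel_to_hz(mel_max * i / (n_mel + 1)) for i in range(n_mel + 2)]
    # center frequency of each FFT bin
    fft_freqs = [(sample_rate / 2.0) * i / (n_bins - 1) for i in range(n_bins)]

    fb = []
    for i in range(n_mel):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        # rising slope up to the center frequency, falling slope after it
        fb.append([max(0.0, min((f - left) / (center - left),
                                (right - f) / (right - center)))
                   for f in fft_freqs])
    return fb

fb = mel_filterbank(n_mel=128, n_fft=512, sample_rate=16000)
print(len(fb), len(fb[0]))  # 128 filters over 257 FFT bins
```

Since the filters depend only on n_mel, n_fft, and the sample rate, they can be built at load time for whatever n_fft the model specifies instead of being hard-coded.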

cc: @ngxson

@tdakhran
Contributor Author

tdakhran commented Dec 2, 2025

Tested that llama-server works as intended with the input:

[
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(pathlib.Path("/data/playground/issue_400/10.wav").read_bytes()).decode("utf-8"),
                },
            },
        ],
    },
]
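This messages array becomes the body of a request to llama-server's OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch of how the JSON body is assembled (the placeholder bytes stand in for a real recording; in practice read the file, e.g. pathlib.Path("input.wav").read_bytes()):

```python
import base64
import json

# Placeholder audio bytes for the demo; replace with real WAV file contents.
wav_bytes = b"RIFF....WAVEfmt "

payload = {
    "messages": [
        {"role": "system", "content": "Perform ASR."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "format": "wav",
                        # audio travels inline, base64-encoded
                        "data": base64.b64encode(wav_bytes).decode("utf-8"),
                    },
                },
            ],
        },
    ]
}
body = json.dumps(payload)
print(body[:80])
```

POSTing this body with a Content-Type of application/json is all a client needs; a complete end-to-end example using the OpenAI Python client appears later in this thread.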

@tdakhran tdakhran changed the title model : add LFM2-Audio-1.5B support model : add ASR support for LFM2-Audio-1.5B Dec 2, 2025
@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Dec 2, 2025
@tdakhran
Contributor Author

The code is tested. I will wait for #17978 to be merged, then rebase and mark this PR as "ready for review".

@tdakhran
Contributor Author

The code is ready for review and is tested with mtmd-cli and llama-server.

python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32
python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32 --mmproj

build/bin/llama-mtmd-cli -m /data/playground/checkpoints/LFM2-Audio-1.5B/LFM2-Audio-1.5B-F32.gguf --mmproj /data/playground/checkpoints/LFM2-Audio-1.5B/mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio /data/playground/issue_400/10.wav -sys "Perform ASR." -p "<__media__>" -v

produces valid results for the attached file
10.wav

encoding audio slice...
audio slice encoded in 39 ms
decoding audio batch 1/1, n_tokens_batch = 33
audio decoded (batch 1/1) in 109 ms

I need more air. Can you increase the fan speed?

Comment on lines +114 to +126
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
Q_bias_u = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3));
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);

auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);

Q_bias_v = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3));
cb(Q_bias_v, "conformer.layers.{}.self_attn.id0", il);
p = ggml_cont(ctx0, ggml_permute(ctx0, p, 1, 2, 0, 3));
Contributor

do you think we could replace this with build_attn?

the advantage of build_attn is that it supports flash attn which can significantly improve the performance, but I'm not sure if there is currently anything missing to make it work in this case

Contributor Author

I saw some extra stuff like biases, matrix_ac, and matrix_bd; it scared me, so I followed the Python implementation as is. I will give it a second look.

Contributor Author

Looked into it: build_attn won't fit, there are too many customizations to the attention.

Comment on lines +141 to +148
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
q_len, pos_len, h,
matrix_bd->nb[1], matrix_bd->nb[2], matrix_bd->nb[0] * q_len));
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, pos_len, q_len, h);
}

matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
Contributor

A bit strange that we have these 4 reshapes/views without any permutations. Can we collapse this into a single ggml_reshape_3d?

Contributor Author

If it were a plain view, the reshapes could be simplified, but there is a crop happening inside the ggml_view_3d.
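For context, this pad/reinterpret/crop sequence is the Transformer-XL style relative-shift trick for relative-position attention scores: the offset view skips part of the flat buffer, which is why the sequence cannot collapse into a single reshape. A plain-Python sketch of the memory trick (an illustration only; ggml stores dimensions in reversed order):

```python
def rel_shift(rows: list[list[float]]) -> list[list[float]]:
    """Shift row i of a (q_len, pos_len) score matrix left by q_len - 1 - i.

    Done the same way as the graph above: pad one zero column, reinterpret
    the flat buffer one row later (the crop), then reshape back.
    """
    q_len, pos_len = len(rows), len(rows[0])
    flat = []
    for row in rows:
        flat.append(0.0)          # pad one zero column per row
        flat.extend(row)
    flat = flat[q_len:]           # the crop: skip one reinterpreted row
    return [flat[i * pos_len:(i + 1) * pos_len] for i in range(q_len)]

print(rel_shift([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]))
# → [[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]]
```

A single reshape can only reinterpret the whole buffer; the `flat[q_len:]` offset is the step that has no reshape equivalent.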

Contributor

hmm yeah interesting. not very important to optimize this, so I'll have a look later to see if there is another way

Comment on lines +209 to +211
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Contributor

we may be able to remove the transposes if conv_norm_b is already transposed upon conversion?
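The transpose / per-channel affine / transpose pattern is equivalent to applying the same affine directly along the channel rows, which is why storing the norm weights with the right orientation at conversion time could remove both transposes. A toy plain-Python sketch of the equivalence (nested lists standing in for ggml tensors):

```python
# x: (channels, time); w, b: one scale/bias per channel.
def norm_with_transposes(x, w, b):
    # mirrors the graph: transpose, affine along the last dim, transpose back
    xt = [list(col) for col in zip(*x)]                               # (time, ch)
    yt = [[v * wi + bi for v, wi, bi in zip(row, w, b)] for row in xt]
    return [list(col) for col in zip(*yt)]                            # (ch, time)

def norm_direct(x, w, b):
    # same result with the weights broadcast along the channel rows
    return [[v * wi + bi for v in row] for row, wi, bi in zip(x, w, b)]

x = [[1.0, 2.0], [3.0, 4.0]]          # 2 channels, 2 time steps
w, b = [10.0, 100.0], [0.5, 0.25]
print(norm_with_transposes(x, w, b) == norm_direct(x, w, b))  # → True
```

Whether this pays off in the actual graph depends on which layout the surrounding ops expect, hence the suggestion to fix it up at conversion time.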

Comment on lines +217 to +221
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
auto * conv_pw2_w = ggml_reshape_2d(ctx0, layer.conv_pw2_w, layer.conv_pw2_w->ne[1], layer.conv_pw2_w->ne[2]);
x = ggml_mul_mat(ctx0, conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Contributor

(I'll have a look into this.) I suspect that these 2 transposes can be removed too (or at worst, one can be a view).

Contributor Author

Many transposes here are following the Python code without optimization in mind. The objective was to get numerically close intermediates. I'll have a closer look to understand what can be optimized.

Contributor Author

Removed most of the transposes.

Comment on lines +251 to +258
cur = ggml_mul_mat(ctx0, model.mm_1_w, cur);
cur = ggml_add(ctx0, cur, model.mm_1_b);
cb(cur, "audio_adapter.model.{}", 1);
cur = ggml_gelu_erf(ctx0, cur);
cb(cur, "audio_adapter.model.{}", 2);
cur = ggml_mul_mat(ctx0, model.mm_3_w, cur);
cur = ggml_add(ctx0, cur, model.mm_3_b);
cb(cur, "audio_adapter.model.{}", 3);
Contributor

this can be replaced with build_ffn

Contributor Author

Didn't recognize it; will replace.

@tdakhran
Contributor Author

@ngxson , I addressed most of the feedback, added a comment explaining why build_attn cannot be used, removed unnecessary transposes, and simplified permutes. Applied the formatting as well.

This PR requires #18061; otherwise rope_theta won't be set.

@tdakhran tdakhran force-pushed the tarek/feat/lfm2-asr-upstream branch from 8ba4562 to ba9e597 Compare December 15, 2025 21:14
@tdakhran
Contributor Author

Rebased to incorporate #18061, now works as is.

@ngxson
Contributor

ngxson commented Dec 15, 2025

Thanks @tdakhran ! I'll do a final review tomorrow and will push commits directly here if needed.

For now, my priority will be to make sure that the GGUF is ready for any possible optimizations in the future. We can then look deeper into these optimizations in a follow-up PR (so users won't have to re-generate the GGUF)

@ngxson

This comment was marked as outdated.

@ngxson
Contributor

ngxson commented Dec 16, 2025

nevermind, I can do a follow-up PR

@ngxson
Contributor

ngxson commented Dec 16, 2025

Huh? I have no idea why GitHub doesn't allow me to merge it 😂

I will copy the commit to another PR then

[screenshot: the merge button is disabled]

@ngxson
Contributor

ngxson commented Dec 16, 2025

Superseded by #18106

@tdakhran
Contributor Author

@ngxson , my bad, I think I forgot to click "allow edits" when I created the PR.

@elfarolab

elfarolab commented Dec 29, 2025

@tdakhran

Hello Tarek,
I appreciate your work and sharing.

Please, could you share the full command line to run llama-server to do the ASR with LFM2-Audio?

I downloaded the following models:
audiodecoder-LFM2-Audio-1.5B-F16.gguf
LFM2-Audio-1.5B-F16.gguf
mmproj-audioencoder-LFM2-Audio-1.5B-F16.gguf

The correct parameters and models to run are not clear to me.
It would be much appreciated if you could share the full llama-server command line.
I am using a custom client in C; I followed your JSON schema above.
I am sending speech as WAV, 1 channel (mono), 16-bit, 22 kHz, base64 encoded, max 30 seconds duration.

Thank you so much.

@tdakhran tdakhran deleted the tarek/feat/lfm2-asr-upstream branch December 30, 2025 11:19
@tdakhran
Contributor Author

Hi @elfarolab, and thank you.

GGUFs are not yet updated in https://huggingface.co/LiquidAI/LFM2-Audio-1.5B-GGUF; they are used for LEAP and will be updated together with LEAP.

Use the latest master commit and convert the checkpoint https://huggingface.co/LiquidAI/LFM2-Audio-1.5B manually into GGUFs:

export CKPT=/data/playground/checkpoints/LFM2-Audio-1.5B
python convert_hf_to_gguf.py $CKPT
python convert_hf_to_gguf.py $CKPT --mmproj

This will create the 2 GGUFs required for ASR:

❯ (cd $CKPT && ls *.gguf)
LFM2-Audio-1.5B-BF16.gguf  mmproj-LFM2-Audio-1.5b-BF16.gguf

Now launch llama-server with the command

bin/llama-server -m $CKPT/LFM2-Audio-1.5B-BF16.gguf --mmproj $CKPT/mmproj-LFM2-Audio-1.5b-BF16.gguf

From another terminal, post a request for ASR.

import base64
import pathlib
from openai import OpenAI

wav_file = "/data/playground/issue_400/10.wav"

messages = [
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(
                        pathlib.Path(wav_file).read_bytes()
                    ).decode("utf-8"),
                },
            },
        ],
    },
]

host = "localhost"
port = 8080
client = OpenAI(base_url=f"http://{host}:{port}/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="",
    messages=messages,
)
content = resp.choices[0].message.content
print(content)

Please let me know if this works.

@elfarolab

@tdakhran

Wow wonderful detailed how-to!

It would be great to have a tutorial for ASR and TTS with llama-server in the tutorial space at:
ggml-org : tutorials

I'll follow your tutorial and report back here in a couple of hours.
Thank you so much for all the valuable help!

@elfarolab

elfarolab commented Dec 30, 2025

@tdakhran

Hello,

I followed the procedure and ran the Python script above.


The audio file

I used a WAV file: 1 channel (mono), 16-bit, 24 kHz (I know 24 kHz looks weird, but that is because it will be used later with libopus).
It is composed of a male and a female voice alternating every sentence, with the male voice starting first: M..F..M..F.
Sorry, GitHub doesn't allow attaching the .wav audio file.

mediainfo speech_orig_24000Hz.wav

General
Complete name                            : speech_orig_24000Hz.wav
Format                                   : Wave
File size                                : 506 KiB
Duration                                 : 10 s 800 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 384 kb/s
Writing application                      : Lavf62.3.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 10 s 800 ms
Bit rate mode                            : Constant
Bit rate                                 : 384 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 24.0 kHz
Bit depth                                : 16 bits
Stream size                              : 506 KiB (100%)

Results

python test_asr_direct_wav_Tarek.py
The birch canoe slid on the smooth planks. "Glue the sheet to the dark blue background", it said. "It's easy to tell the depth of a well. Four hours of steady work faced us."

It is almost perfect; it added the extra

<.. ,it said. ..>

between the female and male voices.


llama-server (router mode) debug output:

...
Dec 30 14:49:46 xyz llama-server[1717]: [47365] srv  log_server_r: request: GET /models 127.0.0.1 200
Dec 30 14:50:30 xyz llama-server[1717]: srv  proxy_reques: proxying request to model asr on port 55121
Dec 30 14:50:30 xyz llama-server[1717]: [55121] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 14:50:30 xyz llama-server[1717]: [55121] srv  params_from_: Chat format: Content-only
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot launch_slot_: id  1 | task 0 | processing task
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 155
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 14, batch.n_tokens = 14, progress = 0.090323
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | n_tokens = 14, memory_seq_rm [14, end)
Dec 30 14:50:30 xyz llama-server[1717]: [55121] srv  process_chun: processing audio...
Dec 30 14:50:30 xyz llama-server[1717]: [55121] encoding audio slice...
Dec 30 14:50:31 xyz llama-server[1717]: [55121] audio slice encoded in 856 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] decoding audio batch 1/1, n_tokens_batch = 136
Dec 30 14:50:31 xyz llama-server[1717]: [55121] audio decoded (batch 1/1) in 22 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] srv  process_chun: audio processed in 878 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 155, batch.n_tokens = 5, progress = 1.000000
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt done, n_tokens = 155, batch.n_tokens = 5
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | created context checkpoint 1 of 8 (pos_min = 149, pos_max = 149, size = 0.156 MiB)
Dec 30 14:50:32 xyz llama-server[1717]: [55121] slot print_timing: id  1 | task 0 |
Dec 30 14:50:32 xyz llama-server[1717]: [55121] prompt eval time =    1040.77 ms /   155 tokens (    6.71 ms per token,   148.93 tokens per second)
Dec 30 14:50:32 xyz llama-server[1717]: [55121]        eval time =     755.80 ms /    47 tokens (   16.08 ms per token,    62.19 tokens per second)
Dec 30 14:50:32 xyz llama-server[1717]: [55121]       total time =    1796.57 ms /   202 tokens
Dec 30 14:50:32 xyz llama-server[1717]: [55121] slot      release: id  1 | task 0 | stop processing: n_tokens = 201, truncated = 0
Dec 30 14:50:32 xyz llama-server[1717]: [55121] srv  update_slots: all slots are idle
Dec 30 14:50:32 xyz llama-server[1717]: [55121] srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Dec 30 14:50:32 xyz llama-server[1717]: srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Conclusions

I am looking into the RAM usage: LFM2-Audio-1.5B uses a lot, ~5.5 GB, maybe because of a bad model configuration?
Checking...

If you need anything else, such as running tests or trying patches, I will be very happy to help.

Thank you so much to everybody.


Full llama-server log

Dec 30 15:22:07 xyz systemd[1]: Started Llama.cpp Inference Server (GGUF models).
Dec 30 15:22:07 xyz llama-server[1368]: Starting llama-server with 16 arguments:
Dec 30 15:22:07 xyz llama-server[1368]:   '--host'
Dec 30 15:22:07 xyz llama-server[1368]:   '127.0.0.1'
Dec 30 15:22:07 xyz llama-server[1368]:   '--port'
Dec 30 15:22:07 xyz llama-server[1368]:   '8087'
Dec 30 15:22:07 xyz llama-server[1368]:   '--prio'
Dec 30 15:22:07 xyz llama-server[1368]:   '2'
Dec 30 15:22:07 xyz llama-server[1368]:   '--log-colors'
Dec 30 15:22:07 xyz llama-server[1368]:   'on'
Dec 30 15:22:07 xyz llama-server[1368]:   '--models-preset'
Dec 30 15:22:07 xyz llama-server[1368]:   '/opt/usbhd/llama.cpp/etc/models.ini'
Dec 30 15:22:07 xyz llama-server[1368]:   '--threads-http'
Dec 30 15:22:07 xyz llama-server[1368]:   '4'
Dec 30 15:22:07 xyz llama-server[1368]:   '--no-webui'
Dec 30 15:22:07 xyz llama-server[1368]:   '--offline'
Dec 30 15:22:07 xyz llama-server[1368]:   '--parallel'
Dec 30 15:22:07 xyz llama-server[1368]:   '2'
Dec 30 15:22:07 xyz llama-server[1368]: Full command:
Dec 30 15:22:07 xyz llama-server[1368]: /opt/usbhd/llama.cpp/bin/llama-server --host 127.0.0.1 --port 8087 --prio 2 --log-colors on --models-preset /opt/usbhd/llama.cpp/etc/models.ini --threads-http 4 --no-webui --offline --parallel 2
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: found 1 CUDA devices:
Dec 30 15:22:09 xyz llama-server[1368]:   Device 0: Orin, compute capability 8.7, VMM: yes
Dec 30 15:22:09 xyz llama-server[1368]: build: 17 (06705fd) with GNU 11.4.0 for Linux aarch64
Dec 30 15:22:09 xyz llama-server[1368]: system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
Dec 30 15:22:09 xyz llama-server[1368]: system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 30 15:22:09 xyz llama-server[1368]: init: using 4 threads for HTTP server
Dec 30 15:22:09 xyz llama-server[1368]: Web UI is disabled
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Loaded 0 cached model presets
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Loaded 1 custom model presets from /opt/usbhd/llama.cpp/etc/models.ini
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Available models (1) (*: custom preset)
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models:   * asr
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: (startup) loading model asr
Dec 30 15:22:10 xyz llama-server[1368]: srv          load: spawning server instance with name=asr on port 56595
Dec 30 15:22:10 xyz llama-server[1368]: srv          load: spawning server instance with args:
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/llama.cpp/bin/llama-server
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --host
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   127.0.0.1
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --log-colors
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   on
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --mlock
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --no-mmap
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --offline
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --port
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   56595
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --prio
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   2
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --temp
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   0
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --threads-http
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   4
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --no-webui
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --alias
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   asr
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --flash-attn
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   on
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --model
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --mmproj
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --n-gpu-layers
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   -1
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --parallel
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   2
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --threads
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   4
Dec 30 15:22:10 xyz llama-server[1368]: main: starting router server, no model will be loaded in this process
Dec 30 15:22:10 xyz llama-server[1368]: start: binding port with default address family
Dec 30 15:22:10 xyz llama-server[1368]: main: router server is listening on http://127.0.0.1:8087
Dec 30 15:22:10 xyz llama-server[1368]: main: NOTE: router mode is experimental
Dec 30 15:22:10 xyz llama-server[1368]: main:       it is not recommended to use this mode in untrusted environments
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: found 1 CUDA devices:
Dec 30 15:22:10 xyz llama-server[1368]: [56595]   Device 0: Orin, compute capability 8.7, VMM: yes
Dec 30 15:22:10 xyz llama-server[1368]: [56595] build: 17 (06705fd) with GNU 11.4.0 for Linux aarch64
Dec 30 15:22:10 xyz llama-server[1368]: [56595] system info: n_threads = 4, n_threads_batch = 4, total_threads = 12
Dec 30 15:22:10 xyz llama-server[1368]: [56595]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] system_info: n_threads = 4 (n_threads_batch = 4) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 30 15:22:10 xyz llama-server[1368]: [56595]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] init: using 4 threads for HTTP server
Dec 30 15:22:10 xyz llama-server[1368]: [56595] Web UI is disabled
Dec 30 15:22:10 xyz llama-server[1368]: [56595] start: binding port with default address family
Dec 30 15:22:10 xyz llama-server[1368]: [56595] main: loading model
Dec 30 15:22:10 xyz llama-server[1368]: [56595] srv    load_model: loading model '/opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf'
Dec 30 15:22:10 xyz llama-server[1368]: [56595] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit_impl: projected to use 5623 MiB of device memory vs. 29369 MiB of free device memory
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit_impl: will leave 23745 >= 1024 MiB of free device memory, no changes needed
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit: successfully fit params to free device memory
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit: fitting params to free memory took 0.83 seconds
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 29374 MiB free
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: loaded meta data with 38 key-value pairs and 148 tensors from /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf (version GGUF V3 (latest))
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   0:                       general.architecture str              = lfm2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   1:                               general.type str              = model
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   2:                               general.name str              = LFM2 Audio 1.5B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   3:                           general.basename str              = LFM2-Audio
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   5:                            general.license str              = other
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   6:                       general.license.name str              = lfm1.0
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   7:                       general.license.link str              = LICENSE
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   9:                  general.base_model.0.name str              = LFM2 1.2B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  10:          general.base_model.0.organization str              = LiquidAI
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/LiquidAI/LFM2-...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  12:                               general.tags arr[str,7]       = ["liquid", "lfm2", "audio", "lfm2-aud...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  14:                           lfm2.block_count u32              = 16
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  15:                        lfm2.context_length u32              = 128000
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  16:                      lfm2.embedding_length u32              = 2048
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  17:                   lfm2.feed_forward_length u32              = 8192
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  18:                  lfm2.attention.head_count u32              = 32
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  19:               lfm2.attention.head_count_kv arr[i32,16]      = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, ...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  20:                        lfm2.rope.freq_base f32              = 1000000.000000
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  21:      lfm2.attention.layer_norm_rms_epsilon f32              = 0.000010
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  22:                          general.file_type u32              = 32
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  23:                            lfm2.vocab_size u32              = 65536
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  24:                     lfm2.shortconv.l_cache u32              = 3
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  25:               general.quantization_version u32              = 2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = lfm2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,65536]   = ["<|pad|>", "<|startoftext|>", "<|end...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,65536]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Dec 30 15:22:10 xyz llama-server[1368]: [140B blob data]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 7
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{- bos_token -}}{%- set system_promp...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - type  f32:   55 tensors
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - type bf16:   93 tensors
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file format = GGUF V3 (latest)
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file type   = BF16
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file size   = 2.18 GiB (16.00 BPW)
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: printing all EOG tokens:
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load:   - 2 ('<|endoftext|>')
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load:   - 7 ('<|im_end|>')
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: special tokens cache size = 507
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: token to piece cache size = 0.3756 MB
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: arch             = lfm2
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: vocab_only       = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: no_alloc         = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ctx_train      = 128000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd           = 2048
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_inp       = 2048
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_layer          = 16
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_head           = 32
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_head_kv        = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, 8, 0, 8, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_rot            = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_swa            = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: is_swa_any       = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_head_k    = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_head_v    = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_gqa            = [0, 0, 4, 0, 0, 4, 0, 0, 4, 0, 4, 0, 4, 0, 4, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_k_gqa     = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_v_gqa     = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_norm_eps       = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_norm_rms_eps   = 1.0e-05
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_clamp_kqv      = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_max_alibi_bias = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_logit_scale    = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_attn_scale     = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ff             = 8192
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert         = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert_used    = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert_groups  = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_group_used     = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: causal attn      = 1
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: pooling type     = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope type        = 2
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope scaling     = linear
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: freq_base_train  = 1000000.0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: freq_scale_train = 1
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ctx_orig_yarn  = 128000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope_yarn_log_mul= 0.0000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope_finetuned   = unknown
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: model type       = 1.2B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: model params     = 1.17 B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: general.name     = LFM2 Audio 1.5B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: vocab type       = BPE
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_vocab          = 65536
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_merges         = 63683
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: BOS token        = 1 '<|startoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOS token        = 7 '<|im_end|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOT token        = 2 '<|endoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: PAD token        = 0 '<|pad|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: LF token         = 708 'Ċ'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOG token        = 2 '<|endoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOG token        = 7 '<|im_end|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: max token length = 30
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: loading model tensors, this can take a while... (mmap = false)
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloading output layer to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloading 15 repeating layers to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloaded 17/17 layers to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors:        CUDA0 model buffer size =  2232.50 MiB
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors:    CUDA_Host model buffer size =   256.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] ..................................................................
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_result: added <|endoftext|> logit bias = -inf
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_result: added <|im_end|> logit bias = -inf
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: constructing llama_context
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_seq_max     = 2
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx         = 128000
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx_seq     = 64000
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_batch       = 2048
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ubatch      = 512
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: causal_attn   = 1
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: flash_attn    = enabled
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: kv_unified    = false
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: freq_base     = 1000000.0
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: freq_scale    = 1
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx_seq (64000) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:  CUDA_Host  output buffer size =     0.50 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_kv_cache:      CUDA0 KV buffer size =  3000.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_kv_cache: size = 3000.00 MiB (128000 cells,   6 layers,  2/2 seqs), K (f16): 1500.00 MiB, V (f16): 1500.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_memory_recurrent:      CUDA0 RS buffer size =     0.31 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_memory_recurrent: size =    0.31 MiB (     2 cells,  16 layers,  2 seqs), R (f32):    0.31 MiB, S (f32):    0.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:      CUDA0 compute buffer size =   391.01 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:  CUDA_Host compute buffer size =   254.01 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: graph nodes  = 561
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: graph splits = 2
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Dec 30 15:22:38 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: model name:   LFM2 Audio 1.5B
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: description:
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: GGUF version: 3
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: alignment:    32
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: n_tensors:    650
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: n_kv:         26
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: has audio encoder
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_ctx: CLIP using CUDA0 backend
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: projector:          lfm2a
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_embd:             512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_head:             8
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_ff:               512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_layer:            17
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: ffn_op:             gelu_quick
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: projection_dim:     2048
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] --- audio hparams ---
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_mel_bins:         128
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: proj_stack_factor:  0
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_chunk_len:    1
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_sample_rate:  16000
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_n_fft:        512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_window_len:   400
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_hop_len:      160
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: model size:         437.52 MiB
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: metadata size:      0.23 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: warmup with audio size = 3000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta:      CUDA0 compute buffer size =   195.19 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta:        CPU compute buffer size =     2.93 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta: graph splits = 35, nodes = 1547
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: flash attention is enabled
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: *****************************************************************
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: WARNING: the CLIP graph uses unsupported operators by the backend
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:          the performance will be suboptimal
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:          list of unsupported ops (backend=CUDA0):
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: flash attention is enabled
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: please report this on github as an issue
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: *****************************************************************
Dec 30 15:22:43 xyz llama-server[1368]: [56595] init_audio: audio input is in experimental stage and may have reduced quality:
Dec 30 15:22:43 xyz llama-server[1368]: [56595]     https://github.com/ggml-org/llama.cpp/discussions/13759
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: loaded multimodal model, '/opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf'
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: initializing slots, n_slots = 2
Dec 30 15:22:43 xyz llama-server[1368]: [56595] slot   load_model: id  0 | task -1 | new slot, n_ctx = 64000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] slot   load_model: id  1 | task -1 | new slot, n_ctx = 64000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: prompt cache is enabled, size limit: 8192 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: use `--cache-ram 0` to disable the prompt cache
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: thinking = 0
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] load_model: chat template, chat_template: {{- bos_token -}}{%- set system_prompt = "" -%}{%- set ns = namespace(system_prompt="") -%}{%- if messages[0]["role"] == "system" -%} {%- set ns.system_prompt = messages[0]["content"] -%} {%- set messages = messages[1:] -%}{%- endif -%}{%- if tools -%} {%- set ns.system_prompt = ns.system_prompt + ("
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " if ns.system_prompt else "") + "List of tools: <|tool_list_start|>[" -%} {%- for tool in tools -%} {%- if tool is not string -%} {%- set tool = tool | tojson -%} {%- endif -%} {%- set ns.system_prompt = ns.system_prompt + tool -%} {%- if not loop.last -%} {%- set ns.system_prompt = ns.system_prompt + ", " -%} {%- endif -%} {%- endfor -%} {%- set ns.system_prompt = ns.system_prompt + "]<|tool_list_end|>" -%}{%- endif -%}{%- if ns.system_prompt -%} {{- "<|im_start|>system
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " + ns.system_prompt + "<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endif -%}{%- for message in messages -%} {{- "<|im_start|>" + message["role"] + "
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}} {%- set content = message["content"] -%} {%- if content is not string -%} {%- set content = content | tojson -%} {%- endif -%} {%- if message["role"] == "tool" -%} {%- set content = "<|tool_response_start|>" + content + "<|tool_response_end|>" -%} {%- endif -%} {{- content + "<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endfor -%}{%- if add_generation_prompt -%} {{- "<|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endif -%}, example_format: '<|im_start|>system
Dec 30 15:22:43 xyz llama-server[1368]: [56595] You are a helpful assistant<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>user
Dec 30 15:22:43 xyz llama-server[1368]: [56595] Hello<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] Hi there<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>user
Dec 30 15:22:43 xyz llama-server[1368]: [56595] How are you?<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] '
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: model loaded
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: server is listening on http://127.0.0.1:56595
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: starting the main loop...
Dec 30 15:22:43 xyz llama-server[1368]: [56595] cmd_child_to_router:ready
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv  update_slots: all slots are idle
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    operator(): child server monitoring thread started, waiting for EOF on stdin...

Dec 30 15:27:04 xyz llama-server[1368]: srv  proxy_reques: proxying request to model asr on port 56595
Dec 30 15:27:04 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:27:04 xyz llama-server[1368]: [56595] srv  params_from_: Chat format: Content-only
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot launch_slot_: id  1 | task 0 | processing task
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 155
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 14, batch.n_tokens = 14, progress = 0.090323
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | n_tokens = 14, memory_seq_rm [14, end)
Dec 30 15:27:04 xyz llama-server[1368]: [56595] srv  process_chun: processing audio...
Dec 30 15:27:04 xyz llama-server[1368]: [56595] encoding audio slice...
Dec 30 15:27:05 xyz llama-server[1368]: [56595] audio slice encoded in 732 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] decoding audio batch 1/1, n_tokens_batch = 136
Dec 30 15:27:05 xyz llama-server[1368]: [56595] audio decoded (batch 1/1) in 23 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] srv  process_chun: audio processed in 755 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 155, batch.n_tokens = 5, progress = 1.000000
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt done, n_tokens = 155, batch.n_tokens = 5
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | created context checkpoint 1 of 8 (pos_min = 149, pos_max = 149, size = 0.156 MiB)
Dec 30 15:27:06 xyz llama-server[1368]: [56595] slot print_timing: id  1 | task 0 |
Dec 30 15:27:06 xyz llama-server[1368]: [56595] prompt eval time =     920.69 ms /   155 tokens (    5.94 ms per token,   168.35 tokens per second)
Dec 30 15:27:06 xyz llama-server[1368]: [56595]        eval time =     665.46 ms /    42 tokens (   15.84 ms per token,    63.11 tokens per second)
Dec 30 15:27:06 xyz llama-server[1368]: [56595]       total time =    1586.15 ms /   197 tokens
Dec 30 15:27:06 xyz llama-server[1368]: [56595] slot      release: id  1 | task 0 | stop processing: n_tokens = 196, truncated = 0
Dec 30 15:27:06 xyz llama-server[1368]: [56595] srv  update_slots: all slots are idle
Dec 30 15:27:06 xyz llama-server[1368]: [56595] srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Dec 30 15:27:06 xyz llama-server[1368]: srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
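For reference, the request handled in the log above is an OpenAI-style `/v1/chat/completions` call carrying base64-encoded WAV audio. A minimal sketch of building that payload in Python (the `wav_bytes` placeholder and the `"asr"` model name are assumptions; in practice read the bytes from your WAV file):

```python
import base64
import json

# Build an OpenAI-style chat-completions payload with base64-encoded WAV audio.
# wav_bytes is a stand-in; normally: pathlib.Path("input.wav").read_bytes()
wav_bytes = b"RIFF....WAVEfmt "  # placeholder, not a real WAV file

payload = {
    "model": "asr",  # assumed model alias from the router config
    "messages": [
        {"role": "system", "content": "Perform ASR."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "format": "wav",
                        "data": base64.b64encode(wav_bytes).decode("utf-8"),
                    },
                },
            ],
        },
    ],
}

# Serialize for POSTing to http://127.0.0.1:<port>/v1/chat/completions
body = json.dumps(payload)
print(body[:60])
```

The system message matters here: the model requires a system prompt (e.g. "Perform ASR.") to produce a transcription.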

@tdakhran
Copy link
Contributor Author

Thanks for testing it @elfarolab. To reduce RAM usage, specify `--no-mmap`.

@elfarolab
Copy link

Thanks for testing it @elfarolab. To reduce RAM usage, specify `--no-mmap`.

It is already set... still checking. Thanks!

@tdakhran
Copy link
Contributor Author

Another possibility is context length; setting `-c 4096` explicitly may help.

@tdakhran
Copy link
Contributor Author

To reduce RAM usage, the GGUFs can also be quantized from BF16 to Q4_0 or Q8_0 using `llama-quantize`.

@elfarolab
Copy link

Another possibility is context length; setting `-c 4096` explicitly may help.

OK, now I've got ~4.2 GB, which is better, and the GGUF is not yet quantized.

my model.ini:

[asr]
load-on-startup = true
m = /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf
mm = /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf
c = 4096
;threads = 2
temp = 0
;min-p = 0.15
;presence-penalty = 1.05
fit = off
ngl = -1
fa = on
;mlock = on
mmap = off
b = 2048
ub = 2048

What about these:

temp = 0
;min-p = 0.15
;presence-penalty = 1.05

@tdakhran
Copy link
Contributor Author

Text generation for ASR with LFM2-Audio-1.5B is greedy, so the params above look good.
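Greedy decoding just picks the highest-logit token at each step, which is why `temp = 0` is the right setting and the commented-out `min-p` / `presence-penalty` values have no effect. A minimal sketch (the logits are made up for illustration; this is not the actual llama.cpp sampler chain, only the behavior it reduces to at temperature 0):

```python
def greedy_pick(logits):
    """Return the index of the largest logit (argmax) - greedy decoding."""
    best_id, best_val = 0, logits[0]
    for i, v in enumerate(logits):
        if v > best_val:
            best_id, best_val = i, v
    return best_id

# Hypothetical per-token logits for a 4-token vocabulary
logits = [0.1, 2.5, -1.0, 2.4]
print(greedy_pick(logits))  # -> 1
```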
