
model : add ASR support for LFM2-Audio-1.5B #17694

Closed
tdakhran wants to merge 5 commits into ggml-org:master from Liquid4All:tarek/feat/lfm2-asr-upstream

Conversation

@tdakhran
Contributor

@tdakhran tdakhran commented Dec 2, 2025

LFM2-Audio-1.5B supports audio input and audio output.

This PR adds only ASR support. To perform ASR, invoke the CLI with:

bin/llama-mtmd-cli -m LFM2-Audio-1.5B-F32.gguf --mmproj mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio input.wav -sys "Perform ASR." -p "<__media__>"

Changes to existing code:

  • the model requires a system prompt, so -sys is enabled for llama-mtmd-cli
  • mel bin generation is reworked; it is now generated dynamically and supports different n_fft values
  • OP_SSM_CONV for the CUDA backend is extended to support kernel size 9
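For context, dynamic mel bin generation for an arbitrary n_fft follows the standard triangular-filterbank construction. A plain-Python sketch of the idea (illustrative parameter values, not the exact code from this PR):

```python
import math

def mel_filterbank(n_mel: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Triangular mel filterbank of shape (n_mel, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * math.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    n_bins = n_fft // 2 + 1
    # n_mel + 2 equally spaced points on the mel scale, mapped back to Hz
    mel_max = hz_to_mel(sample_rate / 2.0)
    hz_pts = [mel_to_hz(mel_max * i / (n_mel + 1)) for i in range(n_mel + 2)]
    # center frequency of each FFT bin
    fft_freqs = [(sample_rate / 2.0) * i / (n_bins - 1) for i in range(n_bins)]

    fb = []
    for i in range(n_mel):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        # rising slope up to the center frequency, falling slope after it
        fb.append([max(0.0, min((f - left) / (center - left),
                                (right - f) / (right - center)))
                   for f in fft_freqs])
    return fb

fb = mel_filterbank(n_mel=128, n_fft=512, sample_rate=16000)
print(len(fb), len(fb[0]))  # 128 filters over 257 FFT bins
```

Since the filters depend only on n_mel, n_fft, and the sample rate, they can be built at load time for whatever n_fft the model specifies instead of being hard-coded.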

cc: @ngxson

@tdakhran
Contributor Author

tdakhran commented Dec 2, 2025

Tested that llama-server works as intended with the input:

[
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(pathlib.Path("/data/playground/issue_400/10.wav").read_bytes()).decode("utf-8"),
                },
            },
        ],
    },
]
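This messages array becomes the body of a request to llama-server's OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch of how the JSON body is assembled (the placeholder bytes stand in for a real recording; in practice read the file, e.g. pathlib.Path("input.wav").read_bytes()):

```python
import base64
import json

# Placeholder audio bytes for the demo; replace with real WAV file contents.
wav_bytes = b"RIFF....WAVEfmt "

payload = {
    "messages": [
        {"role": "system", "content": "Perform ASR."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "format": "wav",
                        # audio travels inline, base64-encoded
                        "data": base64.b64encode(wav_bytes).decode("utf-8"),
                    },
                },
            ],
        },
    ]
}
body = json.dumps(payload)
print(body[:80])
```

POSTing this body with a Content-Type of application/json is all a client needs; a complete end-to-end example using the OpenAI Python client appears later in this thread.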

@tdakhran tdakhran changed the title model : add LFM2-Audio-1.5B support model : add ASR support for LFM2-Audio-1.5B Dec 2, 2025
@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Dec 2, 2025
@tdakhran
Contributor Author

The code is tested. I will wait for #17978 to be merged, then rebase and mark this PR as "ready for review".

@tdakhran
Contributor Author

The code is ready for review and is tested with mtmd-cli and llama-server.

python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32
python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32 --mmproj

build/bin/llama-mtmd-cli -m /data/playground/checkpoints/LFM2-Audio-1.5B/LFM2-Audio-1.5B-F32.gguf --mmproj /data/playground/checkpoints/LFM2-Audio-1.5B/mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio /data/playground/issue_400/10.wav -sys "Perform ASR." -p "<__media__>" -v

produces valid results for the attached file
10.wav

encoding audio slice...
audio slice encoded in 39 ms
decoding audio batch 1/1, n_tokens_batch = 33
audio decoded (batch 1/1) in 109 ms

I need more air. Can you increase the fan speed?

Comment on lines +114 to +126
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
Q_bias_u = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3));
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);

auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);

Q_bias_v = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3));
cb(Q_bias_v, "conformer.layers.{}.self_attn.id0", il);
p = ggml_cont(ctx0, ggml_permute(ctx0, p, 1, 2, 0, 3));
Contributor

do you think we could replace this with build_attn?

the advantage of build_attn is that it supports flash attn which can significantly improve the performance, but I'm not sure if there is currently anything missing to make it work in this case

Contributor Author

I saw some extra stuff like biases, matrix_ac, and matrix_bd; it scared me, so I followed the Python implementation as is. I will give it a second look.

Contributor Author

Looked into it: build_attn won't fit, there are too many customizations to the attention.

Comment on lines +141 to +148
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
q_len, pos_len, h,
matrix_bd->nb[1], matrix_bd->nb[2], matrix_bd->nb[0] * q_len));
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, pos_len, q_len, h);
}

matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
Contributor

A bit strange that we have these 4 reshapes/views without any permutations. Can we collapse this into a single ggml_reshape_3d?

Contributor Author

If it were a plain view, the reshapes could be simplified, but there is a crop happening inside the ggml_view_3d.
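For context, this pad/reinterpret/crop sequence is the Transformer-XL style relative-shift trick for relative-position attention scores: the offset view skips part of the flat buffer, which is why the sequence cannot collapse into a single reshape. A plain-Python sketch of the memory trick (an illustration only; ggml stores dimensions in reversed order):

```python
def rel_shift(rows: list[list[float]]) -> list[list[float]]:
    """Shift row i of a (q_len, pos_len) score matrix left by q_len - 1 - i.

    Done the same way as the graph above: pad one zero column, reinterpret
    the flat buffer one row later (the crop), then reshape back.
    """
    q_len, pos_len = len(rows), len(rows[0])
    flat = []
    for row in rows:
        flat.append(0.0)          # pad one zero column per row
        flat.extend(row)
    flat = flat[q_len:]           # the crop: skip one reinterpreted row
    return [flat[i * pos_len:(i + 1) * pos_len] for i in range(q_len)]

print(rel_shift([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]))
# → [[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]]
```

A single reshape can only reinterpret the whole buffer; the `flat[q_len:]` offset is the step that has no reshape equivalent.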

Contributor

hmm yeah interesting. not very important to optimize this, so I'll have a look later to see if there is another way

Comment on lines +209 to +211
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Contributor

we may be able to remove the transposes if conv_norm_b is already transposed upon conversion?
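The transpose / per-channel affine / transpose pattern is equivalent to applying the same affine directly along the channel rows, which is why storing the norm weights with the right orientation at conversion time could remove both transposes. A toy plain-Python sketch of the equivalence (nested lists standing in for ggml tensors):

```python
# x: (channels, time); w, b: one scale/bias per channel.
def norm_with_transposes(x, w, b):
    # mirrors the graph: transpose, affine along the last dim, transpose back
    xt = [list(col) for col in zip(*x)]                               # (time, ch)
    yt = [[v * wi + bi for v, wi, bi in zip(row, w, b)] for row in xt]
    return [list(col) for col in zip(*yt)]                            # (ch, time)

def norm_direct(x, w, b):
    # same result with the weights broadcast along the channel rows
    return [[v * wi + bi for v in row] for row, wi, bi in zip(x, w, b)]

x = [[1.0, 2.0], [3.0, 4.0]]          # 2 channels, 2 time steps
w, b = [10.0, 100.0], [0.5, 0.25]
print(norm_with_transposes(x, w, b) == norm_direct(x, w, b))  # → True
```

Whether this pays off in the actual graph depends on which layout the surrounding ops expect, hence the suggestion to fix it up at conversion time.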

Comment on lines +217 to +221
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
auto * conv_pw2_w = ggml_reshape_2d(ctx0, layer.conv_pw2_w, layer.conv_pw2_w->ne[1], layer.conv_pw2_w->ne[2]);
x = ggml_mul_mat(ctx0, conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Contributor

(I'll have a look into this.) I suspect that these 2 transposes can be removed too (or at worst, one can be a view).

Contributor Author

Many transposes here are following the Python code without optimization in mind. The objective was to get numerically close intermediates. I'll have a closer look to understand what can be optimized.

Contributor Author

Removed most of the transposes.

Comment on lines +251 to +258
cur = ggml_mul_mat(ctx0, model.mm_1_w, cur);
cur = ggml_add(ctx0, cur, model.mm_1_b);
cb(cur, "audio_adapter.model.{}", 1);
cur = ggml_gelu_erf(ctx0, cur);
cb(cur, "audio_adapter.model.{}", 2);
cur = ggml_mul_mat(ctx0, model.mm_3_w, cur);
cur = ggml_add(ctx0, cur, model.mm_3_b);
cb(cur, "audio_adapter.model.{}", 3);
Contributor

this can be replaced with build_ffn

Contributor Author

Didn't recognize it; will replace.

@tdakhran
Contributor Author

@ngxson , I addressed most of the feedback, added a comment explaining why build_attn cannot be used, removed unnecessary transposes, and simplified permutes. Applied the formatting as well.

This PR requires #18061; otherwise rope_theta won't be set.

@tdakhran tdakhran force-pushed the tarek/feat/lfm2-asr-upstream branch from 8ba4562 to ba9e597 Compare December 15, 2025 21:14
@tdakhran
Contributor Author

Rebased to incorporate #18061, now works as is.

@ngxson
Contributor

ngxson commented Dec 15, 2025

Thanks @tdakhran ! I'll do a final review tomorrow and will push commits directly here if needed.

For now, my priority will be to make sure that the GGUF is ready for any possible optimizations in the future. We can then look deeper into these optimizations in a follow-up PR (so users won't have to re-generate the GGUF)

@ngxson

This comment was marked as outdated.

@ngxson
Contributor

ngxson commented Dec 16, 2025

nevermind, I can do a follow-up PR

@ngxson
Contributor

ngxson commented Dec 16, 2025

Huh? I have no idea why GitHub doesn't allow me to merge it 😂

I will copy the commit to another PR then

[screenshot: the merge button is disabled]

@ngxson
Contributor

ngxson commented Dec 16, 2025

Superseded by #18106

@tdakhran
Contributor Author

@ngxson , my bad, I think I forgot to click "allow edits" when I created the PR.

@elfarolab

elfarolab commented Dec 29, 2025

@tdakhran

Hello Tarek,
I appreciate your work and sharing.

Please, could you share the full command line to run llama-server to do the ASR with LFM2-Audio?

I downloaded the following models:
audiodecoder-LFM2-Audio-1.5B-F16.gguf
LFM2-Audio-1.5B-F16.gguf
mmproj-audioencoder-LFM2-Audio-1.5B-F16.gguf

The correct parameters and models to run are not clear to me.
It would be much appreciated if you could share the full llama-server command line.
I am using a custom client in C; I followed your JSON schema above.
I am sending speech as WAV, 1 channel (mono), 16-bit, 22 kHz, base64 encoded, max 30 seconds duration.

Thank you so much.

@tdakhran tdakhran deleted the tarek/feat/lfm2-asr-upstream branch December 30, 2025 11:19
@tdakhran
Contributor Author

Hi @elfarolab, and thank you.

GGUFs are not yet updated in https://huggingface.co/LiquidAI/LFM2-Audio-1.5B-GGUF; they are used for LEAP and will be updated together with LEAP.

Use the latest master commit and convert the checkpoint https://huggingface.co/LiquidAI/LFM2-Audio-1.5B manually into GGUFs:

export CKPT=/data/playground/checkpoints/LFM2-Audio-1.5B
python convert_hf_to_gguf.py $CKPT
python convert_hf_to_gguf.py $CKPT --mmproj

This will create the 2 GGUFs required for ASR:

❯ (cd $CKPT && ls *.gguf)
LFM2-Audio-1.5B-BF16.gguf  mmproj-LFM2-Audio-1.5b-BF16.gguf

Now launch llama-server with the command

bin/llama-server -m $CKPT/LFM2-Audio-1.5B-BF16.gguf --mmproj $CKPT/mmproj-LFM2-Audio-1.5b-BF16.gguf

From another terminal, post a request for ASR.

import base64
import pathlib
from openai import OpenAI

wav_file = "/data/playground/issue_400/10.wav"

messages = [
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(
                        pathlib.Path(wav_file).read_bytes()
                    ).decode("utf-8"),
                },
            },
        ],
    },
]

host = "localhost"
port = 8080
client = OpenAI(base_url=f"http://{host}:{port}/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="",
    messages=messages,
)
content = resp.choices[0].message.content
print(content)

Please let me know if this works.

@elfarolab

@tdakhran

Wow wonderful detailed how-to!

It would be great to have a tutorial for ASR and TTS with llama-server in the tutorial space at:
ggml-org : tutorials

I'll follow your tutorial and report back here in a couple of hours.
Thank you so much for all the valuable help!

@elfarolab

elfarolab commented Dec 30, 2025

@tdakhran

Hello,

I followed the procedure and ran the Python script above.


The audio file

I used a WAV file: 1 channel (mono), 16-bit, 24 kHz (I know 24 kHz looks weird, but that is because it will be used later with libopus).
It is composed of a male and a female voice alternating every sentence, with the male voice starting first: M..F..M..F.
Sorry, GitHub doesn't allow attaching the .wav audio file.

mediainfo speech_orig_24000Hz.wav

General
Complete name                            : speech_orig_24000Hz.wav
Format                                   : Wave
File size                                : 506 KiB
Duration                                 : 10 s 800 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 384 kb/s
Writing application                      : Lavf62.3.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 10 s 800 ms
Bit rate mode                            : Constant
Bit rate                                 : 384 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 24.0 kHz
Bit depth                                : 16 bits
Stream size                              : 506 KiB (100%)

Results

python test_asr_direct_wav_Tarek.py
The birch canoe slid on the smooth planks. "Glue the sheet to the dark blue background", it said. "It's easy to tell the depth of a well. Four hours of steady work faced us."

It is almost perfect; it added the extra

<.. ,it said. ..>

between the female and male voices.


llama-server (router mode) debug output:

...
Dec 30 14:49:46 xyz llama-server[1717]: [47365] srv  log_server_r: request: GET /models 127.0.0.1 200
Dec 30 14:50:30 xyz llama-server[1717]: srv  proxy_reques: proxying request to model asr on port 55121
Dec 30 14:50:30 xyz llama-server[1717]: [55121] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 14:50:30 xyz llama-server[1717]: [55121] srv  params_from_: Chat format: Content-only
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot launch_slot_: id  1 | task 0 | processing task
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 155
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 14, batch.n_tokens = 14, progress = 0.090323
Dec 30 14:50:30 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | n_tokens = 14, memory_seq_rm [14, end)
Dec 30 14:50:30 xyz llama-server[1717]: [55121] srv  process_chun: processing audio...
Dec 30 14:50:30 xyz llama-server[1717]: [55121] encoding audio slice...
Dec 30 14:50:31 xyz llama-server[1717]: [55121] audio slice encoded in 856 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] decoding audio batch 1/1, n_tokens_batch = 136
Dec 30 14:50:31 xyz llama-server[1717]: [55121] audio decoded (batch 1/1) in 22 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] srv  process_chun: audio processed in 878 ms
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 155, batch.n_tokens = 5, progress = 1.000000
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | prompt done, n_tokens = 155, batch.n_tokens = 5
Dec 30 14:50:31 xyz llama-server[1717]: [55121] slot update_slots: id  1 | task 0 | created context checkpoint 1 of 8 (pos_min = 149, pos_max = 149, size = 0.156 MiB)
Dec 30 14:50:32 xyz llama-server[1717]: [55121] slot print_timing: id  1 | task 0 |
Dec 30 14:50:32 xyz llama-server[1717]: [55121] prompt eval time =    1040.77 ms /   155 tokens (    6.71 ms per token,   148.93 tokens per second)
Dec 30 14:50:32 xyz llama-server[1717]: [55121]        eval time =     755.80 ms /    47 tokens (   16.08 ms per token,    62.19 tokens per second)
Dec 30 14:50:32 xyz llama-server[1717]: [55121]       total time =    1796.57 ms /   202 tokens
Dec 30 14:50:32 xyz llama-server[1717]: [55121] slot      release: id  1 | task 0 | stop processing: n_tokens = 201, truncated = 0
Dec 30 14:50:32 xyz llama-server[1717]: [55121] srv  update_slots: all slots are idle
Dec 30 14:50:32 xyz llama-server[1717]: [55121] srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Dec 30 14:50:32 xyz llama-server[1717]: srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Conclusions

I am looking into the RAM usage: LFM2-Audio-1.5B uses a lot, ~5.5 GB, maybe because of a bad model configuration?
Checking...

If you need anything else, such as running tests or trying patches, I will be very happy to help.

Thank you so much to everybody.


Full llama-server log

Dec 30 15:22:07 xyz systemd[1]: Started Llama.cpp Inference Server (GGUF models).
Dec 30 15:22:07 xyz llama-server[1368]: Starting llama-server with 16 arguments:
Dec 30 15:22:07 xyz llama-server[1368]:   '--host'
Dec 30 15:22:07 xyz llama-server[1368]:   '127.0.0.1'
Dec 30 15:22:07 xyz llama-server[1368]:   '--port'
Dec 30 15:22:07 xyz llama-server[1368]:   '8087'
Dec 30 15:22:07 xyz llama-server[1368]:   '--prio'
Dec 30 15:22:07 xyz llama-server[1368]:   '2'
Dec 30 15:22:07 xyz llama-server[1368]:   '--log-colors'
Dec 30 15:22:07 xyz llama-server[1368]:   'on'
Dec 30 15:22:07 xyz llama-server[1368]:   '--models-preset'
Dec 30 15:22:07 xyz llama-server[1368]:   '/opt/usbhd/llama.cpp/etc/models.ini'
Dec 30 15:22:07 xyz llama-server[1368]:   '--threads-http'
Dec 30 15:22:07 xyz llama-server[1368]:   '4'
Dec 30 15:22:07 xyz llama-server[1368]:   '--no-webui'
Dec 30 15:22:07 xyz llama-server[1368]:   '--offline'
Dec 30 15:22:07 xyz llama-server[1368]:   '--parallel'
Dec 30 15:22:07 xyz llama-server[1368]:   '2'
Dec 30 15:22:07 xyz llama-server[1368]: Full command:
Dec 30 15:22:07 xyz llama-server[1368]: /opt/usbhd/llama.cpp/bin/llama-server --host 127.0.0.1 --port 8087 --prio 2 --log-colors on --models-preset /opt/usbhd/llama.cpp/etc/models.ini --threads-http 4 --no-webui --offline --parallel 2
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
Dec 30 15:22:09 xyz llama-server[1368]: ggml_cuda_init: found 1 CUDA devices:
Dec 30 15:22:09 xyz llama-server[1368]:   Device 0: Orin, compute capability 8.7, VMM: yes
Dec 30 15:22:09 xyz llama-server[1368]: build: 17 (06705fd) with GNU 11.4.0 for Linux aarch64
Dec 30 15:22:09 xyz llama-server[1368]: system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
Dec 30 15:22:09 xyz llama-server[1368]: system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 30 15:22:09 xyz llama-server[1368]: init: using 4 threads for HTTP server
Dec 30 15:22:09 xyz llama-server[1368]: Web UI is disabled
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Loaded 0 cached model presets
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Loaded 1 custom model presets from /opt/usbhd/llama.cpp/etc/models.ini
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: Available models (1) (*: custom preset)
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models:   * asr
Dec 30 15:22:10 xyz llama-server[1368]: srv   load_models: (startup) loading model asr
Dec 30 15:22:10 xyz llama-server[1368]: srv          load: spawning server instance with name=asr on port 56595
Dec 30 15:22:10 xyz llama-server[1368]: srv          load: spawning server instance with args:
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/llama.cpp/bin/llama-server
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --host
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   127.0.0.1
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --log-colors
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   on
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --mlock
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --no-mmap
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --offline
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --port
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   56595
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --prio
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   2
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --temp
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   0
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --threads-http
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   4
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --no-webui
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --alias
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   asr
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --flash-attn
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   on
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --model
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --mmproj
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --n-gpu-layers
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   -1
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --parallel
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   2
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   --threads
Dec 30 15:22:10 xyz llama-server[1368]: srv          load:   4
Dec 30 15:22:10 xyz llama-server[1368]: main: starting router server, no model will be loaded in this process
Dec 30 15:22:10 xyz llama-server[1368]: start: binding port with default address family
Dec 30 15:22:10 xyz llama-server[1368]: main: router server is listening on http://127.0.0.1:8087
Dec 30 15:22:10 xyz llama-server[1368]: main: NOTE: router mode is experimental
Dec 30 15:22:10 xyz llama-server[1368]: main:       it is not recommended to use this mode in untrusted environments
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
Dec 30 15:22:10 xyz llama-server[1368]: [56595] ggml_cuda_init: found 1 CUDA devices:
Dec 30 15:22:10 xyz llama-server[1368]: [56595]   Device 0: Orin, compute capability 8.7, VMM: yes
Dec 30 15:22:10 xyz llama-server[1368]: [56595] build: 17 (06705fd) with GNU 11.4.0 for Linux aarch64
Dec 30 15:22:10 xyz llama-server[1368]: [56595] system info: n_threads = 4, n_threads_batch = 4, total_threads = 12
Dec 30 15:22:10 xyz llama-server[1368]: [56595]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] system_info: n_threads = 4 (n_threads_batch = 4) / 12 | CUDA : ARCHS = 870 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Dec 30 15:22:10 xyz llama-server[1368]: [56595]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] init: using 4 threads for HTTP server
Dec 30 15:22:10 xyz llama-server[1368]: [56595] Web UI is disabled
Dec 30 15:22:10 xyz llama-server[1368]: [56595] start: binding port with default address family
Dec 30 15:22:10 xyz llama-server[1368]: [56595] main: loading model
Dec 30 15:22:10 xyz llama-server[1368]: [56595] srv    load_model: loading model '/opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf'
Dec 30 15:22:10 xyz llama-server[1368]: [56595] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit_impl: projected to use 5623 MiB of device memory vs. 29369 MiB of free device memory
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit_impl: will leave 23745 >= 1024 MiB of free device memory, no changes needed
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit: successfully fit params to free device memory
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_params_fit: fitting params to free memory took 0.83 seconds
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 29374 MiB free
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: loaded meta data with 38 key-value pairs and 148 tensors from /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf (version GGUF V3 (latest))
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   0:                       general.architecture str              = lfm2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   1:                               general.type str              = model
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   2:                               general.name str              = LFM2 Audio 1.5B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   3:                           general.basename str              = LFM2-Audio
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   5:                            general.license str              = other
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   6:                       general.license.name str              = lfm1.0
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   7:                       general.license.link str              = LICENSE
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv   9:                  general.base_model.0.name str              = LFM2 1.2B
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  10:          general.base_model.0.organization str              = LiquidAI
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/LiquidAI/LFM2-...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  12:                               general.tags arr[str,7]       = ["liquid", "lfm2", "audio", "lfm2-aud...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  14:                           lfm2.block_count u32              = 16
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  15:                        lfm2.context_length u32              = 128000
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  16:                      lfm2.embedding_length u32              = 2048
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  17:                   lfm2.feed_forward_length u32              = 8192
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  18:                  lfm2.attention.head_count u32              = 32
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  19:               lfm2.attention.head_count_kv arr[i32,16]      = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, ...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  20:                        lfm2.rope.freq_base f32              = 1000000.000000
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  21:      lfm2.attention.layer_norm_rms_epsilon f32              = 0.000010
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  22:                          general.file_type u32              = 32
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  23:                            lfm2.vocab_size u32              = 65536
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  24:                     lfm2.shortconv.l_cache u32              = 3
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  25:               general.quantization_version u32              = 2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = lfm2
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,65536]   = ["<|pad|>", "<|startoftext|>", "<|end...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,65536]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Dec 30 15:22:10 xyz llama-server[1368]: [140B blob data]
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 7
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{- bos_token -}}{%- set system_promp...
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - type  f32:   55 tensors
Dec 30 15:22:10 xyz llama-server[1368]: [56595] llama_model_loader: - type bf16:   93 tensors
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file format = GGUF V3 (latest)
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file type   = BF16
Dec 30 15:22:10 xyz llama-server[1368]: [56595] print_info: file size   = 2.18 GiB (16.00 BPW)
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: printing all EOG tokens:
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load:   - 2 ('<|endoftext|>')
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load:   - 7 ('<|im_end|>')
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: special tokens cache size = 507
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load: token to piece cache size = 0.3756 MB
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: arch             = lfm2
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: vocab_only       = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: no_alloc         = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ctx_train      = 128000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd           = 2048
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_inp       = 2048
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_layer          = 16
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_head           = 32
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_head_kv        = [0, 0, 8, 0, 0, 8, 0, 0, 8, 0, 8, 0, 8, 0, 8, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_rot            = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_swa            = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: is_swa_any       = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_head_k    = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_head_v    = 64
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_gqa            = [0, 0, 4, 0, 0, 4, 0, 0, 4, 0, 4, 0, 4, 0, 4, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_k_gqa     = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_embd_v_gqa     = [0, 0, 512, 0, 0, 512, 0, 0, 512, 0, 512, 0, 512, 0, 512, 0]
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_norm_eps       = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_norm_rms_eps   = 1.0e-05
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_clamp_kqv      = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_max_alibi_bias = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_logit_scale    = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: f_attn_scale     = 0.0e+00
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ff             = 8192
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert         = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert_used    = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_expert_groups  = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_group_used     = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: causal attn      = 1
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: pooling type     = 0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope type        = 2
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope scaling     = linear
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: freq_base_train  = 1000000.0
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: freq_scale_train = 1
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_ctx_orig_yarn  = 128000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope_yarn_log_mul= 0.0000
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: rope_finetuned   = unknown
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: model type       = 1.2B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: model params     = 1.17 B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: general.name     = LFM2 Audio 1.5B
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: vocab type       = BPE
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_vocab          = 65536
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: n_merges         = 63683
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: BOS token        = 1 '<|startoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOS token        = 7 '<|im_end|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOT token        = 2 '<|endoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: PAD token        = 0 '<|pad|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: LF token         = 708 'Ċ'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOG token        = 2 '<|endoftext|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: EOG token        = 7 '<|im_end|>'
Dec 30 15:22:11 xyz llama-server[1368]: [56595] print_info: max token length = 30
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: loading model tensors, this can take a while... (mmap = false)
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloading output layer to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloading 15 repeating layers to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors: offloaded 17/17 layers to GPU
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors:        CUDA0 model buffer size =  2232.50 MiB
Dec 30 15:22:11 xyz llama-server[1368]: [56595] load_tensors:    CUDA_Host model buffer size =   256.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] ..................................................................
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_result: added <|endoftext|> logit bias = -inf
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_result: added <|im_end|> logit bias = -inf
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: constructing llama_context
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_seq_max     = 2
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx         = 128000
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx_seq     = 64000
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_batch       = 2048
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ubatch      = 512
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: causal_attn   = 1
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: flash_attn    = enabled
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: kv_unified    = false
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: freq_base     = 1000000.0
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: freq_scale    = 1
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: n_ctx_seq (64000) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:  CUDA_Host  output buffer size =     0.50 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_kv_cache:      CUDA0 KV buffer size =  3000.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_kv_cache: size = 3000.00 MiB (128000 cells,   6 layers,  2/2 seqs), K (f16): 1500.00 MiB, V (f16): 1500.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_memory_recurrent:      CUDA0 RS buffer size =     0.31 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_memory_recurrent: size =    0.31 MiB (     2 cells,  16 layers,  2 seqs), R (f32):    0.31 MiB, S (f32):    0.00 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:      CUDA0 compute buffer size =   391.01 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context:  CUDA_Host compute buffer size =   254.01 MiB
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: graph nodes  = 561
Dec 30 15:22:37 xyz llama-server[1368]: [56595] llama_context: graph splits = 2
Dec 30 15:22:37 xyz llama-server[1368]: [56595] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Dec 30 15:22:38 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: model name:   LFM2 Audio 1.5B
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: description:
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: GGUF version: 3
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: alignment:    32
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: n_tensors:    650
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: n_kv:         26
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_model_loader: has audio encoder
Dec 30 15:22:38 xyz llama-server[1368]: [56595] clip_ctx: CLIP using CUDA0 backend
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: projector:          lfm2a
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_embd:             512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_head:             8
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_ff:               512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_layer:            17
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: ffn_op:             gelu_quick
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: projection_dim:     2048
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] --- audio hparams ---
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: n_mel_bins:         128
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: proj_stack_factor:  0
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_chunk_len:    1
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_sample_rate:  16000
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_n_fft:        512
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_window_len:   400
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: audio_hop_len:      160
Dec 30 15:22:38 xyz llama-server[1368]: [56595]
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: model size:         437.52 MiB
Dec 30 15:22:38 xyz llama-server[1368]: [56595] load_hparams: metadata size:      0.23 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: warmup with audio size = 3000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta:      CUDA0 compute buffer size =   195.19 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta:        CPU compute buffer size =     2.93 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] alloc_compute_meta: graph splits = 35, nodes = 1547
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: flash attention is enabled
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: *****************************************************************
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: WARNING: the CLIP graph uses unsupported operators by the backend
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:          the performance will be suboptimal
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:          list of unsupported ops (backend=CUDA0):
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup:            UNARY: type = f32, ne = [512 375 1 1]
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: flash attention is enabled
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: please report this on github as an issue
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: ref: https://github.com/ggml-org/llama.cpp/pull/16837#issuecomment-3461676118
Dec 30 15:22:43 xyz llama-server[1368]: [56595] warmup: *****************************************************************
Dec 30 15:22:43 xyz llama-server[1368]: [56595] init_audio: audio input is in experimental stage and may have reduced quality:
Dec 30 15:22:43 xyz llama-server[1368]: [56595]     https://github.com/ggml-org/llama.cpp/discussions/13759
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: loaded multimodal model, '/opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf'
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: initializing slots, n_slots = 2
Dec 30 15:22:43 xyz llama-server[1368]: [56595] slot   load_model: id  0 | task -1 | new slot, n_ctx = 64000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] slot   load_model: id  1 | task -1 | new slot, n_ctx = 64000
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: prompt cache is enabled, size limit: 8192 MiB
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: use `--cache-ram 0` to disable the prompt cache
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    load_model: thinking = 0
Dec 30 15:22:43 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:22:43 xyz llama-server[1368]: [56595] load_model: chat template, chat_template: {{- bos_token -}}{%- set system_prompt = "" -%}{%- set ns = namespace(system_prompt="") -%}{%- if messages[0]["role"] == "system" -%} {%- set ns.system_prompt = messages[0]["content"] -%} {%- set messages = messages[1:] -%}{%- endif -%}{%- if tools -%} {%- set ns.system_prompt = ns.system_prompt + ("
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " if ns.system_prompt else "") + "List of tools: <|tool_list_start|>[" -%} {%- for tool in tools -%} {%- if tool is not string -%} {%- set tool = tool | tojson -%} {%- endif -%} {%- set ns.system_prompt = ns.system_prompt + tool -%} {%- if not loop.last -%} {%- set ns.system_prompt = ns.system_prompt + ", " -%} {%- endif -%} {%- endfor -%} {%- set ns.system_prompt = ns.system_prompt + "]<|tool_list_end|>" -%}{%- endif -%}{%- if ns.system_prompt -%} {{- "<|im_start|>system
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " + ns.system_prompt + "<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endif -%}{%- for message in messages -%} {{- "<|im_start|>" + message["role"] + "
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}} {%- set content = message["content"] -%} {%- if content is not string -%} {%- set content = content | tojson -%} {%- endif -%} {%- if message["role"] == "tool" -%} {%- set content = "<|tool_response_start|>" + content + "<|tool_response_end|>" -%} {%- endif -%} {{- content + "<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endfor -%}{%- if add_generation_prompt -%} {{- "<|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] " -}}{%- endif -%}, example_format: '<|im_start|>system
Dec 30 15:22:43 xyz llama-server[1368]: [56595] You are a helpful assistant<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>user
Dec 30 15:22:43 xyz llama-server[1368]: [56595] Hello<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] Hi there<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>user
Dec 30 15:22:43 xyz llama-server[1368]: [56595] How are you?<|im_end|>
Dec 30 15:22:43 xyz llama-server[1368]: [56595] <|im_start|>assistant
Dec 30 15:22:43 xyz llama-server[1368]: [56595] '
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: model loaded
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: server is listening on http://127.0.0.1:56595
Dec 30 15:22:43 xyz llama-server[1368]: [56595] main: starting the main loop...
Dec 30 15:22:43 xyz llama-server[1368]: [56595] cmd_child_to_router:ready
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv  update_slots: all slots are idle
Dec 30 15:22:43 xyz llama-server[1368]: [56595] srv    operator(): child server monitoring thread started, waiting for EOF on stdin...

Dec 30 15:27:04 xyz llama-server[1368]: srv  proxy_reques: proxying request to model asr on port 56595
Dec 30 15:27:04 xyz llama-server[1368]: [56595] common_chat_params_init_lfm2: Using content relying on the template
Dec 30 15:27:04 xyz llama-server[1368]: [56595] srv  params_from_: Chat format: Content-only
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot launch_slot_: id  1 | task -1 | sampler chain: logits -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot launch_slot_: id  1 | task 0 | processing task
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 64000, n_keep = 0, task.n_tokens = 155
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 14, batch.n_tokens = 14, progress = 0.090323
Dec 30 15:27:04 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | n_tokens = 14, memory_seq_rm [14, end)
Dec 30 15:27:04 xyz llama-server[1368]: [56595] srv  process_chun: processing audio...
Dec 30 15:27:04 xyz llama-server[1368]: [56595] encoding audio slice...
Dec 30 15:27:05 xyz llama-server[1368]: [56595] audio slice encoded in 732 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] decoding audio batch 1/1, n_tokens_batch = 136
Dec 30 15:27:05 xyz llama-server[1368]: [56595] audio decoded (batch 1/1) in 23 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] srv  process_chun: audio processed in 755 ms
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 155, batch.n_tokens = 5, progress = 1.000000
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | prompt done, n_tokens = 155, batch.n_tokens = 5
Dec 30 15:27:05 xyz llama-server[1368]: [56595] slot update_slots: id  1 | task 0 | created context checkpoint 1 of 8 (pos_min = 149, pos_max = 149, size = 0.156 MiB)
Dec 30 15:27:06 xyz llama-server[1368]: [56595] slot print_timing: id  1 | task 0 |
Dec 30 15:27:06 xyz llama-server[1368]: [56595] prompt eval time =     920.69 ms /   155 tokens (    5.94 ms per token,   168.35 tokens per second)
Dec 30 15:27:06 xyz llama-server[1368]: [56595]        eval time =     665.46 ms /    42 tokens (   15.84 ms per token,    63.11 tokens per second)
Dec 30 15:27:06 xyz llama-server[1368]: [56595]       total time =    1586.15 ms /   197 tokens
Dec 30 15:27:06 xyz llama-server[1368]: [56595] slot      release: id  1 | task 0 | stop processing: n_tokens = 196, truncated = 0
Dec 30 15:27:06 xyz llama-server[1368]: [56595] srv  update_slots: all slots are idle
Dec 30 15:27:06 xyz llama-server[1368]: [56595] srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Dec 30 15:27:06 xyz llama-server[1368]: srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
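For reference, the request handled in the log above is an OpenAI-style `/v1/chat/completions` call carrying base64-encoded WAV audio. A minimal sketch of building that payload in Python (the `wav_bytes` placeholder and the `"asr"` model name are assumptions; in practice read the bytes from your WAV file):

```python
import base64
import json

# Build an OpenAI-style chat-completions payload with base64-encoded WAV audio.
# wav_bytes is a stand-in; normally: pathlib.Path("input.wav").read_bytes()
wav_bytes = b"RIFF....WAVEfmt "  # placeholder, not a real WAV file

payload = {
    "model": "asr",  # assumed model alias from the router config
    "messages": [
        {"role": "system", "content": "Perform ASR."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "format": "wav",
                        "data": base64.b64encode(wav_bytes).decode("utf-8"),
                    },
                },
            ],
        },
    ],
}

# Serialize for POSTing to http://127.0.0.1:<port>/v1/chat/completions
body = json.dumps(payload)
print(body[:60])
```

The system message matters here: the model requires a system prompt (e.g. "Perform ASR.") to produce a transcription.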

@tdakhran
Copy link
Contributor Author

Thanks for testing it @elfarolab. To reduce RAM usage, specify `--no-mmap`.

@elfarolab
Copy link

Thanks for testing it @elfarolab. To reduce RAM usage, specify `--no-mmap`.

It is already set... still checking. Thanks!

@tdakhran
Copy link
Contributor Author

Another possibility is context length; setting `-c 4096` explicitly may help.

@tdakhran
Copy link
Contributor Author

To reduce RAM usage, the GGUFs can also be quantized from BF16 to Q4_0 or Q8_0 using `llama-quantize`.

@elfarolab
Copy link

Another possibility is context length; setting `-c 4096` explicitly may help.

OK, now I've got ~4.2 GB, which is better, and the GGUF is not yet quantized.

my model.ini:

[asr]
load-on-startup = true
m = /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/LFM2-Audio-1.5B-BF16.gguf
mm = /opt/usbhd/models/LFM2-Audio-1.5B-GGUF_LiquidAI/mmproj-LFM2-Audio-1.5b-BF16.gguf
c = 4096
;threads = 2
temp = 0
;min-p = 0.15
;presence-penalty = 1.05
fit = off
ngl = -1
fa = on
;mlock = on
mmap = off
b = 2048
ub = 2048

What about these:

temp = 0
;min-p = 0.15
;presence-penalty = 1.05

@tdakhran
Copy link
Contributor Author

Text generation for ASR with LFM2-Audio-1.5B is greedy, so the params above look good.
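Greedy decoding just picks the highest-logit token at each step, which is why `temp = 0` is the right setting and the commented-out `min-p` / `presence-penalty` values have no effect. A minimal sketch (the logits are made up for illustration; this is not the actual llama.cpp sampler chain, only the behavior it reduces to at temperature 0):

```python
def greedy_pick(logits):
    """Return the index of the largest logit (argmax) - greedy decoding."""
    best_id, best_val = 0, logits[0]
    for i, v in enumerate(logits):
        if v > best_val:
            best_id, best_val = i, v
    return best_id

# Hypothetical per-token logits for a 4-token vocabulary
logits = [0.1, 2.5, -1.0, 2.4]
print(greedy_pick(logits))  # -> 1
```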
