
Adding --direct-io flag for model loading #18166

Merged
ggerganov merged 11 commits into ggml-org:master from JTischbein:direct_io_flag
Jan 8, 2026

Conversation

@JTischbein
Contributor

Follow-up to PR #18012 (comment).

To enable Direct I/O model reading by default on Linux and Windows while keeping --mmap as the default on macOS, this PR adds a flag for enabling/disabling Direct I/O. The flag defaults to true and takes precedence over the mmap parameter: if --direct-io is true and Direct I/O is available, --mmap is disabled; if --no-direct-io is set or Direct I/O is not available (e.g. on macOS), the specified mmap value is used.

@ggerganov
Member

I think you need this:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }
 
     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {

Probably need to assert that llama_file::read_raw is never used with direct io?

@JTischbein
Contributor Author

Thanks for the hint, changed that.

I think an assert would not work, as read_raw in its current form is still needed. Would you suggest renaming read_raw_at to read_raw (using tell instead of the offset argument)? That way read_raw can be safely used again, and in the load_all_data loop the current read_raw is called (as read_raw_direct?)

@ggerganov
Member

Would you suggest renaming read_raw_at to read_raw (using tell instead of the offset argument)? That way read_raw can be safely used again, and in the load_all_data loop the current read_raw is called (as read_raw_direct?)

Ok. Would we even need read_raw_direct in this case? If we still need it for some reason, then maybe call it read_raw_unsafe so as not to overload the word "direct" with more meanings.


Also some suggestions that I have not tested, but should at least convey what I mean:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
     contexts.emplace_back(ctx);
 
-    // Disable mmap in case Direct I/O is enabled and available
-    if (use_direct_io && files.at(0)->has_direct_io()) {
+    // check if direct io is enabled and supported
+    use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+    if (use_direct_io && use_mmap) {
+        LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
         use_mmap = false;
     }
 
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);
 
+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }
 
     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t   n_bytes    = 0;
 
     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
 
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
 
     const bool use_mmap_buffer = true;
 
-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
 
     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);

@askmyteapot

Just an FYI: #18012 broke loading with mmap disabled on Windows.

@JTischbein
Contributor Author

@askmyteapot Thank you for the hint. The issue was that I used off_t, which is a signed long on Windows. The fix is in this PR.

@JTischbein
Contributor Author

@ggerganov Should I isolate the changes with the Windows fix in a new PR?

@ggerganov
Member

Yes, I'd like to take an extra look at the changes here, so it's better to fix the Windows issue separately in the meantime. Thanks

Contributor

@NeoZhangJianyu left a comment


What are the parameter combinations to load the model file in this PR?
Here is my understanding, please correct me if it's wrong:
--no-mmap -ndio
--no-mmap -dio
--mmap

When must a user use -ndio?

I think the parameters are a little complex.
Here is my suggestion:
We should keep --mmap and --no-mmap.
In the --no-mmap case, the code should detect and switch between dio and ndio automatically on Windows/Linux/macOS.

@ehoogeveen-medweb

IIUC, the implementation in this PR currently requires passing --no-direct-io to enable mmap, and --no-direct-io --no-mmap to disable both. I think it should also recognize that if a user passes just --mmap, they want to disable Direct IO and use mmap (as the two options are mutually exclusive).

In other words, --mmap should imply --no-direct-io (and --direct-io should imply --no-mmap, although that doesn't matter with the current logic). Aside from that case, I think the logic is reasonable, assuming that preferring Direct IO over mmap is the way to go.

@JTischbein
Contributor Author

I agree with @ehoogeveen-medweb; explicitly specifying --mmap now disables direct I/O. Handling model loading with only --mmap and --no-mmap does not work with the current implementation, as we need the code paths for loading via mmap, read() and std::fread().

Now there are three ways:

  • Default (implicitly -dio --mmap): load via direct I/O, falling back to mmap if direct I/O is not available
  • Explicitly specifying --mmap: load via mmap
  • Explicitly specifying --no-mmap -ndio: load via std::fread()

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Dec 29, 2025

@JTischbein
I tested this PR with the SYCL backend on Ubuntu 22.04.
All parameter combinations work.
But the parameter -dio makes loading slower than the -ndio cases, roughly 2-4x by my estimate.

Here are my test cases and results:

Parameters           Ubuntu 22.04/SYCL
--mmap               quick
--no-mmap            slow
--mmap -dio          slow
--mmap -ndio         quick
--no-mmap -dio       slow
--no-mmap -ndio      quick
(no flags)           slow

Why does -dio reduce performance?
That runs counter to the purpose of this PR.

Could you confirm it on an Intel Core CPU (12th gen or newer) with an iGPU (Windows/Linux)?
And please add more test results to the table above for different OSes and backends,
so that we can see the impact of the new feature.

Here is the test cmd:

./build/bin/llama-server --model ./models/llama-2-7b.Q4_0.gguf -ngl 99  --ctx-size 5000 --no-mmap -dio

Thank you!

@JTischbein
Contributor Author

JTischbein commented Dec 29, 2025

@NeoZhangJianyu I have found the cause of the performance drop, and probably why it only works on some devices with the SYCL and Vulkan backends. The current code uses an additional CPU buffer from which the tensors are then copied into the pinned memory buffers (making loading slower depending on memcpy speed). If I disable this additional buffer again, we see the error Bad address on some devices (only with the SYCL and Vulkan backends so far). The likely issue is that the read needs a destination buffer backed by CPU RAM pages so that DMA from disk can be done, but the allocated buffer lives on the device with its pages mapped into CPU memory (which then leads to Bad address). I unfortunately don't have access to an Intel iGPU and have no experience with the SYCL backend; could you help fix the allocation of pinned memory? Thank you

@jeffbolznv
Contributor

I don't fully understand this theory. These are UMA systems, are you suggesting that these pages are in a carveout region or something similar that makes them inaccessible for DMA? I don't think we have a way to detect that in the backend. Something like #18317 (comment) is the best we could try but that didn't work.

Is it possible to handle this error in the model loader and fallback to the non-directio path? That seems like it would be a more robust solution.

@jeffbolznv
Contributor

Another solution might be to allocate these buffers with (aligned) malloc and use the buffer_from_host_ptr function to make them GPU-accessible. This isn't currently implemented in ggml-vulkan, but it's probably not too difficult to add.

@JTischbein
Contributor Author

@jeffbolznv Yes, that is currently my guess. Multiple people have now reported that this PR fixed the reading issues, and the difference between master and this PR is that the buffer pointer given to read points into a posix_memalign buffer rather than into the buffer provided by the backend.

read with O_DIRECT enabled has the DMA controller copy the data from disk directly into physical memory, bypassing the page cache. Is it possible that mapped memory leads to issues here?

If it can be tested without too much effort, it would be great to know whether buffer_from_host_ptr works.

@NeoZhangJianyu
Contributor

@engrtipusultan thank you for testing and sharing! @NeoZhangJianyu I have posted the performance benefits of the uncached reads in #18012. I don't have access to an Intel Core CPU with iGPU.

To bridge the time until other backends support DMA-compatible pinned memory buffers, I will try to create a fallback to mmap for backends that don't support these buffers. Vulkan and SYCL have issues on some UMA devices.

@JTischbein
OK, that's great!
I could help test your new solution if needed.

Thank you!

@NeoZhangJianyu
Contributor

I tested with an old version: 93bb92664e37086cb4adab24f4911f7681724fce,
which predates #18012.
That llama-server supports the parameter --no-mmap (mmap is used when it is absent).
I find the load performance ordering is mmap > --no-mmap >> -dio (#18012).

--mmap is the first choice for the SYCL backend.
In some special cases (like PVC), mmap can't work well, so users have to use --no-mmap.

For the SYCL backend, I am satisfied with the model load performance of the old version.

I only hope that model loading keeps working and performance is not reduced.
For example, you could detect the backend type and keep the legacy (pre-#18012) behavior for the SYCL backend.

Thank you!

@ddh0
Contributor

ddh0 commented Jan 3, 2026

Hello, thanks for this PR - has anyone tested how this affects offloaded MoE models where the experts are memory-mapped from disk? Does this PR provide any speedup in this case?

@engrtipusultan

engrtipusultan commented Jan 5, 2026

After the merge, with (mmap = false, direct_io = true):

llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf'
srv    load_model: failed to load model, '/home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

With (mmap = false, direct_io = false) it works, but seems about 10% slower on the server: 10.98 tokens per second.

With a previous build I made from this PR, (mmap = false, direct_io = true) was working:
12.20 tokens per second, and in the past sometimes 13.

Also, -dio and -ndio are not implemented in llama-bench, so I cannot run a benchmark to measure how much slower it is after the new merge.

If you need any further logs from my side, do let me know.

@JTischbein
Contributor Author

@engrtipusultan Thank you for testing! Which backend are you using? And do you have any more logs? There should be a line like llama_model_load: error loading model: read error: Bad address

@engrtipusultan

I am using Vulkan. I created #18317 earlier.

Kernel: 6.14.0-37-generic, OS: Ubuntu 24.04, AMD Ryzen 7 5825U with Vega 8 APU, 64 GB of RAM. Using BIOS UMA plus the GTT/TTM module to make the memory accessible.

Logs

bash  ./llama-server -m /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf --port 8888 --jinja --n-predict -1 --n-gpu-layers 99 --flash-attn off --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.01 --ctx-size 30768 --mlock --ubatch-size 512 --batch-size 2048 --cache-reuse 512 --cache-ram -1 --ctx-checkpoints 16 --offline --parallel 1 --kv-unified --context-shift --prio-batch 3 --prio 3 --alias Qwen3-Coder-30B-A3B --no-mmap -v -dio
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 7637 (baa6be686) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV RENOIR)) (0000:03:00.0) - 57310 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:               qwen3moe.feed_forward_length u32              = 5472
llama_model_loader: - kv  19:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  20:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  21:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  22:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  24:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  27:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  28: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {# Copyright 2025-present Unsloth. Ap...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 7
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = Qwen3-Coder-30B-A3B-Instruct-GGUF/ima...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-Coder-30B-A...
llama_model_loader: - kv  42:             quantize.imatrix.entries_count u32              = 384
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count u32              = 154
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: no_alloc         = 1
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5472
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   1 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   2 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   3 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   4 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   5 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   6 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   7 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   8 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   9 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  10 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  11 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  12 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  13 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  14 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  16 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  17 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  18 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  19 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  20 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  21 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  22 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  23 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  24 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  25 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  26 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  27 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  28 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  29 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  30 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  31 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  32 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  33 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  34 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  35 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  37 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  38 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  39 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  40 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  41 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  42 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  43 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  44 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  45 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  46 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  47 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  48 assigned to device Vulkan0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_k_norm.weight
create_tensor: loading tensor blk.0.attn_q_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate_inp.weight
create_tensor: loading tensor blk.0.ffn_gate_exps.weight
create_tensor: loading tensor blk.0.ffn_down_exps.weight
create_tensor: loading tensor blk.0.ffn_up_exps.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_k_norm.weight
create_tensor: loading tensor blk.1.attn_q_norm.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_k_norm.weight
create_tensor: loading tensor blk.2.attn_q_norm.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_k_norm.weight
create_tensor: loading tensor blk.4.attn_q_norm.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_k_norm.weight
create_tensor: loading tensor blk.5.attn_q_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_k_norm.weight
create_tensor: loading tensor blk.6.attn_q_norm.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_k_norm.weight
create_tensor: loading tensor blk.8.attn_q_norm.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_k_norm.weight
create_tensor: loading tensor blk.9.attn_q_norm.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_k_norm.weight
create_tensor: loading tensor blk.10.attn_q_norm.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_k_norm.weight
create_tensor: loading tensor blk.12.attn_q_norm.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_k_norm.weight
create_tensor: loading tensor blk.13.attn_q_norm.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_k_norm.weight
create_tensor: loading tensor blk.14.attn_q_norm.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_k_norm.weight
create_tensor: loading tensor blk.16.attn_q_norm.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_k_norm.weight
create_tensor: loading tensor blk.17.attn_q_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_k_norm.weight
create_tensor: loading tensor blk.18.attn_q_norm.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_k_norm.weight
create_tensor: loading tensor blk.20.attn_q_norm.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_k_norm.weight
create_tensor: loading tensor blk.21.attn_q_norm.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_k_norm.weight
create_tensor: loading tensor blk.22.attn_q_norm.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_k_norm.weight
create_tensor: loading tensor blk.23.attn_q_norm.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_k_norm.weight
create_tensor: loading tensor blk.24.attn_q_norm.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate_inp.weight
create_tensor: loading tensor blk.24.ffn_gate_exps.weight
create_tensor: loading tensor blk.24.ffn_down_exps.weight
create_tensor: loading tensor blk.24.ffn_up_exps.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_k_norm.weight
create_tensor: loading tensor blk.25.attn_q_norm.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate_inp.weight
create_tensor: loading tensor blk.25.ffn_gate_exps.weight
create_tensor: loading tensor blk.25.ffn_down_exps.weight
create_tensor: loading tensor blk.25.ffn_up_exps.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_k_norm.weight
create_tensor: loading tensor blk.26.attn_q_norm.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate_inp.weight
create_tensor: loading tensor blk.26.ffn_gate_exps.weight
create_tensor: loading tensor blk.26.ffn_down_exps.weight
create_tensor: loading tensor blk.26.ffn_up_exps.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_k_norm.weight
create_tensor: loading tensor blk.27.attn_q_norm.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate_inp.weight
create_tensor: loading tensor blk.27.ffn_gate_exps.weight
create_tensor: loading tensor blk.27.ffn_down_exps.weight
create_tensor: loading tensor blk.27.ffn_up_exps.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_k_norm.weight
create_tensor: loading tensor blk.28.attn_q_norm.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate_inp.weight
create_tensor: loading tensor blk.28.ffn_gate_exps.weight
create_tensor: loading tensor blk.28.ffn_down_exps.weight
create_tensor: loading tensor blk.28.ffn_up_exps.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_v.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_k_norm.weight
create_tensor: loading tensor blk.29.attn_q_norm.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate_inp.weight
create_tensor: loading tensor blk.29.ffn_gate_exps.weight
create_tensor: loading tensor blk.29.ffn_down_exps.weight
create_tensor: loading tensor blk.29.ffn_up_exps.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_q.weight
create_tensor: loading tensor blk.30.attn_k.weight
create_tensor: loading tensor blk.30.attn_v.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_k_norm.weight
create_tensor: loading tensor blk.30.attn_q_norm.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate_inp.weight
create_tensor: loading tensor blk.30.ffn_gate_exps.weight
create_tensor: loading tensor blk.30.ffn_down_exps.weight
create_tensor: loading tensor blk.30.ffn_up_exps.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_k_norm.weight
create_tensor: loading tensor blk.31.attn_q_norm.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate_inp.weight
create_tensor: loading tensor blk.31.ffn_gate_exps.weight
create_tensor: loading tensor blk.31.ffn_down_exps.weight
create_tensor: loading tensor blk.31.ffn_up_exps.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_q.weight
create_tensor: loading tensor blk.32.attn_k.weight
create_tensor: loading tensor blk.32.attn_v.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_k_norm.weight
create_tensor: loading tensor blk.32.attn_q_norm.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate_inp.weight
create_tensor: loading tensor blk.32.ffn_gate_exps.weight
create_tensor: loading tensor blk.32.ffn_down_exps.weight
create_tensor: loading tensor blk.32.ffn_up_exps.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_q.weight
create_tensor: loading tensor blk.33.attn_k.weight
create_tensor: loading tensor blk.33.attn_v.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_k_norm.weight
create_tensor: loading tensor blk.33.attn_q_norm.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate_inp.weight
create_tensor: loading tensor blk.33.ffn_gate_exps.weight
create_tensor: loading tensor blk.33.ffn_down_exps.weight
create_tensor: loading tensor blk.33.ffn_up_exps.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_q.weight
create_tensor: loading tensor blk.34.attn_k.weight
create_tensor: loading tensor blk.34.attn_v.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_k_norm.weight
create_tensor: loading tensor blk.34.attn_q_norm.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate_inp.weight
create_tensor: loading tensor blk.34.ffn_gate_exps.weight
create_tensor: loading tensor blk.34.ffn_down_exps.weight
create_tensor: loading tensor blk.34.ffn_up_exps.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.attn_q.weight
create_tensor: loading tensor blk.35.attn_k.weight
create_tensor: loading tensor blk.35.attn_v.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_k_norm.weight
create_tensor: loading tensor blk.35.attn_q_norm.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate_inp.weight
create_tensor: loading tensor blk.35.ffn_gate_exps.weight
create_tensor: loading tensor blk.35.ffn_down_exps.weight
create_tensor: loading tensor blk.35.ffn_up_exps.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_q.weight
create_tensor: loading tensor blk.36.attn_k.weight
create_tensor: loading tensor blk.36.attn_v.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_k_norm.weight
create_tensor: loading tensor blk.36.attn_q_norm.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate_inp.weight
create_tensor: loading tensor blk.36.ffn_gate_exps.weight
create_tensor: loading tensor blk.36.ffn_down_exps.weight
create_tensor: loading tensor blk.36.ffn_up_exps.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_q.weight
create_tensor: loading tensor blk.37.attn_k.weight
create_tensor: loading tensor blk.37.attn_v.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_k_norm.weight
create_tensor: loading tensor blk.37.attn_q_norm.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate_inp.weight
create_tensor: loading tensor blk.37.ffn_gate_exps.weight
create_tensor: loading tensor blk.37.ffn_down_exps.weight
create_tensor: loading tensor blk.37.ffn_up_exps.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_q.weight
create_tensor: loading tensor blk.38.attn_k.weight
create_tensor: loading tensor blk.38.attn_v.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_k_norm.weight
create_tensor: loading tensor blk.38.attn_q_norm.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate_inp.weight
create_tensor: loading tensor blk.38.ffn_gate_exps.weight
create_tensor: loading tensor blk.38.ffn_down_exps.weight
create_tensor: loading tensor blk.38.ffn_up_exps.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_q.weight
create_tensor: loading tensor blk.39.attn_k.weight
create_tensor: loading tensor blk.39.attn_v.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_k_norm.weight
create_tensor: loading tensor blk.39.attn_q_norm.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate_inp.weight
create_tensor: loading tensor blk.39.ffn_gate_exps.weight
create_tensor: loading tensor blk.39.ffn_down_exps.weight
create_tensor: loading tensor blk.39.ffn_up_exps.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_q.weight
create_tensor: loading tensor blk.40.attn_k.weight
create_tensor: loading tensor blk.40.attn_v.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_k_norm.weight
create_tensor: loading tensor blk.40.attn_q_norm.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate_inp.weight
create_tensor: loading tensor blk.40.ffn_gate_exps.weight
create_tensor: loading tensor blk.40.ffn_down_exps.weight
create_tensor: loading tensor blk.40.ffn_up_exps.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.attn_q.weight
create_tensor: loading tensor blk.41.attn_k.weight
create_tensor: loading tensor blk.41.attn_v.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_k_norm.weight
create_tensor: loading tensor blk.41.attn_q_norm.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate_inp.weight
create_tensor: loading tensor blk.41.ffn_gate_exps.weight
create_tensor: loading tensor blk.41.ffn_down_exps.weight
create_tensor: loading tensor blk.41.ffn_up_exps.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_q.weight
create_tensor: loading tensor blk.42.attn_k.weight
create_tensor: loading tensor blk.42.attn_v.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_k_norm.weight
create_tensor: loading tensor blk.42.attn_q_norm.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate_inp.weight
create_tensor: loading tensor blk.42.ffn_gate_exps.weight
create_tensor: loading tensor blk.42.ffn_down_exps.weight
create_tensor: loading tensor blk.42.ffn_up_exps.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_q.weight
create_tensor: loading tensor blk.43.attn_k.weight
create_tensor: loading tensor blk.43.attn_v.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_k_norm.weight
create_tensor: loading tensor blk.43.attn_q_norm.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate_inp.weight
create_tensor: loading tensor blk.43.ffn_gate_exps.weight
create_tensor: loading tensor blk.43.ffn_down_exps.weight
create_tensor: loading tensor blk.43.ffn_up_exps.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_q.weight
create_tensor: loading tensor blk.44.attn_k.weight
create_tensor: loading tensor blk.44.attn_v.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_k_norm.weight
create_tensor: loading tensor blk.44.attn_q_norm.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate_inp.weight
create_tensor: loading tensor blk.44.ffn_gate_exps.weight
create_tensor: loading tensor blk.44.ffn_down_exps.weight
create_tensor: loading tensor blk.44.ffn_up_exps.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_k_norm.weight
create_tensor: loading tensor blk.45.attn_q_norm.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_q.weight
create_tensor: loading tensor blk.46.attn_k.weight
create_tensor: loading tensor blk.46.attn_v.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_k_norm.weight
create_tensor: loading tensor blk.46.attn_q_norm.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.attn_q.weight
create_tensor: loading tensor blk.47.attn_k.weight
create_tensor: loading tensor blk.47.attn_v.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_k_norm.weight
create_tensor: loading tensor blk.47.attn_q_norm.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
create_tensor: loading tensor blk.47.ffn_down_exps.weight
create_tensor: loading tensor blk.47.ffn_up_exps.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:      Vulkan0 model buffer size =     0.00 MiB
load_tensors:  Vulkan_Host model buffer size =     0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 30976
llama_context: n_ctx_seq     = 30976
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (30976) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
llama_kv_cache: layer   0: dev = Vulkan0
llama_kv_cache: layer   1: dev = Vulkan0
llama_kv_cache: layer   2: dev = Vulkan0
llama_kv_cache: layer   3: dev = Vulkan0
llama_kv_cache: layer   4: dev = Vulkan0
llama_kv_cache: layer   5: dev = Vulkan0
llama_kv_cache: layer   6: dev = Vulkan0
llama_kv_cache: layer   7: dev = Vulkan0
llama_kv_cache: layer   8: dev = Vulkan0
llama_kv_cache: layer   9: dev = Vulkan0
llama_kv_cache: layer  10: dev = Vulkan0
llama_kv_cache: layer  11: dev = Vulkan0
llama_kv_cache: layer  12: dev = Vulkan0
llama_kv_cache: layer  13: dev = Vulkan0
llama_kv_cache: layer  14: dev = Vulkan0
llama_kv_cache: layer  15: dev = Vulkan0
llama_kv_cache: layer  16: dev = Vulkan0
llama_kv_cache: layer  17: dev = Vulkan0
llama_kv_cache: layer  18: dev = Vulkan0
llama_kv_cache: layer  19: dev = Vulkan0
llama_kv_cache: layer  20: dev = Vulkan0
llama_kv_cache: layer  21: dev = Vulkan0
llama_kv_cache: layer  22: dev = Vulkan0
llama_kv_cache: layer  23: dev = Vulkan0
llama_kv_cache: layer  24: dev = Vulkan0
llama_kv_cache: layer  25: dev = Vulkan0
llama_kv_cache: layer  26: dev = Vulkan0
llama_kv_cache: layer  27: dev = Vulkan0
llama_kv_cache: layer  28: dev = Vulkan0
llama_kv_cache: layer  29: dev = Vulkan0
llama_kv_cache: layer  30: dev = Vulkan0
llama_kv_cache: layer  31: dev = Vulkan0
llama_kv_cache: layer  32: dev = Vulkan0
llama_kv_cache: layer  33: dev = Vulkan0
llama_kv_cache: layer  34: dev = Vulkan0
llama_kv_cache: layer  35: dev = Vulkan0
llama_kv_cache: layer  36: dev = Vulkan0
llama_kv_cache: layer  37: dev = Vulkan0
llama_kv_cache: layer  38: dev = Vulkan0
llama_kv_cache: layer  39: dev = Vulkan0
llama_kv_cache: layer  40: dev = Vulkan0
llama_kv_cache: layer  41: dev = Vulkan0
llama_kv_cache: layer  42: dev = Vulkan0
llama_kv_cache: layer  43: dev = Vulkan0
llama_kv_cache: layer  44: dev = Vulkan0
llama_kv_cache: layer  45: dev = Vulkan0
llama_kv_cache: layer  46: dev = Vulkan0
llama_kv_cache: layer  47: dev = Vulkan0
llama_kv_cache:    Vulkan0 KV buffer size =     0.00 MiB
llama_kv_cache: size = 2904.00 MiB ( 30976 cells,  48 layers,  1/1 seqs), K (f16): 1452.00 MiB, V (f16): 1452.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 4632
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:    Vulkan0 compute buffer size =  3960.51 MiB
llama_context: Vulkan_Host compute buffer size =    66.52 MiB
llama_context: graph nodes  = 3270
llama_context: graph splits = 2
llama_memory_breakdown_print: | memory breakdown [MiB]               | total    free     self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - Vulkan0 (Graphics (RADV RENOIR)) | 58368 = 57309 + (37522 = 30658 +    2904 +    3960) + 17592186007951 |
llama_memory_breakdown_print: |   - Host                             |                    381 =   315 +       0 +      66                   |
llama_params_fit_impl: projected to use 37522 MiB of device memory vs. 57309 MiB of free device memory
llama_params_fit_impl: will leave 19787 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.28 seconds
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV RENOIR)) (0000:03:00.0) - 57310 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 579 tensors from /home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-Coder-30B-A3B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Qwen3 Coder 30B A3B Instruct
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-Cod...
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:               qwen3moe.feed_forward_length u32              = 5472
llama_model_loader: - kv  19:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  20:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  21:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  22:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  24:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  27:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  28: qwen3moe.expert_shared_feed_forward_length u32              = 0
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  34:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {# Copyright 2025-present Unsloth. Ap...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 7
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = Qwen3-Coder-30B-A3B-Instruct-GGUF/ima...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-Coder-30B-A...
llama_model_loader: - kv  42:             quantize.imatrix.entries_count u32              = 384
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count u32              = 154
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5472
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3-Coder-30B-A3B-Instruct
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   1 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   2 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   3 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   4 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   5 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   6 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   7 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   8 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   9 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  10 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  11 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  12 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  13 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  14 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  16 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  17 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  18 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  19 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  20 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  21 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  22 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  23 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  24 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  25 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  26 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  27 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  28 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  29 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  30 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  31 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  32 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  33 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  34 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  35 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  36 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  37 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  38 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  39 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  40 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  41 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  42 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  43 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  44 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  45 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  46 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  47 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  48 assigned to device Vulkan0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_k_norm.weight
create_tensor: loading tensor blk.0.attn_q_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate_inp.weight
create_tensor: loading tensor blk.0.ffn_gate_exps.weight
create_tensor: loading tensor blk.0.ffn_down_exps.weight
create_tensor: loading tensor blk.0.ffn_up_exps.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_k_norm.weight
create_tensor: loading tensor blk.1.attn_q_norm.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_k_norm.weight
create_tensor: loading tensor blk.2.attn_q_norm.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_k_norm.weight
create_tensor: loading tensor blk.4.attn_q_norm.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_k_norm.weight
create_tensor: loading tensor blk.5.attn_q_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_k_norm.weight
create_tensor: loading tensor blk.6.attn_q_norm.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_k_norm.weight
create_tensor: loading tensor blk.8.attn_q_norm.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_k_norm.weight
create_tensor: loading tensor blk.9.attn_q_norm.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_k_norm.weight
create_tensor: loading tensor blk.10.attn_q_norm.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_k_norm.weight
create_tensor: loading tensor blk.12.attn_q_norm.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_k_norm.weight
create_tensor: loading tensor blk.13.attn_q_norm.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_k_norm.weight
create_tensor: loading tensor blk.14.attn_q_norm.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_k_norm.weight
create_tensor: loading tensor blk.16.attn_q_norm.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_k_norm.weight
create_tensor: loading tensor blk.17.attn_q_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_k_norm.weight
create_tensor: loading tensor blk.18.attn_q_norm.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_k_norm.weight
create_tensor: loading tensor blk.20.attn_q_norm.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_k_norm.weight
create_tensor: loading tensor blk.21.attn_q_norm.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_k_norm.weight
create_tensor: loading tensor blk.22.attn_q_norm.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_k_norm.weight
create_tensor: loading tensor blk.23.attn_q_norm.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_k_norm.weight
create_tensor: loading tensor blk.24.attn_q_norm.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate_inp.weight
create_tensor: loading tensor blk.24.ffn_gate_exps.weight
create_tensor: loading tensor blk.24.ffn_down_exps.weight
create_tensor: loading tensor blk.24.ffn_up_exps.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_k_norm.weight
create_tensor: loading tensor blk.25.attn_q_norm.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate_inp.weight
create_tensor: loading tensor blk.25.ffn_gate_exps.weight
create_tensor: loading tensor blk.25.ffn_down_exps.weight
create_tensor: loading tensor blk.25.ffn_up_exps.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_k_norm.weight
create_tensor: loading tensor blk.26.attn_q_norm.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate_inp.weight
create_tensor: loading tensor blk.26.ffn_gate_exps.weight
create_tensor: loading tensor blk.26.ffn_down_exps.weight
create_tensor: loading tensor blk.26.ffn_up_exps.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_k_norm.weight
create_tensor: loading tensor blk.27.attn_q_norm.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate_inp.weight
create_tensor: loading tensor blk.27.ffn_gate_exps.weight
create_tensor: loading tensor blk.27.ffn_down_exps.weight
create_tensor: loading tensor blk.27.ffn_up_exps.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_k_norm.weight
create_tensor: loading tensor blk.28.attn_q_norm.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate_inp.weight
create_tensor: loading tensor blk.28.ffn_gate_exps.weight
create_tensor: loading tensor blk.28.ffn_down_exps.weight
create_tensor: loading tensor blk.28.ffn_up_exps.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_v.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_k_norm.weight
create_tensor: loading tensor blk.29.attn_q_norm.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate_inp.weight
create_tensor: loading tensor blk.29.ffn_gate_exps.weight
create_tensor: loading tensor blk.29.ffn_down_exps.weight
create_tensor: loading tensor blk.29.ffn_up_exps.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_q.weight
create_tensor: loading tensor blk.30.attn_k.weight
create_tensor: loading tensor blk.30.attn_v.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_k_norm.weight
create_tensor: loading tensor blk.30.attn_q_norm.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate_inp.weight
create_tensor: loading tensor blk.30.ffn_gate_exps.weight
create_tensor: loading tensor blk.30.ffn_down_exps.weight
create_tensor: loading tensor blk.30.ffn_up_exps.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_k_norm.weight
create_tensor: loading tensor blk.31.attn_q_norm.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate_inp.weight
create_tensor: loading tensor blk.31.ffn_gate_exps.weight
create_tensor: loading tensor blk.31.ffn_down_exps.weight
create_tensor: loading tensor blk.31.ffn_up_exps.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_q.weight
create_tensor: loading tensor blk.32.attn_k.weight
create_tensor: loading tensor blk.32.attn_v.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_k_norm.weight
create_tensor: loading tensor blk.32.attn_q_norm.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate_inp.weight
create_tensor: loading tensor blk.32.ffn_gate_exps.weight
create_tensor: loading tensor blk.32.ffn_down_exps.weight
create_tensor: loading tensor blk.32.ffn_up_exps.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_q.weight
create_tensor: loading tensor blk.33.attn_k.weight
create_tensor: loading tensor blk.33.attn_v.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_k_norm.weight
create_tensor: loading tensor blk.33.attn_q_norm.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate_inp.weight
create_tensor: loading tensor blk.33.ffn_gate_exps.weight
create_tensor: loading tensor blk.33.ffn_down_exps.weight
create_tensor: loading tensor blk.33.ffn_up_exps.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_q.weight
create_tensor: loading tensor blk.34.attn_k.weight
create_tensor: loading tensor blk.34.attn_v.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_k_norm.weight
create_tensor: loading tensor blk.34.attn_q_norm.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate_inp.weight
create_tensor: loading tensor blk.34.ffn_gate_exps.weight
create_tensor: loading tensor blk.34.ffn_down_exps.weight
create_tensor: loading tensor blk.34.ffn_up_exps.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.attn_q.weight
create_tensor: loading tensor blk.35.attn_k.weight
create_tensor: loading tensor blk.35.attn_v.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_k_norm.weight
create_tensor: loading tensor blk.35.attn_q_norm.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate_inp.weight
create_tensor: loading tensor blk.35.ffn_gate_exps.weight
create_tensor: loading tensor blk.35.ffn_down_exps.weight
create_tensor: loading tensor blk.35.ffn_up_exps.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_q.weight
create_tensor: loading tensor blk.36.attn_k.weight
create_tensor: loading tensor blk.36.attn_v.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_k_norm.weight
create_tensor: loading tensor blk.36.attn_q_norm.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate_inp.weight
create_tensor: loading tensor blk.36.ffn_gate_exps.weight
create_tensor: loading tensor blk.36.ffn_down_exps.weight
create_tensor: loading tensor blk.36.ffn_up_exps.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_q.weight
create_tensor: loading tensor blk.37.attn_k.weight
create_tensor: loading tensor blk.37.attn_v.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_k_norm.weight
create_tensor: loading tensor blk.37.attn_q_norm.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate_inp.weight
create_tensor: loading tensor blk.37.ffn_gate_exps.weight
create_tensor: loading tensor blk.37.ffn_down_exps.weight
create_tensor: loading tensor blk.37.ffn_up_exps.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_q.weight
create_tensor: loading tensor blk.38.attn_k.weight
create_tensor: loading tensor blk.38.attn_v.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_k_norm.weight
create_tensor: loading tensor blk.38.attn_q_norm.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate_inp.weight
create_tensor: loading tensor blk.38.ffn_gate_exps.weight
create_tensor: loading tensor blk.38.ffn_down_exps.weight
create_tensor: loading tensor blk.38.ffn_up_exps.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_q.weight
create_tensor: loading tensor blk.39.attn_k.weight
create_tensor: loading tensor blk.39.attn_v.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_k_norm.weight
create_tensor: loading tensor blk.39.attn_q_norm.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate_inp.weight
create_tensor: loading tensor blk.39.ffn_gate_exps.weight
create_tensor: loading tensor blk.39.ffn_down_exps.weight
create_tensor: loading tensor blk.39.ffn_up_exps.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_q.weight
create_tensor: loading tensor blk.40.attn_k.weight
create_tensor: loading tensor blk.40.attn_v.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_k_norm.weight
create_tensor: loading tensor blk.40.attn_q_norm.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate_inp.weight
create_tensor: loading tensor blk.40.ffn_gate_exps.weight
create_tensor: loading tensor blk.40.ffn_down_exps.weight
create_tensor: loading tensor blk.40.ffn_up_exps.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.attn_q.weight
create_tensor: loading tensor blk.41.attn_k.weight
create_tensor: loading tensor blk.41.attn_v.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_k_norm.weight
create_tensor: loading tensor blk.41.attn_q_norm.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate_inp.weight
create_tensor: loading tensor blk.41.ffn_gate_exps.weight
create_tensor: loading tensor blk.41.ffn_down_exps.weight
create_tensor: loading tensor blk.41.ffn_up_exps.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_q.weight
create_tensor: loading tensor blk.42.attn_k.weight
create_tensor: loading tensor blk.42.attn_v.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_k_norm.weight
create_tensor: loading tensor blk.42.attn_q_norm.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate_inp.weight
create_tensor: loading tensor blk.42.ffn_gate_exps.weight
create_tensor: loading tensor blk.42.ffn_down_exps.weight
create_tensor: loading tensor blk.42.ffn_up_exps.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_q.weight
create_tensor: loading tensor blk.43.attn_k.weight
create_tensor: loading tensor blk.43.attn_v.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_k_norm.weight
create_tensor: loading tensor blk.43.attn_q_norm.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate_inp.weight
create_tensor: loading tensor blk.43.ffn_gate_exps.weight
create_tensor: loading tensor blk.43.ffn_down_exps.weight
create_tensor: loading tensor blk.43.ffn_up_exps.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_q.weight
create_tensor: loading tensor blk.44.attn_k.weight
create_tensor: loading tensor blk.44.attn_v.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_k_norm.weight
create_tensor: loading tensor blk.44.attn_q_norm.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate_inp.weight
create_tensor: loading tensor blk.44.ffn_gate_exps.weight
create_tensor: loading tensor blk.44.ffn_down_exps.weight
create_tensor: loading tensor blk.44.ffn_up_exps.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_k_norm.weight
create_tensor: loading tensor blk.45.attn_q_norm.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_q.weight
create_tensor: loading tensor blk.46.attn_k.weight
create_tensor: loading tensor blk.46.attn_v.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_k_norm.weight
create_tensor: loading tensor blk.46.attn_q_norm.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.attn_q.weight
create_tensor: loading tensor blk.47.attn_k.weight
create_tensor: loading tensor blk.47.attn_v.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_k_norm.weight
create_tensor: loading tensor blk.47.attn_q_norm.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
create_tensor: loading tensor blk.47.ffn_down_exps.weight
create_tensor: loading tensor blk.47.ffn_up_exps.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:      Vulkan0 model buffer size = 30658.10 MiB
load_tensors:  Vulkan_Host model buffer size =   315.30 MiB
load_all_data: using async uploads for device Vulkan0, buffer type Vulkan0, backend Vulkan0
llama_model_load: error loading model: read error: Bad address
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf'
srv    load_model: failed to load model, '/home/tipu/AI/models/ggml-org/Qwen3-Coder-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-Q8_0.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

@engrtipusultan

After the latest commit, things appear to be back to normal for me with mmap disabled and direct IO enabled. Thank you very much.

(mmap = false, direct_io = true)

prompt eval time =     118.47 ms /    16 tokens (    7.40 ms per token,   135.05 tokens per second)
       eval time =   59024.84 ms /   739 tokens (   79.87 ms per token,    12.52 tokens per second)
      total time =   59143.31 ms /   755 tokens

(mmap = false, direct_io = false) Model loading and inference are both slow. That's fine by me since I can use the combination above, but let me know if something needs to be checked.

prompt eval time =     119.79 ms /    16 tokens (    7.49 ms per token,   133.57 tokens per second)
       eval time =   67659.44 ms /   740 tokens (   91.43 ms per token,    10.94 tokens per second)
      total time =   67779.23 ms /   756 tokens

@engrtipusultan

llama-bench does not have the -dio or -ndio switches. By default it now shows mmap = false, direct_io = true, which is the correct combination for me, but I wanted to report it in case you want to add these flags to llama-bench as well. In llama-server and llama-cli it works fine.

@ggerganov
Member

llama-bench does not have the -dio or -ndio switches. By default it now shows mmap = false, direct_io = true.

Do we have a case in which dio = 1 leads to slower inference compared to dio = 0? If not, then there's no need to change llama-bench.

@JTischbein
Contributor Author

@ggerganov So far --no-mmap -dio has led to better performance with some backends, and no perf regressions.

With --mmap enabled, the CUDA backend shows slower CUDA graph launches due to the thread being blocked. With --no-mmap and -dio the thread does not get blocked.
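For context, the alignment constraints are what make direct IO fragile: O_DIRECT requires the file offset, transfer size, and destination buffer address to be block-aligned, which is why a "Bad address" (EFAULT/EINVAL) failure needs a buffered fread fallback. A minimal POSIX sketch of the idea, using hypothetical names rather than the actual llama.cpp helpers:

```cpp
// Hypothetical sketch, not the actual llama.cpp implementation.
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0 // not available on this platform (macOS uses fcntl F_NOCACHE instead)
#endif

// Read `len` bytes at `offs` with O_DIRECT; returns len on success, -1 if
// direct IO is unavailable or the read fails (caller falls back to fread).
static ssize_t read_at_direct(const char * path, void * dst, size_t len, off_t offs) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        return -1; // O_DIRECT unsupported here - fall back to buffered IO
    }
    // O_DIRECT requires block-aligned offset, size, and buffer address, so
    // read an aligned window covering [offs, offs + len) and copy out.
    const size_t align  = 4096;
    const off_t  a_offs = offs & ~(off_t)(align - 1);
    const size_t a_len  = ((size_t)(offs - a_offs) + len + align - 1) & ~(align - 1);
    void * buf = nullptr;
    if (posix_memalign(&buf, align, a_len) != 0) {
        close(fd);
        return -1;
    }
    ssize_t n = pread(fd, buf, a_len, a_offs);
    if (n >= (ssize_t)(offs - a_offs) + (ssize_t)len) {
        memcpy(dst, (char *)buf + (offs - a_offs), len);
        n = (ssize_t)len;
    } else {
        n = -1; // short read / EINVAL / EFAULT: retry with buffered fread
    }
    free(buf);
    close(fd);
    return n;
}
```

A real implementation would reuse one aligned bounce buffer across tensors instead of allocating per read; this sketch only shows why misalignment forces a fallback path.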

Member

@ggerganov ggerganov left a comment


Ok let's merge after CI

@ggerganov ggerganov merged commit 2038101 into ggml-org:master Jan 8, 2026
131 of 137 checks passed
gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 8, 2026
* Adding --direct-io flag for model loading

* Fixing read_raw() calls

* Fixing Windows read_raw_at

* Changing type off_t to size_t for windows and Renaming functions

* disable direct io when mmap is explicitly enabled

* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL

* Fallback to std::fread in case O_DIRECT fails due to bad address

* Windows: remove const keywords and unused functions

* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@jukofyork
Copy link
Collaborator

Just recompiled today and found this is causing a big performance regression for NUMA on my dual Xeon Gold 6248 system, so it might be worth investigating and/or turning this off by default for NUMA too:

  • Before this change I got around 7-8 tokens/s generation using Kimi-K2 (with Q4_K MoE tensors off-loaded to CPU) by always dropping caches before loading.
  • I now only get around half that (~4 tokens/s), and it doesn't appear to be loading the data properly into the NUMA buffers during warmup (i.e. dropping caches containing ~500GB after loading is instant now, rather than taking 20-30s before).
  • Adding --no-direct-io gets me back to the original 7-8 tokens/s generation and everything seems to work as before.

Sorry I can't really help much more as I'm at our holiday home for the next 3 weeks and have 4 young kittens who refuse to let me near the PC! 😼

If it's any help, then here is my script:

#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
CUDA_VISIBLE_DEVICES=0 GGML_OP_OFFLOAD_MIN_BATCH=2560 ~/llama.cpp/build/bin/llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --alias "local/kimi-k2-0905" \
        --jinja \
        --model ~/models/gguf/Kimi-K2-Instruct-0905-Q4_X.gguf \
        --n-gpu-layers 99 \
        --no-direct-io \
        --numa distribute \
        --threads "$(nproc)" \
        --override-tensor exps=CPU \
        --flash-attn 1 \
        --parallel 1 \
        --ctx_size 131072 \
        --cache-type-k f16 \
        --cache-type-v q4_0 \
        --batch-size 14336 \
        --ubatch-size 14336 \
        --cache-ram 65536 \
        --temp 0.6 \
        --min-p 0.01

@jukofyork
Collaborator

Weirdly, it may actually be giving me (a lot) better PP due to improved RAM --> PCI-E 3.0 16x --> GPU bandwidth with the new code!? If nvtop is to be believed, I used to get only around 4-5GB/s out of a theoretical maximum of 16GB/s, but now I seem to be getting significantly more.

I will try logging in remotely using a tablet and Bluetooth keyboard tomorrow and see if I can run some proper experiments to figure it out... I suspect that mmap with --numa distribute somehow lays the model out "nicely" for NUMA TG, while the new code lays it out "nicely" for PCIe transfers.

@jukofyork
Collaborator

It seems to be the NUMA "first-touch" policy causing the problems (the instantaneous cache drop was a red herring; I think it's just because the data isn't stored in the usual page-cache buffers).

It might be worth auto-disabling this when using NUMA, unless a better way can be found that uses at least one thread per NUMA node to do the loading?
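For reference, the first-touch behavior works like this: under Linux's default NUMA policy, a page is physically placed on the node of the thread that first writes it, so a single loader thread pulls the whole model onto its own node. A minimal sketch of per-slice first-touch with one worker per slice (thread-to-node pinning via libnuma or pthread_setaffinity_np is omitted, and the memset stands in for reading the corresponding file range):

```cpp
// Sketch of NUMA-friendly "first touch" loading; hypothetical, not llama.cpp code.
#include <algorithm>
#include <cassert>
#include <cstring>
#include <thread>
#include <vector>

// Zero-fill (first-touch) the buffer with one worker thread per slice, so
// each slice's pages land on the node its worker runs on. Real code would
// pin workers to NUMA nodes and read file data instead of writing zeros.
void first_touch_parallel(char * buf, size_t size, unsigned n_threads) {
    std::vector<std::thread> workers;
    const size_t chunk = (size + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t beg = t * chunk;
        const size_t end = std::min(size, beg + chunk);
        if (beg >= end) {
            break;
        }
        // the first write to each page decides its NUMA placement
        workers.emplace_back([=] { std::memset(buf + beg, 0, end - beg); });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```

The point is that placement is decided by whoever touches the page first, which is why a single O_DIRECT loader thread defeats --numa distribute even when compute later runs on all nodes.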

@jukofyork
Collaborator

Weirdly, it may actually be giving me (a lot) better PP due to improved RAM --> PCI-E 3.0 16x --> GPU bandwidth with the new code!? If nvtop is to be believed, I used to get only around 4-5GB/s out of a theoretical maximum of 16GB/s, but now I seem to be getting significantly more.

Yeah, this doesn't work at all with NUMA: it puts all the tensors on whatever node the loading thread runs on, and everything then has to pass through the QPI interconnect (around 80GB/s for my Xeon Gold 6248 CPUs). Not sure if it's bypassing NUMA altogether, as even numactl --interleave=all didn't seem to make any difference...

So I went back to using a single NUMA node only and these are the results for Kimi-K2 using Q4_K for offloaded MoE tensors:

  • TG: 7-8 tokens/s --> 6 tokens/s
  • PP: 33 tokens/s --> 17 tokens/s

At first these seemed disappointing, but whatever this PR does to lay the tensors out in RAM means that I can now get way better offload PP to the RTX 6000 Ada GPU:

  • RAM --> PCI-E 3.0 16x --> GPU: 4.7GB/s --> 12.7GB/s

and this brought the break-even batch size for GGML_OP_OFFLOAD_MIN_BATCH (see #18535) down from ~2500 tokens to ~650 tokens; as a result, for large prompt ingests I can get 150-200 tokens/s PP with ubatch = 14336, compared to ~75 tokens/s before.


I haven't had a chance to run it yet, but the break-even batch size for GGML_OP_OFFLOAD_MIN_BATCH should drop further for:

  • Deepseek using 8/256 experts per token.
  • GLM-4-MoE using 8/160 experts per token.
  • Qwen-3-MoE using 8/128 experts per token.

Compared to Kimi-K2 using 8/384 experts per token.
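A rough model of that break-even point, under the simplifying assumption that offloading a CPU-resident expert tensor costs one PCIe transfer of its weights per batch: the batch size needed to amortize the transfer scales inversely with PCIe bandwidth, which is roughly consistent with the observed ~2500 --> ~650 token shift above. This is a hypothetical back-of-envelope sketch, not llama.cpp's actual heuristic, and all inputs are illustrative:

```cpp
// Hypothetical break-even model; not llama.cpp's actual offload heuristic.
#include <cassert>

// Tokens needed per batch before shipping a CPU-resident tensor's weights
// over PCIe (once per batch) beats computing that tensor on the CPU.
double break_even_tokens(double weight_gb, double pcie_gb_per_s,
                         double cpu_s_per_tok, double gpu_s_per_tok) {
    const double transfer_s = weight_gb / pcie_gb_per_s;      // one-off cost per batch
    const double saved_s    = cpu_s_per_tok - gpu_s_per_tok;  // saving per token
    return transfer_s / saved_s; // higher bandwidth => lower break-even
}
```

Fewer active experts per token also shrink the weight bytes that must cross the bus per batch, which is why the break-even should drop further for Deepseek, GLM-4-MoE, and Qwen-3-MoE than for Kimi-K2's 8/384.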
