Bug: Qwen 3.5 context cache issue. #1383

@Quairon-Nailo

Description

What happened?

I've been getting into Qwen lately, but I've noticed an issue: whenever I regenerate a response, instead of creating a new response from scratch (including the thinking), the new response often continues from the end of the response I'm trying to replace (or does this weird thing where it picks up midway through the previous response). The only way to resolve it is to unload and reload the model, or to use a different prompt to "clean" the cache. Here's an example of what I mean:

Attempt 1 - normal:
Image

Image

Attempt 2 - picks up as if continuing at the end of the thinking section, printing only the </think> tag.

Image

Attempt 3 - continues as if picking up right after attempt 2; it doesn't answer the question, it just appends a note to the end of the previous response.

Image

I'll post the logs of this run in the logs section below.
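For context, here is a minimal sketch (hypothetical, not llama.cpp's actual implementation; `common_prefix_len` and `reuse_cache` are made-up names) of how I'd expect prompt-cache reuse to behave on regeneration: only the longest common token prefix between the cached sequence and the newly submitted prompt should be kept as `n_past`, and the stale suffix (the previous completion) should be discarded before decoding. The bug behaves as if that stale suffix were kept.

```python
# Hedged sketch of expected cache-trimming on regeneration.
# Assumption: tokens are plain ints; real servers work on tokenized prompts.

def common_prefix_len(cached, prompt):
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def reuse_cache(cached, prompt):
    """Return (n_past, tokens_to_decode) for the incoming request.

    Everything in the cache past the common prefix (e.g. the previous
    completion) must be dropped, not treated as already-decoded context.
    """
    n_past = common_prefix_len(cached, prompt)
    return n_past, prompt[n_past:]

# Regeneration resubmits the identical prompt, so the cached completion
# tokens (40, 41, 42 here) should all be discarded:
cached = [1, 2, 3, 40, 41, 42]   # prompt tokens + previous completion
prompt = [1, 2, 3]               # same prompt, resubmitted
n_past, to_decode = reuse_cache(cached, prompt)
# Expected: n_past == 3 and nothing stale left to reuse beyond the prompt.
```

(In practice a server would also re-evaluate at least the last prompt token to get fresh logits; that detail is omitted here. The symptom in this report looks as if `n_past` covered the old completion as well, so generation resumes mid-response.)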

Name and Version

llama-server --version
version: 4266 (344688c)
built with cc (GCC) 15.2.1 20260209 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

❯ Q3.5-Q5_K_M.sh
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24103 MiB
=============================== NCCL main communicator initialized
INFO [                    main] build info | tid="139723659595776" timestamp=1772931803 build=4266 commit="344688ce"
INFO [                    main] system info | tid="139723659595776" timestamp=1772931803 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CUDA0: using device CUDA0 - 23737 MiB free
CUDA1: using device CUDA1 - 22363 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 50 key-value pairs and 879 tensors from /mnt/Speed/AI/Models/AesSedai/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-Q5_K_M-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5 122B A10B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5
llama_model_loader: - kv   7:                         general.size_label str              = 122B-A10B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-1...
llama_model_loader: - kv  10:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  11:                      qwen35moe.block_count u32              = 48
llama_model_loader: - kv  12:                   qwen35moe.context_length u32              = 262144
llama_model_loader: - kv  13:                 qwen35moe.embedding_length u32              = 3072
llama_model_loader: - kv  14:             qwen35moe.attention.head_count u32              = 32
llama_model_loader: - kv  15:          qwen35moe.attention.head_count_kv u32              = 2
llama_model_loader: - kv  16:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  17:                   qwen35moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  18: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:                     qwen35moe.expert_count u32              = 256
llama_model_loader: - kv  20:                qwen35moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:             qwen35moe.attention.key_length u32              = 256
llama_model_loader: - kv  22:           qwen35moe.attention.value_length u32              = 256
llama_model_loader: - kv  23:       qwen35moe.expert_feed_forward_length u32              = 1024
llama_model_loader: - kv  24: qwen35moe.expert_shared_feed_forward_length u32              = 1024
llama_model_loader: - kv  25:                  qwen35moe.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  26:                   qwen35moe.ssm.state_size u32              = 128
llama_model_loader: - kv  27:                  qwen35moe.ssm.group_count u32              = 16
llama_model_loader: - kv  28:               qwen35moe.ssm.time_step_rank u32              = 64
llama_model_loader: - kv  29:                   qwen35moe.ssm.inner_size u32              = 8192
llama_model_loader: - kv  30:          qwen35moe.full_attention_interval u32              = 4
llama_model_loader: - kv  31:             qwen35moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 7
llama_model_loader: - kv  43:                      quantize.imatrix.file str              = /mnt/srv/snowdrift/fp16/Qwen3.5-122B-...
llama_model_loader: - kv  44:                   quantize.imatrix.dataset str              = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv  45:             quantize.imatrix.entries_count u32              = 612
llama_model_loader: - kv  46:              quantize.imatrix.chunks_count u32              = 100
llama_model_loader: - kv  47:                                   split.no u16              = 0
llama_model_loader: - kv  48:                        split.tensors.count i32              = 879
llama_model_loader: - kv  49:                                split.count u16              = 3
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  374 tensors
llama_model_loader: - type q5_K:   96 tensors
llama_model_loader: - type q6_K:   48 tensors
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen35moe
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 0
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 40
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: mrope sections   = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv       = 4
llm_load_print_meta: ssm_d_inner      = 8192
llm_load_print_meta: ssm_d_state      = 128
llm_load_print_meta: ssm_dt_rank      = 64
llm_load_print_meta: ssm_n_group      = 16
llm_load_print_meta: model type       = 122B.A10B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 122.112 B
llm_load_print_meta: model size       = 85.224 GiB (5.995 BPW) 
llm_load_print_meta: repeating layers = 83.714 GiB (5.963 BPW, 120.586 B parameters)
llm_load_print_meta: general.name     = Qwen3.5 122B A10B
print_info: vocab type       = BPE
print_info: n_vocab          = 248320
print_info: n_merges         = 247587
print_info: BOS token        = 11 ','
print_info: EOS token        = 248046 '<|im_end|>'
print_info: EOT token        = 248046 '<|im_end|>'
print_info: PAD token        = 248044 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 248060 '<|fim_prefix|>'
print_info: FIM SUF token    = 248062 '<|fim_suffix|>'
print_info: FIM MID token    = 248061 '<|fim_middle|>'
print_info: FIM PAD token    = 248063 '<|fim_pad|>'
print_info: FIM REP token    = 248064 '<|repo_name|>'
print_info: FIM SEP token    = 248065 '<|file_sep|>'
print_info: EOG token        = 248044 '<|endoftext|>'
print_info: EOG token        = 248046 '<|im_end|>'
print_info: EOG token        = 248063 '<|fim_pad|>'
print_info: EOG token        = 248064 '<|repo_name|>'
print_info: EOG token        = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size =    5.12 MiB
Tensor blk.15.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight (size = 630.00 MiB) buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight (size = 528.00 MiB) buffer type overriden to CPU
================================ max_gpu = 0
Estimated model buffer size per device:
    Device 0:  11758.55 MiB
    Device 1:  15238.12 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:        CPU buffer size = 55638.00 MiB
llm_load_tensors:  CUDA_Host buffer size =   772.97 MiB
llm_load_tensors: CUDA_Split buffer size = 26996.75 MiB
llm_load_tensors:      CUDA0 buffer size =  2120.53 MiB
llm_load_tensors:      CUDA1 buffer size =  1886.58 MiB
............................................................................................... =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
..~ggml_backend_cuda_context: have 0 graphs
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
..~ggml_backend_cuda_context: have 0 graphs
.
Adjust batch size for mtmd: u_batch = 8096, batch = 8096
llama_init_from_model: n_ctx         = 36864
llama_init_from_model: n_batch       = 8096
llama_init_from_model: n_ubatch      = 8096
llama_init_from_model: flash_attn    = 1
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 1
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 10000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: cuda_params   = fusion=1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
llama_kv_cache_init: CUDA_Split KV buffer size =   864.01 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    62.11 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =    86.95 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  432 MiB
    Device 1:  432 MiB
llama_init_from_model: KV self size  =  864.00 MiB, K (f16):  432.00 MiB, V (f16):  432.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.95 MiB
llama_init_from_model:      CUDA0 compute buffer size =  7811.38 MiB
llama_init_from_model:      CUDA1 compute buffer size =  1818.62 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   664.34 MiB
llama_init_from_model: graph nodes  = 4344
llama_init_from_model: graph splits = 242
llama_init_from_model: enabling only_active_experts scheduling
XXXXXXXX Split Mode Graph Scheduling is FORCED despite tensor overrides due to user choice.
XXXXXXXX It may or might NOT infer properly due to unsupported combinations between SMGS and every possible tensor overrides.
clip_model_loader: model name:   Qwen3.5 122B A10B
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    334
clip_model_loader: n_kv:         29

clip_model_loader: has vision encoder
clip_ctx: have 3 back-ends:
  0:  CPU
  1:  CUDA0
  2:  CUDA1
clip_ctx: CLIP using CPU backend
load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     3072

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       0
load_hparams: image_min_pixels:   1048576 (custom value)
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         590.51 MiB
load_hparams: metadata size:      0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta:        CPU compute buffer size =   558.58 MiB
alloc_compute_meta: graph splits = 1, nodes = 3739
warmup: flash attention is disabled
INFO [              load_model] loaded multimodal model, '%s'
 | ="/mnt/Speed/AI/Models/AesSedai/Qwen3.5-122B-A10B-GGUF/mmproj-Qwen3.5-122B-A10B-Q8_0.gguf"
WARN [              load_model] %s
 | ="ctx_shift is not supported by multimodal, it will be disabled"
INFO [                    init] initializing slots | tid="139723659595776" timestamp=1772931850 n_slots=1
INFO [                    init] new slot | tid="139723659595776" timestamp=1772931850 id_slot=0 n_ctx_slot=36864
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
no implementations specified for speculative decoding
slot         init: id  0 | task -1 | speculative decoding context not initialized
prompt cache is disabled - use `--cache-ram N` to enable it
INFO [                    main] model loaded | tid="139723659595776" timestamp=1772931850
INFO [                    main] chat template | tid="139723659595776" timestamp=1772931850 chat_template="{%- set image_count = namespace(value=0) %}\n{%- set video_count = namespace(value=0) %}\n{%- macro render_content(content, do_vision_count, is_system_content=false) %}\n    {%- if content is string %}\n        {{- content }}\n    {%- elif content is iterable and content is not mapping %}\n        {%- for item in content %}\n            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}\n                {%- if is_system_content %}\n                    {{- raise_exception('System message cannot contain images.') }}\n                {%- endif %}\n                {%- if do_vision_count %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                {%- endif %}\n                {%- if add_vision_id %}\n                    {{- 'Picture ' ~ image_count.value ~ ': ' }}\n                {%- endif %}\n                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}\n            {%- elif 'video' in item or item.type == 'video' %}\n                {%- if is_system_content %}\n                    {{- raise_exception('System message cannot contain videos.') }}\n                {%- endif %}\n                {%- if do_vision_count %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                {%- endif %}\n                {%- if add_vision_id %}\n                    {{- 'Video ' ~ video_count.value ~ ': ' }}\n                {%- endif %}\n                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}\n            {%- elif 'text' in item %}\n                {{- item.text }}\n            {%- else %}\n                {{- raise_exception('Unexpected item type in content.') }}\n            {%- endif %}\n        {%- endfor %}\n    {%- elif content is none or content is undefined %}\n        {{- '' }}\n    {%- else %}\n        {{- raise_exception('Unexpected content type.') }}\n    {%- 
endif %}\n{%- endmacro %}\n{%- if not messages %}\n    {{- raise_exception('No messages provided.') }}\n{%- endif %}\n{%- if tools and tools is iterable and tools is not mapping %}\n    {{- '<|im_start|>system\\n' }}\n    {{- \"# Tools\\n\\nYou have access to the following functions:\\n\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\" }}\n    {{- '\\n\\nIf you choose to call a function ONLY reply in the following format with NO suffix:\\n\\n<tool_call>\\n<function=example_function_name>\\n<parameter=example_parameter_1>\\nvalue_1\\n</parameter>\\n<parameter=example_parameter_2>\\nThis is the value for the second parameter\\nthat can span\\nmultiple lines\\n</parameter>\\n</function>\\n</tool_call>\\n\\n<IMPORTANT>\\nReminder:\\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\\n- Required parameters MUST be specified\\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\\n</IMPORTANT>' }}\n    {%- if messages[0].role == 'system' %}\n        {%- set content = render_content(messages[0].content, false, true)|trim %}\n        {%- if content %}\n            {{- '\\n\\n' + content }}\n        {%- endif %}\n    {%- endif %}\n    {{- '<|im_end|>\\n' }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {%- set content = render_content(messages[0].content, false, true)|trim %}\n        {{- '<|im_start|>system\\n' + content + '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 
%}\n    {%- if ns.multi_step_tool and message.role == \"user\" %}\n        {%- set content = render_content(message.content, false)|trim %}\n        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}\n            {%- set ns.multi_step_tool = false %}\n            {%- set ns.last_query_index = index %}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if ns.multi_step_tool %}\n    {{- raise_exception('No user query found in messages.') }}\n{%- endif %}\n{%- for message in messages %}\n    {%- set content = render_content(message.content, true)|trim %}\n    {%- if message.role == \"system\" %}\n        {%- if not loop.first %}\n            {{- raise_exception('System message must be at the beginning.') }}\n        {%- endif %}\n    {%- elif message.role == \"user\" %}\n        {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {%- set reasoning_content = '' %}\n        {%- if message.reasoning_content is string %}\n            {%- set reasoning_content = message.reasoning_content %}\n        {%- else %}\n            {%- if '</think>' in content %}\n                {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n                {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n            {%- endif %}\n        {%- endif %}\n        {%- set reasoning_content = reasoning_content|trim %}\n        {%- if loop.index0 > ns.last_query_index %}\n            {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content + '\\n</think>\\n\\n' + content }}\n        {%- else %}\n            {{- '<|im_start|>' + message.role + '\\n' + content }}\n        {%- endif %}\n        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if 
tool_call.function is defined %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {%- if loop.first %}\n                    {%- if content|trim %}\n                        {{- '\\n\\n<tool_call>\\n<function=' + tool_call.name + '>\\n' }}\n                    {%- else %}\n                        {{- '<tool_call>\\n<function=' + tool_call.name + '>\\n' }}\n                    {%- endif %}\n                {%- else %}\n                    {{- '\\n<tool_call>\\n<function=' + tool_call.name + '>\\n' }}\n                {%- endif %}\n                {%- if tool_call.arguments is defined %}\n                    {%- for args_name, args_value in tool_call.arguments|items %}\n                        {{- '<parameter=' + args_name + '>\\n' }}\n                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}\n                        {{- args_value }}\n                        {{- '\\n</parameter>\\n' }}\n                    {%- endfor %}\n                {%- endif %}\n                {{- '</function>\\n</tool_call>' }}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.previtem and loop.previtem.role != \"tool\" %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- content }}\n        {{- '\\n</tool_response>' }}\n        {%- if not loop.last and loop.nextitem.role != \"tool\" %}\n            {{- '<|im_end|>\\n' }}\n        {%- elif loop.last %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- else %}\n        {{- raise_exception('Unexpected message role.') }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n    {%- if enable_thinking is defined and enable_thinking is false 
%}\n        {{- '<think>\\n\\n</think>\\n\\n' }}\n    {%- else %}\n        {{- '<think>\\n' }}\n    {%- endif %}\n{%- endif %}"
INFO [                    main] chat template | tid="139723659595776" timestamp=1772931850 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="139723659595776" timestamp=1772931850 n_threads_http="31" port="8085" hostname="0.0.0.0"
INFO [              slots_idle] all slots are idle | tid="139723659595776" timestamp=1772931850
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="139723659595776" timestamp=1772931875 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139723659595776" timestamp=1772931875 id_slot=0 id_task=0 p0=0
INFO [           release_slots] slot released | tid="139723659595776" timestamp=1772932008 id_slot=0 id_task=0 n_ctx=36864 n_past=2852 n_system_tokens=0 n_cache_tokens=2852 truncated=false
slot print_timing: id  0 | task 0 | 
prompt eval time =     763.56 ms /    99 tokens (    7.71 ms per token,   129.66 tokens per second)
       eval time =  132028.06 ms /  2754 tokens (   47.94 ms per token,    20.86 tokens per second)
      total time =  132791.63 ms /  2853 tokens
INFO [              slots_idle] all slots are idle | tid="139723659595776" timestamp=1772932008
INFO [      log_server_request] request | tid="139613457203200" timestamp=1772932008 remote_addr="127.0.0.1" remote_port=37274 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 2852, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.03, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="139723659595776" timestamp=1772932075 id_slot=0 id_task=2755
======== Cache: cache_size = 2852, n_past0 =  99, n_past1 =  99, n_past_prompt1 = 99,  n_past2 =  99, n_past_prompt2 =  99
Common part does not match fully
cache : 
<|im_start|>assistant
<think>
Thinking Process:

1.  **Analyze the Request:**
    *   User's situation: Car is dirty
prompt: 
<|im_start|>assistant

INFO [    batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="139723659595776" timestamp=1772932075 id_slot=0 id_task=2755
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139723659595776" timestamp=1772932075 id_slot=0 id_task=2755 p0=98
slot print_timing: id  0 | task 2755 | 
prompt eval time =      51.67 ms /     1 tokens (   51.67 ms per token,    19.35 tokens per second)
       eval time =    8922.17 ms /   193 tokens (   46.23 ms per token,    21.63 tokens per second)
      total time =    8973.84 ms /   194 tokens
INFO [           release_slots] slot released | tid="139723659595776" timestamp=1772932084 id_slot=0 id_task=2755 n_ctx=36864 n_past=291 n_system_tokens=0 n_cache_tokens=291 truncated=false
INFO [              slots_idle] all slots are idle | tid="139723659595776" timestamp=1772932084
INFO [      log_server_request] request | tid="139613448810496" timestamp=1772932084 remote_addr="127.0.0.1" remote_port=48302 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 291, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.34, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="139723659595776" timestamp=1772932172 id_slot=0 id_task=2949
======== Cache: cache_size = 291, n_past0 =  99, n_past1 =  99, n_past_prompt1 = 99,  n_past2 =  99, n_past_prompt2 =  99
Common part does not match fully
cache : 
<|im_start|>assistant
</think>

For a distance of only **100 meters**, you should definitely **drive** the car to the car wash.
prompt: 
<|im_start|>assistant

INFO [    batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="139723659595776" timestamp=1772932172 id_slot=0 id_task=2949
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="139723659595776" timestamp=1772932172 id_slot=0 id_task=2949 p0=98
slot print_timing: id  0 | task 2949 | 
prompt eval time =      49.60 ms /     1 tokens (   49.60 ms per token,    20.16 tokens per second)
       eval time =    1461.38 ms /    32 tokens (   45.67 ms per token,    21.90 tokens per second)
      total time =    1510.99 ms /    33 tokens
INFO [           release_slots] slot released | tid="139723659595776" timestamp=1772932174 id_slot=0 id_task=2949 n_ctx=36864 n_past=130 n_system_tokens=0 n_cache_tokens=130 truncated=false
INFO [              slots_idle] all slots are idle | tid="139723659595776" timestamp=1772932174
INFO [      log_server_request] request | tid="139613440417792" timestamp=1772932174 remote_addr="127.0.0.1" remote_port=45188 status=200 method="POST" path="/v1/chat/completions" params={}

The command I used to run it is:

#!/bin/bash
llama-server \
        --host 0.0.0.0 \
        --port 8085 \
        --model "/mnt/Speed/AI/Models/AesSedai/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-Q5_K_M-00001-of-00003.gguf" \
        --mmproj "/mnt/Speed/AI/Models/AesSedai/Qwen3.5-122B-A10B-GGUF/mmproj-Qwen3.5-122B-A10B-Q8_0.gguf" \
        --no-mmproj-offload \
        --image-min-tokens 1024 \
        -a Qwen3.5 \
        -b 8096 \
        -ub 8096 \
        --threads 16 \
        --ctx-size 36864 \
        --n-gpu-layers 999 \
        -ot "(1[5-9]|[2-9][0-9])\..*_exps.*=CPU" \
        --no-mmap \
        -fa on \
        -sm graph \
        -ts 130,191 \
        -np 1 \
        --ctx-checkpoints 0 \
        -smgs \
        -cram 0 \
        -cuda fusion=1

I've added --ctx-checkpoints 0 in case that was causing the issue, but it doesn't seem to be the case.
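From the "Common part does not match fully" lines above, my guess is that on regeneration the reusable part of the KV cache isn't being cut back to the common token prefix, so tokens from the previous answer (the old thinking block or its tail) leak into the new completion. A minimal sketch of the behavior I'd expect instead (hypothetical Python, not llama.cpp's actual code; `common_prefix_len` and `reuse_cache` are made-up names for illustration):

```python
# Hypothetical sketch of prefix-based cache reuse: the only part of the
# KV cache that may be reused is the longest common token prefix between
# the cached sequence and the new prompt; everything past it must be
# discarded before generation starts.

def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Length of the longest shared token prefix."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def reuse_cache(cached: list[int], prompt: list[int]) -> tuple[list[int], int]:
    """Truncate the cache to the shared prefix; return (kept_cache, n_reused)."""
    n = common_prefix_len(cached, prompt)
    return cached[:n], n

# On a regeneration the new prompt is a strict prefix of the cached
# sequence (it ends right after "<|im_start|>assistant\n"), so the whole
# prompt matches -- but the old answer tokens past that prefix must be
# dropped, not continued from.
cached = [1, 2, 3, 10, 11, 12]   # prompt tokens + previous answer tokens
prompt = [1, 2, 3]               # same prompt, regenerated
kept, n = reuse_cache(cached, prompt)
assert kept == [1, 2, 3] and n == 3
```

If the truncation step is skipped (or the wrong boundary is used), the sampler starts from a cache that still contains the old answer, which matches what I'm seeing in attempts 2 and 3.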
