
Honor mmap setting when using tensor overrides#270

Merged
ikawrakow merged 1 commit into main from ik/tensor_override_honor_mmap
Mar 19, 2025

Conversation

@ikawrakow
Owner

The reason why mmap was disabled when using tensor overrides is this:

  • When the command-line argument is parsed (and the override buffer is set to CPU), we get the buffer type returned by ggml_backend_cpu_buffer_type().
  • The tensor loading logic instead uses llama_default_buffer_type_cpu(true) to decide whether a buffer is a CPU buffer and hence can be memory mapped.
  • When CUDA (or some other backend) is enabled, llama_default_buffer_type_cpu(true) returns a different buffer type (CUDA_Host in the case of the CUDA backend).
  • As a result, tensors set to be stored in the CPU memory buffer are not memory mapped.

This PR fixes that by treating a buffer type as eligible for mmap if it is either llama_default_buffer_type_cpu(true) or ggml_backend_cpu_buffer_type(), as sketched below.

Note, however, that -rtr still disables mmap because otherwise the model would be overwritten with the repacked tensors.
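
A minimal sketch of the resulting check (the helper name buft_is_mmap_eligible is hypothetical; the two buffer-type functions are the ones named above, and the actual code in the PR may be organized differently):

    // Hypothetical helper illustrating the mmap eligibility check.
    // A tensor override of "CPU" carries the type returned by
    // ggml_backend_cpu_buffer_type(), while the loader previously compared
    // only against llama_default_buffer_type_cpu(true), which becomes
    // CUDA_Host when the CUDA backend is enabled. Accept either for mmap.
    static bool buft_is_mmap_eligible(ggml_backend_buffer_type_t buft) {
        return buft == llama_default_buffer_type_cpu(true)
            || buft == ggml_backend_cpu_buffer_type();
    }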

@ikawrakow merged commit 127c6ee into main on Mar 19, 2025
@ubergarm
Contributor

Wow sweet! I just got back home and saw this, pulled and rebuilt, and got my custom quant running locally on the 9950X + 96GB DDR5-6400 RAM + 3090TI 24GB! Got about 3 tok/sec generation on a quick initial test.

This quant is heavy (q8_0 on the GPU offload tensors) but still fits 32k context with enough left over for X windows! Better perplexity than the unsloth UD-Q2_K_XL too.

Amazing that mmap() and the Linux page cache can serve ~238GiB of model weights off a PCIe Gen 5 Crucial T700 2TB NVMe and 2x48GB of tuned DIMMs.

This setup might benefit from -ser 6,1 too! Plenty to try out, thanks!

./build/bin/llama-server \
    --alias ubergarm/DeepSeek-R1-Q2_K_R4 \
    --model /mnt/ai/models/ubergarm/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-Q2_K_R4.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

...
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  612 tensors
llama_model_loader: - type q2_k_r4:  116 tensors
llama_model_loader: - type q3_k_r4:   58 tensors
...

