Honor mmap setting when using tensor overrides#270
Merged
Conversation
Contributor
Wow sweet! I just got back home and saw this, pulled and rebuilt, and got my custom quant running locally on the 9950X + 96GB DDR5-6400 RAM + 3090TI 24GB! Got about 3 tok/sec generation on a quick initial test. This quant is heavy; amazing that this setup can run it:

```
./build/bin/llama-server \
    --alias ubergarm/DeepSeek-R1-Q2_K_R4 \
    --model /mnt/ai/models/ubergarm/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-Q2_K_R4.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
```

```
...
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type q2_k_r4: 116 tensors
llama_model_loader: - type q3_k_r4: 58 tensors
...
```
The reason why `mmap` was disabled when using tensor overrides is this: when a tensor is overridden to the CPU (e.g. `exps=CPU`), it gets the buffer type returned by `ggml_backend_cpu_buffer_type()`. But the loader checks against `llama_default_buffer_type_cpu(true)` instead to decide whether a buffer is a CPU buffer and hence can be memory mapped. When a GPU backend is present, `llama_default_buffer_type_cpu(true)` returns a different buffer type (`CUDA_Host` in the case of the CUDA backend), so the check fails and `mmap` gets disabled.

This PR fixes that by accepting either `llama_default_buffer_type_cpu(true)` or `ggml_backend_cpu_buffer_type()` as eligible for using `mmap`.

Note, however, that `-rtr` still disables `mmap`, because otherwise the model would be overwritten with the repacked tensors.