UPSTREAM PR #15667: convert : parse safetensors directly (#111)
Conversation
Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation.

* convert : rename `from_safetensors_meta` to `from_local_tensor`, for consistency with `from_remote_tensor`
Mirrored from ggml-org/llama.cpp#15667
Should fix #15623
(originally targeted #14810, but was rebased)
This replaces the approach from #8482 to avoid using `get_slice`, because it turns out it eagerly memmaps tensors, which means on Windows this uses a lot of memory, and on Linux it inflates the resident set size.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use `os.copy_file_range` when possible to make conversion on COW filesystems very fast (in #15727).

On Linux, when using `memray` (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU `time`, it also reduces the peak resident set size.

The previous behavior observed with `memray` seems to be that `safe_open` puts all of the model into the heap (likely memmapped, since the resident set size is smaller and grows). The new behavior observed with `memray` is more similar to what I thought happened in the first place: bumps of memory usage at each processed tensor, which go back down between tensors.

Here's a table of the "Maximum resident set size (kbytes)" from `time -v` (when using GNU `time`) on a few models:

`$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16`

(table not preserved in this mirror; one of its columns was `master` (kbytes))

Safetensors are already directly parsed since #12820 for remote models. This is similar, but for local models.
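To illustrate why the format is "simple enough" to parse directly, here is a minimal sketch (not the actual code from this PR) of reading a safetensors header: the file starts with an 8-byte little-endian unsigned integer giving the size of a JSON header, followed by the header itself, then the raw tensor data.

```python
import json
import struct

def read_safetensors_header(path):
    # Sketch of direct safetensors parsing (illustrative, not this PR's code).
    # Layout: [8-byte LE header size][JSON header][tensor data].
    with open(path, "rb") as f:
        header_size = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_size))
    # Each entry maps a tensor name to its dtype, shape, and byte range;
    # data_offsets are relative to the end of the JSON header.
    return {
        name: (meta["dtype"], meta["shape"], meta["data_offsets"])
        for name, meta in header.items()
        if name != "__metadata__"
    }
```

Because only the header is read up front, tensor data can then be loaded (or skipped) one range at a time, which is what keeps the peak memory usage low.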
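The `os.copy_file_range` idea mentioned above (for #15727) could look roughly like the following hypothetical helper, which copies a tensor's byte range file-to-file without reading it into user space; on copy-on-write filesystems such as Btrfs or XFS the kernel may reflink the range instead of copying bytes. This is a sketch under the assumption of a Linux host and Python 3.8+, not the PR's implementation.

```python
import os

def copy_tensor_range(src_path, dst_path, offset, length):
    # Hypothetical helper (not from this PR): copy `length` bytes starting
    # at `offset` in src_path to the beginning of dst_path, in-kernel.
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        remaining, src_off, dst_off = length, offset, 0
        while remaining > 0:
            # os.copy_file_range returns the number of bytes actually copied,
            # which may be less than requested, so loop until done.
            copied = os.copy_file_range(src_fd, dst_fd, remaining, src_off, dst_off)
            if copied == 0:
                break
            remaining -= copied
            src_off += copied
            dst_off += copied
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

Combined with the tracked file ranges from direct header parsing, this would let conversion hand whole tensor ranges to the kernel instead of streaming them through Python.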
TODO:
Make sure to read the contributing guidelines before submitting a PR