UPSTREAM PR #15667: convert : parse safetensors directly (#111)
Conversation
Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation.

* convert : rename `from_safetensors_meta` to `from_local_tensor`, for consistency with `from_remote_tensor`
Mirrored from ggml-org/llama.cpp#15667
Should fix #15623
(originally targeted #14810, but was rebased)
This replaces the approach from #8482 to avoid using `get_slice`, because it turns out it eagerly memmaps tensors, which means on Windows this uses a lot of memory, and on Linux it inflates the resident set size.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use `os.copy_file_range` when possible to make conversion on COW filesystems very fast (in #15727).

On Linux, when using `memray` (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU `time`, it also reduces the peak resident set size.

The previous behavior observed with `memray` seems to be that `safe_open` puts all of the model into the heap (likely memmapped, since the resident set size is smaller and grows). The new behavior observed with `memray` is more similar to what I thought happened in the first place: bumps of memory usage at each processed tensor, which go back down between tensors.

Here's a table of the "Maximum resident set size (kbytes)" from `time -v` (when using GNU `time`) on a few models:

`$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16`

(table not preserved in this mirror; one of its columns was `master` (kbytes))

Safetensors are already directly parsed since #12820 for remote models. This is similar, but for local models.
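To illustrate why the format is "simple enough" to parse directly, here is a minimal sketch (not the actual code from this PR) of reading a safetensors header: the file starts with an 8-byte little-endian unsigned integer giving the size of a JSON header, followed by the header itself, then the raw tensor data.

```python
import json
import struct

def read_safetensors_header(path):
    # Sketch of direct safetensors parsing (illustrative, not this PR's code).
    # Layout: [8-byte LE header size][JSON header][tensor data].
    with open(path, "rb") as f:
        header_size = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_size))
    # Each entry maps a tensor name to its dtype, shape, and byte range;
    # data_offsets are relative to the end of the JSON header.
    return {
        name: (meta["dtype"], meta["shape"], meta["data_offsets"])
        for name, meta in header.items()
        if name != "__metadata__"
    }
```

Because only the header is read up front, tensor data can then be loaded (or skipped) one range at a time, which is what keeps the peak memory usage low.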
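The `os.copy_file_range` idea mentioned above (for #15727) could look roughly like the following hypothetical helper, which copies a tensor's byte range file-to-file without reading it into user space; on copy-on-write filesystems such as Btrfs or XFS the kernel may reflink the range instead of copying bytes. This is a sketch under the assumption of a Linux host and Python 3.8+, not the PR's implementation.

```python
import os

def copy_tensor_range(src_path, dst_path, offset, length):
    # Hypothetical helper (not from this PR): copy `length` bytes starting
    # at `offset` in src_path to the beginning of dst_path, in-kernel.
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        remaining, src_off, dst_off = length, offset, 0
        while remaining > 0:
            # os.copy_file_range returns the number of bytes actually copied,
            # which may be less than requested, so loop until done.
            copied = os.copy_file_range(src_fd, dst_fd, remaining, src_off, dst_off)
            if copied == 0:
                break
            remaining -= copied
            src_off += copied
            dst_off += copied
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

Combined with the tracked file ranges from direct header parsing, this would let conversion hand whole tensor ranges to the kernel instead of streaming them through Python.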
TODO:
Make sure to read the contributing guidelines before submitting a PR