UPSTREAM PR #17580: Add safetensors support by loci-dev · Pull Request #351 · auroralabs-loci/llama.cpp

loci-dev · 2025-11-28T20:36:16Z

So we can load these natively just like gguf

loci-review · 2025-11-28T21:23:35Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #351: Safetensors Support Implementation

This PR introduces 1,663 lines of new code across 11 files to add safetensors format support. The implementation is incomplete and non-functional, with all model loading functions returning "not yet implemented" errors. No existing code paths are modified, resulting in zero performance impact on current operations.

Key Findings

Performance-Critical Areas Impact:

The changes do not affect any performance-critical functions identified in the project summary. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Model loading functions (llama_model_load_from_file, llama_init_from_model) are unchanged. Memory management (llama_memory_clear, llama_kv_cache operations) and batch processing (llama_batch_init, llama_decode) show no modifications.

Tokens Per Second Impact:

No impact on inference throughput. The tokenization and inference pipeline remains untouched. Functions responsible for token processing (llama_tokenize, llama_detokenize, llama_decode, llama_encode) show no changes in response time or throughput. The reference benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U) would maintain current tokens per second performance.

Power Consumption Analysis:

Analysis shows a 10.90% increase in estimated power consumption for build.bin.libllama.so (214,109 nJ vs 193,066 nJ baseline, +21,043 nJ absolute change). This increase is attributed to STL container operations showing throughput regressions:

std::vector<size_t>::empty() +134 ns throughput
std::back_inserter<std::vector> +24 ns throughput
std::vector::back() +29 ns throughput
std::vector<llm_symbol>::end() +24 ns throughput

Other binaries show minimal changes: llama-tts (+0.07%), llama-gguf-split (+0.03%), llama-quantize (+0.02%), with llama-run (-0.10%) and llama-cvector-generator (-0.08%) showing slight improvements.

Code Implementation Analysis:

The PR adds infrastructure for parsing safetensors files (llama-safetensors.cpp, 398 lines), HuggingFace config parsing (llama-hf-config.cpp, 220 lines), type conversion utilities (llama-safetensors-types.cpp, 157 lines), and tensor name mapping (llama-safetensors-loader.cpp, 271 lines). The model builder (llama-model-from-safetensors.cpp, 218 lines) defines an 8-step loading pipeline but implements only steps 1-3. Steps 4-8 (create_model_structure, allocate_tensors, load_tensor_data, init_vocabulary, finalize_model) return false with error messages.

The implementation uses C-style FILE* operations for file I/O, nlohmann/json for parsing, and std::regex for tensor name mapping. Type conversion functions support F32, F16, BF16, I32, I16, I8 formats with element-wise loops for conversions.

loci-review · 2025-11-29T14:17:05Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #351: Safetensors Support

Overview

PR #351 introduces safetensors model loading capability through 11 new files (2296 lines). This is an additive feature with no modifications to existing inference paths. The performance analysis reveals no impact on runtime inference performance, as the new code affects only the model loading phase.

Key Findings

Inference Performance Impact

No impact on tokens per second. The safetensors loading path does not modify any inference-critical functions:

llama_decode - unchanged
llama_encode - unchanged
llama_tokenize - unchanged

All inference operations use the same GGML backend and tensor structures regardless of whether the model was loaded from GGUF or safetensors format. Once loaded, model execution is identical.

Model Loading Performance

The new safetensors loader exhibits different characteristics compared to GGUF:

Type Conversion Operations:

F32→F16 conversion: element-wise loop processing 16M-2B elements per model
F64→F32 downcast: similar element-wise processing
Direct memcpy for matching types (F32→F32, F16→F16)

For a 7B parameter model, type conversion adds approximately 10000-30000 ms to load time. This is a one-time cost during model initialization and does not affect subsequent inference.

Tensor Name Mapping:

Uses std::regex for pattern matching on 300+ tensor names
Adds 300-500 ms overhead during load
String concatenation for name construction

File I/O Pattern:

Sequential fread operations with fseek for tensor data
Temporary buffer allocation per tensor (peak memory = model size + largest tensor)
No memory-mapped file support

Power Consumption Analysis

Binary-level impact:

libllama.so: No change (0.0%) - safetensors code is separate module
New binaries: No new executables added, only library code

The safetensors loading functions are not included in the power consumption baseline as they represent new, optional code paths. When active, the loading phase will consume additional CPU cycles for type conversion and file I/O, but this is transient and does not affect steady-state inference power consumption.

Implementation Status

Incomplete vocabulary loading: The init_vocabulary() function returns success without loading tokenizer data. Models load structurally but lack vocabulary for text processing. This affects the usability of safetensors-loaded models but does not impact performance metrics of existing GGUF-based workflows.

Performance-Critical Areas

Model Loading Module:

New alternative path for safetensors format
Existing GGUF path unchanged
No shared code between loaders

Memory Management Module:

Uses standard ggml_backend_alloc_ctx_tensors() allocation
Same backend buffer management as GGUF
No changes to KV cache or memory recurrent systems

Token Processing Module:

Vocabulary initialization incomplete
No changes to existing tokenization functions
Safetensors path does not affect active tokenization performance

The implementation is architecturally sound as an isolated feature addition. The performance characteristics differ from GGUF loading but do not regress existing functionality. Inference performance remains unchanged as the new code operates exclusively in the model initialization phase.

loci-review · 2025-11-30T14:22:22Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #351: Safetensors Support

Condition Assessment: Condition 1 applies - No meaningful performance impact from code changes.

Summary

PR #351 adds native safetensors format support through 3,055 lines of new code across 20 files. The observed performance variations are artifacts of build differences rather than functional regressions. The STL iterator functions showing +130-226% changes (60-195 ns absolute) are compiler optimization issues affecting trivial inline operations. The common_get_hf_file function gained +1,333,000 ns response time due to new safetensors detection logic (additional HTTP API call and JSON parsing), which executes only during initial model download, not inference. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Tokens per second is unaffected. Power consumption increased 13% in libllama.so due to new safetensors loading infrastructure, but this impacts model loading time only, not inference throughput.

So we can load these natively just like gguf Signed-off-by: Eric Curtin <[email protected]>

loci-review · 2025-11-30T18:20:23Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #351 - Safetensors Support

Project: llama.cpp
PR #351: Add safetensors support
Scope: 20 files changed (+3,322 lines, -24 lines)

Analysis Overview

This PR adds native safetensors model loading capability alongside existing GGUF support. The implementation introduces 5 new source files and modifies 3 existing modules in the model loading and download subsystems.

Key Findings

Impact on Performance-Critical Areas

Model Loading Module:

common_get_hf_file: +1,312,000 ns response time increase
- New safetensors detection logic adds HTTP request to HF API /tree/main endpoint
- JSON parsing for file list enumeration
- Extension checking for .safetensors files
- Impact: One-time cost during model initialization, not in inference path
~common_hf_file_res: +285 ns response time increase
- New std::vector<std::string> safetensors_files member added to struct
- Vector destructor deallocates heap memory for file list
- Impact: Minimal, occurs only during cleanup after model loading
common_params_handle_model: +912 ns throughput increase
- New branching logic for safetensors vs GGUF download paths
- Directory creation and multi-file download list construction
- String manipulation for path sanitization
- Impact: One-time cost during model parameter processing

Token Processing Module:

llama_vocab::load_from_hf_tokenizer: New function (224 lines)
- Parses HuggingFace tokenizer.json format
- Loads vocabulary, BPE merges, and special tokens
- Builds token_to_piece cache with atomic swap
- Impact: Load-time only, runtime tokenization unchanged

Inference Path Analysis:

llama_decode: No modifications
llama_encode: No modifications
llama_tokenize: No modifications
llama_batch_*: No modifications

Tokens Per Second Impact: Zero. The PR does not modify any inference or tokenization runtime functions. All changes are isolated to model loading and initialization paths. Using the reference that 2 ms slower llama_decode results in 7% tokens per second reduction, this PR introduces zero inference slowdown as llama_decode response time is unchanged.

Exception: Debug logging added to tools/run/run.cpp introduces fprintf calls in the generation loop. This adds 5,000-20,000 ns per token overhead, which would reduce tokens per second by approximately 10-30% for the reference model. This debug code should be removed.

Power Consumption Analysis

libllama.so: +26,189 nJ (+13.57%)

PR-attributable: ~3,000-6,000 nJ (~1.5-3%)
- New safetensors loading functions compiled into library
- HuggingFace tokenizer parsing functions
- Type conversion utilities
Non-PR attributable: ~20,000-23,000 nJ (~10-12%)
- STL iterator functions show debug-mode compilation pattern
- Functions like std::_Rb_tree::end(), std::vector::empty() show 150-226% throughput increases
- Root cause: Build configuration issue, not code changes

llama-cvector-generator: +219 nJ (+0.10%)

Download module changes: common_get_hf_file and ~common_hf_file_res
Impact: Negligible for inference workloads

llama-tts: +296 nJ (+0.13%)

Same download module impact as llama-cvector-generator
Impact: Negligible for inference workloads

llama-bench, llama-run, llama-quantize, llama-tokenize: <0.1% change

Minimal to no impact from PR changes

STL Function Regressions (Non-PR Related)

Eight STL functions show significant regressions unrelated to code changes:

std::_Rb_tree::end(): +135 ns (single basic block, 2→7 instructions)
std::_Rb_tree_const_iterator::_M_const_cast(): +131 ns (single basic block, 2→7 instructions)
std::vector::empty(): +134 ns (trivial operation with stack frame overhead)
std::vector::back(): +129 ns (similar pattern)
make_move_iterator: +117 ns (identity function with 10 instructions)
back_inserter: +31 ns (wrapper with unnecessary stack operations)

CFG analysis confirms these functions maintain single basic block structure with no control flow changes. Assembly comparison shows debug-mode compilation pattern with unnecessary stack frame setup and redundant store-reload cycles. This accounts for 75-88% of the observed power consumption increase in libllama.so.

Code Changes Summary

New Functionality:

Safetensors file parser with metadata validation
HuggingFace config.json parser with architecture detection
Tensor name mapper for HF→llama.cpp naming conventions
Model builder with 10-step loading pipeline
Type conversion utilities for safetensors→GGML formats
HuggingFace tokenizer.json loader
Multi-file download orchestration for safetensors models

Modified Functionality:

common_get_hf_file: Fallback to safetensors detection when GGUF not found
common_params_handle_model: Branch for safetensors vs GGUF download paths
llama_model_load_from_file_impl: Format detection and routing
llama_model: New methods for device initialization and buffer registration

Correctness Considerations:

Head permutation applied to Q/K attention weights reverses HF training-time transformation
Dimension reversal for PyTorch→GGML tensor layout conversion
Special token detection from multiple sources (tokenizer.json, tokenizer_config.json)

Inference Performance Impact

Runtime Functions: Unaffected

Tokenization algorithms unchanged (same BPE/WordPiece/Unigram implementations)
Inference graph construction unchanged
Batch processing unchanged
KV cache management unchanged
Backend computation unchanged

Load-Time Functions: Extended

Model loading: +10-20% for safetensors format (additional parsing and conversion)
Vocabulary loading: +50-100 ms for HF tokenizer format (JSON parsing)
Impact: One-time cost, amortized over inference lifetime

Generation Loop: Debug logging regression

fprintf calls add 5,000-20,000 ns per token
Estimated tokens per second reduction: 10-30% for reference model
This is the only PR change affecting inference performance

…afetensors

This reverts commit 98cb7a6.

loci-dev temporarily deployed to PROD__AL_DEMO November 28, 2025 20:36 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 3 times, most recently from f077805 to eec18ea Compare November 29, 2025 13:13

loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from ff29a86 to a963646 Compare November 29, 2025 13:37

loci-dev temporarily deployed to PROD__AL_DEMO November 29, 2025 13:37 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 7 times, most recently from f96421a to 1854a53 Compare November 30, 2025 13:13

loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from a963646 to 43efc4d Compare November 30, 2025 13:37

loci-dev temporarily deployed to PROD__AL_DEMO November 30, 2025 13:37 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 1854a53 to 1b177fe Compare November 30, 2025 15:08

Add safetensors support

34c53c1

So we can load these natively just like gguf Signed-off-by: Eric Curtin <[email protected]>

loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from 43efc4d to 34c53c1 Compare November 30, 2025 17:35

loci-dev temporarily deployed to PROD__AL_DEMO November 30, 2025 17:35 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 7 times, most recently from 333626d to 82b1c0b Compare December 1, 2025 19:10

loci-dev force-pushed the main branch 27 times, most recently from e81a7eb to 806b364 Compare December 5, 2025 18:11

DajanaV and others added 3 commits December 17, 2025 14:09

Apply overlay (.github from overlay)

98cb7a6

Merge branch 'main' into upstream-PR17580-branch_ericcurtin-support-s…

9f34c9e

…afetensors

Revert "Apply overlay (.github from overlay)"

fb5ca5d

This reverts commit 98cb7a6.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #17580: Add safetensors support#351

UPSTREAM PR #17580: Add safetensors support#351
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17580-branch_ericcurtin-support-safetensors

loci-dev commented Nov 28, 2025

Uh oh!

loci-review bot commented Nov 28, 2025

Uh oh!

loci-review bot commented Nov 29, 2025

Uh oh!

loci-review bot commented Nov 30, 2025

Uh oh!

loci-review bot commented Nov 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

loci-dev commented Nov 28, 2025

Uh oh!

loci-review bot commented Nov 28, 2025

Performance Analysis Summary

Key Findings

Uh oh!

loci-review bot commented Nov 29, 2025

Performance Analysis Summary - PR #351: Safetensors Support

Overview

Key Findings

Inference Performance Impact

Model Loading Performance

Power Consumption Analysis

Implementation Status

Performance-Critical Areas

Uh oh!

loci-review bot commented Nov 30, 2025

Performance Analysis Summary - PR #351: Safetensors Support

Summary

Uh oh!

loci-review bot commented Nov 30, 2025

Performance Analysis Summary: PR #351 - Safetensors Support

Analysis Overview

Key Findings

Impact on Performance-Critical Areas

Power Consumption Analysis

STL Function Regressions (Non-PR Related)

Code Changes Summary

Inference Performance Impact

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants