Skip to content

UPSTREAM PR #17580: Add safetensors support#351

Open
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17580-branch_ericcurtin-support-safetensors
Open

UPSTREAM PR #17580: Add safetensors support#351
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17580-branch_ericcurtin-support-safetensors

Conversation

@loci-dev
Copy link
Copy Markdown

Mirrored from ggml-org/llama.cpp#17580

So we can load these natively just like gguf

@loci-review
Copy link
Copy Markdown

loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #351: Safetensors Support Implementation

This PR introduces 1,663 lines of new code across 11 files to add safetensors format support. The implementation is incomplete and non-functional, with all model loading functions returning "not yet implemented" errors. No existing code paths are modified, resulting in zero performance impact on current operations.

Key Findings

Performance-Critical Areas Impact:

The changes do not affect any performance-critical functions identified in the project summary. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Model loading functions (llama_model_load_from_file, llama_init_from_model) are unchanged. Memory management (llama_memory_clear, llama_kv_cache operations) and batch processing (llama_batch_init, llama_decode) show no modifications.

Tokens Per Second Impact:

No impact on inference throughput. The tokenization and inference pipeline remains untouched. Functions responsible for token processing (llama_tokenize, llama_detokenize, llama_decode, llama_encode) show no changes in response time or throughput. The reference benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U) would maintain current tokens per second performance.

Power Consumption Analysis:

Analysis shows a 10.90% increase in estimated power consumption for build.bin.libllama.so (214,109 nJ vs 193,066 nJ baseline, +21,043 nJ absolute change). This increase is attributed to STL container operations showing throughput regressions:

  • std::vector<size_t>::empty() +134 ns throughput
  • std::back_inserter<std::vector> +24 ns throughput
  • std::vector::back() +29 ns throughput
  • std::vector<llm_symbol>::end() +24 ns throughput

Other binaries show minimal changes: llama-tts (+0.07%), llama-gguf-split (+0.03%), llama-quantize (+0.02%), with llama-run (-0.10%) and llama-cvector-generator (-0.08%) showing slight improvements.

Code Implementation Analysis:

The PR adds infrastructure for parsing safetensors files (llama-safetensors.cpp, 398 lines), HuggingFace config parsing (llama-hf-config.cpp, 220 lines), type conversion utilities (llama-safetensors-types.cpp, 157 lines), and tensor name mapping (llama-safetensors-loader.cpp, 271 lines). The model builder (llama-model-from-safetensors.cpp, 218 lines) defines an 8-step loading pipeline but implements only steps 1-3. Steps 4-8 (create_model_structure, allocate_tensors, load_tensor_data, init_vocabulary, finalize_model) return false with error messages.

The implementation uses C-style FILE* operations for file I/O, nlohmann/json for parsing, and std::regex for tensor name mapping. Type conversion functions support F32, F16, BF16, I32, I16, I8 formats with element-wise loops for conversions.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from f077805 to eec18ea Compare November 29, 2025 13:13
@loci-dev loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from ff29a86 to a963646 Compare November 29, 2025 13:37
@loci-review
Copy link
Copy Markdown

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #351: Safetensors Support

Overview

PR #351 introduces safetensors model loading capability through 11 new files (2296 lines). This is an additive feature with no modifications to existing inference paths. The performance analysis reveals no impact on runtime inference performance, as the new code affects only the model loading phase.

Key Findings

Inference Performance Impact

No impact on tokens per second. The safetensors loading path does not modify any inference-critical functions:

  • llama_decode - unchanged
  • llama_encode - unchanged
  • llama_tokenize - unchanged

All inference operations use the same GGML backend and tensor structures regardless of whether the model was loaded from GGUF or safetensors format. Once loaded, model execution is identical.

Model Loading Performance

The new safetensors loader exhibits different characteristics compared to GGUF:

Type Conversion Operations:

  • F32→F16 conversion: element-wise loop processing 16M-2B elements per model
  • F64→F32 downcast: similar element-wise processing
  • Direct memcpy for matching types (F32→F32, F16→F16)

For a 7B parameter model, type conversion adds approximately 10000-30000 ms to load time. This is a one-time cost during model initialization and does not affect subsequent inference.

Tensor Name Mapping:

  • Uses std::regex for pattern matching on 300+ tensor names
  • Adds 300-500 ms overhead during load
  • String concatenation for name construction

File I/O Pattern:

  • Sequential fread operations with fseek for tensor data
  • Temporary buffer allocation per tensor (peak memory = model size + largest tensor)
  • No memory-mapped file support

Power Consumption Analysis

Binary-level impact:

  • libllama.so: No change (0.0%) - safetensors code is separate module
  • New binaries: No new executables added, only library code

The safetensors loading functions are not included in the power consumption baseline as they represent new, optional code paths. When active, the loading phase will consume additional CPU cycles for type conversion and file I/O, but this is transient and does not affect steady-state inference power consumption.

Implementation Status

Incomplete vocabulary loading: The init_vocabulary() function returns success without loading tokenizer data. Models load structurally but lack vocabulary for text processing. This affects the usability of safetensors-loaded models but does not impact performance metrics of existing GGUF-based workflows.

Performance-Critical Areas

Model Loading Module:

  • New alternative path for safetensors format
  • Existing GGUF path unchanged
  • No shared code between loaders

Memory Management Module:

  • Uses standard ggml_backend_alloc_ctx_tensors() allocation
  • Same backend buffer management as GGUF
  • No changes to KV cache or memory recurrent systems

Token Processing Module:

  • Vocabulary initialization incomplete
  • No changes to existing tokenization functions
  • Safetensors path does not affect active tokenization performance

The implementation is architecturally sound as an isolated feature addition. The performance characteristics differ from GGUF loading but do not regress existing functionality. Inference performance remains unchanged as the new code operates exclusively in the model initialization phase.

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from f96421a to 1854a53 Compare November 30, 2025 13:13
@loci-dev loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from a963646 to 43efc4d Compare November 30, 2025 13:37
@loci-review
Copy link
Copy Markdown

loci-review bot commented Nov 30, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #351: Safetensors Support

Condition Assessment: Condition 1 applies - No meaningful performance impact from code changes.

Summary

PR #351 adds native safetensors format support through 3,055 lines of new code across 20 files. The observed performance variations are artifacts of build differences rather than functional regressions. The STL iterator functions showing +130-226% changes (60-195 ns absolute) are compiler optimization issues affecting trivial inline operations. The common_get_hf_file function gained +1,333,000 ns response time due to new safetensors detection logic (additional HTTP API call and JSON parsing), which executes only during initial model download, not inference. Core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Tokens per second is unaffected. Power consumption increased 13% in libllama.so due to new safetensors loading infrastructure, but this impacts model loading time only, not inference throughput.

So we can load these natively just like gguf

Signed-off-by: Eric Curtin <[email protected]>
@loci-dev loci-dev force-pushed the upstream-PR17580-branch_ericcurtin-support-safetensors branch from 43efc4d to 34c53c1 Compare November 30, 2025 17:35
@loci-review
Copy link
Copy Markdown

loci-review bot commented Nov 30, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #351 - Safetensors Support

Project: llama.cpp
PR #351: Add safetensors support
Scope: 20 files changed (+3,322 lines, -24 lines)


Analysis Overview

This PR adds native safetensors model loading capability alongside existing GGUF support. The implementation introduces 5 new source files and modifies 3 existing modules in the model loading and download subsystems.


Key Findings

Impact on Performance-Critical Areas

Model Loading Module:

  • common_get_hf_file: +1,312,000 ns response time increase

    • New safetensors detection logic adds HTTP request to HF API /tree/main endpoint
    • JSON parsing for file list enumeration
    • Extension checking for .safetensors files
    • Impact: One-time cost during model initialization, not in inference path
  • ~common_hf_file_res: +285 ns response time increase

    • New std::vector<std::string> safetensors_files member added to struct
    • Vector destructor deallocates heap memory for file list
    • Impact: Minimal, occurs only during cleanup after model loading
  • common_params_handle_model: +912 ns throughput increase

    • New branching logic for safetensors vs GGUF download paths
    • Directory creation and multi-file download list construction
    • String manipulation for path sanitization
    • Impact: One-time cost during model parameter processing

Token Processing Module:

  • llama_vocab::load_from_hf_tokenizer: New function (224 lines)
    • Parses HuggingFace tokenizer.json format
    • Loads vocabulary, BPE merges, and special tokens
    • Builds token_to_piece cache with atomic swap
    • Impact: Load-time only, runtime tokenization unchanged

Inference Path Analysis:

  • llama_decode: No modifications
  • llama_encode: No modifications
  • llama_tokenize: No modifications
  • llama_batch_*: No modifications

Tokens Per Second Impact: Zero. The PR does not modify any inference or tokenization runtime functions. All changes are isolated to model loading and initialization paths. Using the reference that 2 ms slower llama_decode results in 7% tokens per second reduction, this PR introduces zero inference slowdown as llama_decode response time is unchanged.

Exception: Debug logging added to tools/run/run.cpp introduces fprintf calls in the generation loop. This adds 5,000-20,000 ns per token overhead, which would reduce tokens per second by approximately 10-30% for the reference model. This debug code should be removed.

Power Consumption Analysis

libllama.so: +26,189 nJ (+13.57%)

  • PR-attributable: ~3,000-6,000 nJ (~1.5-3%)
    • New safetensors loading functions compiled into library
    • HuggingFace tokenizer parsing functions
    • Type conversion utilities
  • Non-PR attributable: ~20,000-23,000 nJ (~10-12%)
    • STL iterator functions show debug-mode compilation pattern
    • Functions like std::_Rb_tree::end(), std::vector::empty() show 150-226% throughput increases
    • Root cause: Build configuration issue, not code changes

llama-cvector-generator: +219 nJ (+0.10%)

  • Download module changes: common_get_hf_file and ~common_hf_file_res
  • Impact: Negligible for inference workloads

llama-tts: +296 nJ (+0.13%)

  • Same download module impact as llama-cvector-generator
  • Impact: Negligible for inference workloads

llama-bench, llama-run, llama-quantize, llama-tokenize: <0.1% change

  • Minimal to no impact from PR changes

STL Function Regressions (Non-PR Related)

Eight STL functions show significant regressions unrelated to code changes:

  • std::_Rb_tree::end(): +135 ns (single basic block, 2→7 instructions)
  • std::_Rb_tree_const_iterator::_M_const_cast(): +131 ns (single basic block, 2→7 instructions)
  • std::vector::empty(): +134 ns (trivial operation with stack frame overhead)
  • std::vector::back(): +129 ns (similar pattern)
  • make_move_iterator: +117 ns (identity function with 10 instructions)
  • back_inserter: +31 ns (wrapper with unnecessary stack operations)

CFG analysis confirms these functions maintain single basic block structure with no control flow changes. Assembly comparison shows debug-mode compilation pattern with unnecessary stack frame setup and redundant store-reload cycles. This accounts for 75-88% of the observed power consumption increase in libllama.so.

Code Changes Summary

New Functionality:

  1. Safetensors file parser with metadata validation
  2. HuggingFace config.json parser with architecture detection
  3. Tensor name mapper for HF→llama.cpp naming conventions
  4. Model builder with 10-step loading pipeline
  5. Type conversion utilities for safetensors→GGML formats
  6. HuggingFace tokenizer.json loader
  7. Multi-file download orchestration for safetensors models

Modified Functionality:

  1. common_get_hf_file: Fallback to safetensors detection when GGUF not found
  2. common_params_handle_model: Branch for safetensors vs GGUF download paths
  3. llama_model_load_from_file_impl: Format detection and routing
  4. llama_model: New methods for device initialization and buffer registration

Correctness Considerations:

  • Head permutation applied to Q/K attention weights reverses HF training-time transformation
  • Dimension reversal for PyTorch→GGML tensor layout conversion
  • Special token detection from multiple sources (tokenizer.json, tokenizer_config.json)

Inference Performance Impact

Runtime Functions: Unaffected

  • Tokenization algorithms unchanged (same BPE/WordPiece/Unigram implementations)
  • Inference graph construction unchanged
  • Batch processing unchanged
  • KV cache management unchanged
  • Backend computation unchanged

Load-Time Functions: Extended

  • Model loading: +10-20% for safetensors format (additional parsing and conversion)
  • Vocabulary loading: +50-100 ms for HF tokenizer format (JSON parsing)
  • Impact: One-time cost, amortized over inference lifetime

Generation Loop: Debug logging regression

  • fprintf calls add 5,000-20,000 ns per token
  • Estimated tokens per second reduction: 10-30% for reference model
  • This is the only PR change affecting inference performance

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 333626d to 82b1c0b Compare December 1, 2025 19:10
@loci-dev loci-dev force-pushed the main branch 27 times, most recently from e81a7eb to 806b364 Compare December 5, 2025 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants