UPSTREAM PR #19283: Support Step3.5-Flash #1146

Open
loci-dev wants to merge 13 commits into main from loci/pr-19283-pr-step3.5-flash
Conversation


@loci-dev loci-dev commented Feb 3, 2026


loci-review bot commented Feb 3, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from cd152fa to ab12294 Compare February 3, 2026 11:18

loci-review bot commented Feb 3, 2026

Overview

Analysis of 115,472 functions across 15 binaries reveals minimal performance impact from two commits adding Step3.5-Flash architecture support and fixing normalization weight indexing. Only 36 functions (0.03%) were modified, 47 were added, and 11 removed; 115,378 (99.92%) are unchanged.

Power Consumption Changes:

  • build.bin.libllama.so: +0.209% (+519.84 nJ)
  • build.bin.llama-tts: -0.0004% (-1.30 nJ)
  • build.bin.llama-cvector-generator: -0.0001% (-0.39 nJ)
  • build.bin.libmtmd.so: +0.00004% (+0.07 nJ)
  • build.bin.libggml-cpu.so: 0% (0 nJ)
  • build.bin.libggml-base.so: 0% (0 nJ)
  • build.bin.libggml.so: 0% (0 nJ)
  • build.bin.llama-tokenize: 0% (0 nJ)
  • build.bin.llama-quantize: 0% (0 nJ)
  • build.bin.llama-qwen2vl-cli: 0% (0 nJ)
  • build.bin.llama-bench: 0% (0 nJ)
  • build.bin.llama-gemma3-cli: 0% (0 nJ)
  • build.bin.llama-gguf-split: 0% (0 nJ)
  • build.bin.llama-llava-cli: 0% (0 nJ)
  • build.bin.llama-minicpmv-cli: 0% (0 nJ)

Critical inference paths (token processing, matrix operations, attention, KV cache) show zero changes.

Function Analysis

llama_hparams constructor (build.bin.libllama.so): Response time increased 493.71ns → 539.83ns (+9.34%, +46.12ns absolute). Throughput time identical. Added 5 new fields (~8.2KB) for Step3.5-Flash per-layer configuration: rope_freq_base_per_layer, rope_dim_per_layer, swiglu_limits, rope_scaling_apply_mask, and has_rope_freq_base_per_layer flag. One-time initialization cost during model loading, justified by enabling heterogeneous layer architectures.

get_rope_freq_base (build.bin.libllama.so): Response time increased 107.45ns → 133.58ns (+24.32%, +26.13ns). Throughput time increased 30.65ns → 56.39ns (+84.02%, +25.75ns). Added per-layer RoPE frequency lookup with new conditional branch checking has_rope_freq_base_per_layer before existing SWA logic. Called once per layer during graph construction (setup phase), not in per-token inference path.

llama_sampler_top_p_backend_apply (build.bin.libllama.so): Response time increased 808.24ns → 852.20ns (+5.44%, +43.96ns). Throughput time increased 294.39ns → 304.55ns (+3.45%, +10.16ns). No source code changes detected—regression attributed to binary layout changes from Step3.5-Flash code additions affecting instruction cache. Affects GPU-accelerated sampling graph construction, not GPU kernel execution.

STL Functions: Multiple C++ standard library functions showed mixed results. Improvements: _M_swap_data (-30.2%, -73ns), _M_allocate_buckets (-20.7%, -68ns), _M_is_line_terminator (-18.2%, -85ns). Regressions: _M_deallocate_buckets (+18.9%, +37ns), _M_complete_async (+18.0%, +188ns in optional validation), _S_check_init_len (+14.5%, +46ns). Net effect shows compensating improvements in hash table lifecycle (-31ns) and regex operations (-62ns).

Other analyzed functions (llama_supports_mlock, _M_apply, _M_realloc_insert) showed negligible changes in non-critical initialization and cleanup paths.

Additional Findings

All GGML GPU backend libraries (libggml-cpu.so, libggml-base.so, libggml.so) show 0% power consumption change, confirming zero impact on GPU kernels and tensor operations. The Step3.5-Flash architecture additions successfully integrate per-layer RoPE configuration, selective RoPE scaling via bitmask, and SwiGLU activation clamping without modifying performance-critical matrix multiplication, attention computation, or quantization kernels. Changes are appropriately isolated to the model architecture layer, preserving inference hot path performance while enabling advanced heterogeneous layer configurations for Step3.5 models.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@noamshiovits-Aurora

@loci-dev how many functions were changed, added, or deleted in the project source files?

@noamshiovits-Aurora

@loci-dev how many functions were changed, added, or deleted in the compiled binaries?


loci-review bot commented Feb 5, 2026

Based on the performance review analysis of the compiled binaries, here are the function change statistics:

Function Changes Across All Binaries

Total functions analyzed: 115,472 functions across 15 binaries

Breakdown:

  • Modified functions: 36 (0.03% of total)
  • Added functions: 47 (0.04% of total)
  • Removed functions: 11 (0.01% of total)
  • Unchanged functions: 115,378 (99.92% of total)

Summary

This represents a very small change footprint: only 94 functions in total were affected (modified, added, or removed) out of more than 115,000. The vast majority of the codebase (99.92%) remained unchanged, consistent with the focused nature of the changes: adding Step3.5-Flash architecture support and fixing a normalization weight indexing issue.

The small number of modifications (36 functions), with additions (47) well ahead of removals (11), indicates a targeted enhancement rather than a broad refactoring, which aligns with the minimal performance impact observed across the binaries.


loci-review bot commented Feb 5, 2026

Based on my analysis, here's the complete breakdown:


Compiled Binaries (Performance Analysis)

Across all 15 analyzed binaries:

  • Total functions analyzed: 115,472
  • Modified functions: 36 (0.03%)
  • Added functions: 47
  • Removed functions: 11
  • Unchanged functions: 115,378 (99.92%)

Source Code Changes

Files Changed: 6 files modified

Functions/Components Modified: 8 major functions

  1. load_hparams() - Added Step3.5 hyperparameter loading case
  2. load_tensors() - Added Step3.5 tensor loading (31 new tensor types)
  3. get_rope_freq_base() - Added per-layer RoPE frequency lookup
  4. build_graph() - Added Step3.5 graph builder case
  5. llama_rope_type() - Added Step3.5 RoPE type mapping
  6. build_ffn() - Added SwiGLU activation clamping logic
  7. build_moe_ffn() - Modified expert weight normalization + added MoE SwiGLU clamping
  8. set_gguf_parameters() - Added max_position_embedding parameter

New Components Added:

  • 1 new class: Step35Model (188 lines) - HuggingFace model converter
  • 4 new enum values: Architecture and hyperparameter keys
  • 31 new tensor types: For Step3.5 architecture support
  • Multiple new constants: RoPE and model architecture mappings

Summary:

  • Source files modified: 6
  • Major functions modified: 8
  • New classes added: 1
  • New enum values: 4
  • New tensor types: 31

The key difference is that while only 6 source files were modified with 8 major function changes, the compiled binaries show 36 modified functions due to:

  • Template instantiations
  • Inlined functions
  • C++ standard library functions affected by binary layout changes
  • Constructor/destructor variants
  • Compiler-generated helper functions

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 56aaa36 to 21147c2 Compare March 13, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 13 times, most recently from f9aec49 to e6c519b Compare March 22, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 8527fd7 to 135dbe7 Compare March 28, 2026 02:17