UPSTREAM PR #19283: Support Step3.5-Flash #1146

Open
loci-dev wants to merge 13 commits into main from loci/pr-19283-pr-step3.5-flash
Conversation


@loci-dev loci-dev commented Feb 3, 2026


loci-review bot commented Feb 3, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev loci-dev force-pushed the main branch 7 times, most recently from cd152fa to ab12294 Compare February 3, 2026 11:18

loci-review bot commented Feb 3, 2026

Overview

Analysis of 115,472 functions across 15 binaries reveals minimal performance impact from two commits adding Step3.5-Flash architecture support and fixing normalization weight indexing. Only 36 functions (0.03%) were modified, 47 were added, and 11 removed; 115,378 (99.92%) are unchanged.

Power Consumption Changes:

  • build.bin.libllama.so: +0.209% (+519.84 nJ)
  • build.bin.llama-tts: -0.0004% (-1.30 nJ)
  • build.bin.llama-cvector-generator: -0.0001% (-0.39 nJ)
  • build.bin.libmtmd.so: +0.00004% (+0.07 nJ)
  • build.bin.libggml-cpu.so: 0% (0 nJ)
  • build.bin.libggml-base.so: 0% (0 nJ)
  • build.bin.libggml.so: 0% (0 nJ)
  • build.bin.llama-tokenize: 0% (0 nJ)
  • build.bin.llama-quantize: 0% (0 nJ)
  • build.bin.llama-qwen2vl-cli: 0% (0 nJ)
  • build.bin.llama-bench: 0% (0 nJ)
  • build.bin.llama-gemma3-cli: 0% (0 nJ)
  • build.bin.llama-gguf-split: 0% (0 nJ)
  • build.bin.llama-llava-cli: 0% (0 nJ)
  • build.bin.llama-minicpmv-cli: 0% (0 nJ)

Critical inference paths (token processing, matrix operations, attention, KV cache) show zero changes.

Function Analysis

llama_hparams constructor (build.bin.libllama.so): Response time increased 493.71ns → 539.83ns (+9.34%, +46.12ns absolute). Throughput time identical. Added 5 new fields (~8.2KB) for Step3.5-Flash per-layer configuration: rope_freq_base_per_layer, rope_dim_per_layer, swiglu_limits, rope_scaling_apply_mask, and has_rope_freq_base_per_layer flag. One-time initialization cost during model loading, justified by enabling heterogeneous layer architectures.

get_rope_freq_base (build.bin.libllama.so): Response time increased 107.45ns → 133.58ns (+24.32%, +26.13ns). Throughput time increased 30.65ns → 56.39ns (+84.02%, +25.75ns). Added per-layer RoPE frequency lookup with new conditional branch checking has_rope_freq_base_per_layer before existing SWA logic. Called once per layer during graph construction (setup phase), not in per-token inference path.

llama_sampler_top_p_backend_apply (build.bin.libllama.so): Response time increased 808.24ns → 852.20ns (+5.44%, +43.96ns). Throughput time increased 294.39ns → 304.55ns (+3.45%, +10.16ns). No source code changes detected—regression attributed to binary layout changes from Step3.5-Flash code additions affecting instruction cache. Affects GPU-accelerated sampling graph construction, not GPU kernel execution.

STL Functions: Multiple C++ standard library functions showed mixed results. Improvements: _M_swap_data (-30.2%, -73ns), _M_allocate_buckets (-20.7%, -68ns), _M_is_line_terminator (-18.2%, -85ns). Regressions: _M_deallocate_buckets (+18.9%, +37ns), _M_complete_async (+18.0%, +188ns in optional validation), _S_check_init_len (+14.5%, +46ns). Net effect shows compensating improvements in hash table lifecycle (-31ns) and regex operations (-62ns).

Other analyzed functions (llama_supports_mlock, _M_apply, _M_realloc_insert) showed negligible changes in non-critical initialization and cleanup paths.

Additional Findings

All GGML GPU backend libraries (libggml-cpu.so, libggml-base.so, libggml.so) show 0% power consumption change, confirming zero impact on GPU kernels and tensor operations. The Step3.5-Flash architecture additions successfully integrate per-layer RoPE configuration, selective RoPE scaling via bitmask, and SwiGLU activation clamping without modifying performance-critical matrix multiplication, attention computation, or quantization kernels. Changes are appropriately isolated to the model architecture layer, preserving inference hot path performance while enabling advanced heterogeneous layer configurations for Step3.5 models.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@noamshiovits-Aurora

@loci-dev how many functions were changed, added, or deleted in the project source files?

@noamshiovits-Aurora

@loci-dev how many functions were changed, added, or deleted in the compiled binaries?


loci-review bot commented Feb 5, 2026

Based on the performance review analysis of the compiled binaries, here are the function change statistics:

Function Changes Across All Binaries

Total functions analyzed: 115,472 functions across 15 binaries

Breakdown:

  • Modified functions: 36 (0.03% of total)
  • Added functions: 47 (0.04% of total)
  • Removed functions: 11 (0.01% of total)
  • Unchanged functions: 115,378 (99.92% of total)

Summary

This represents a very small change footprint: only 94 functions in total were affected (modified, added, or removed) out of more than 115,000. The vast majority of the codebase (99.92%) remained unchanged, consistent with the focused nature of the changes: adding Step3.5-Flash architecture support and fixing a normalization weight indexing issue.

The small number of modifications (36 functions), with additions (47) well ahead of removals (11), indicates a targeted enhancement rather than a broad refactoring, which aligns with the minimal performance impact observed across the binaries.


loci-review bot commented Feb 5, 2026

Based on my analysis, here's the complete breakdown:


Compiled Binaries (Performance Analysis)

Across all 15 analyzed binaries:

  • Total functions analyzed: 115,472
  • Modified functions: 36 (0.03%)
  • Added functions: 47
  • Removed functions: 11
  • Unchanged functions: 115,378 (99.92%)

Source Code Changes

Files Changed: 6 files modified

Functions/Components Modified: 8 major functions

  1. load_hparams() - Added Step3.5 hyperparameter loading case
  2. load_tensors() - Added Step3.5 tensor loading (31 new tensor types)
  3. get_rope_freq_base() - Added per-layer RoPE frequency lookup
  4. build_graph() - Added Step3.5 graph builder case
  5. llama_rope_type() - Added Step3.5 RoPE type mapping
  6. build_ffn() - Added SwiGLU activation clamping logic
  7. build_moe_ffn() - Modified expert weight normalization + added MoE SwiGLU clamping
  8. set_gguf_parameters() - Added max_position_embedding parameter

New Components Added:

  • 1 new class: Step35Model (188 lines) - HuggingFace model converter
  • 4 new enum values: Architecture and hyperparameter keys
  • 31 new tensor types: For Step3.5 architecture support
  • Multiple new constants: RoPE and model architecture mappings

Summary:

  • Source files modified: 6
  • Major functions modified: 8
  • New classes added: 1
  • New enum values: 4
  • New tensor types: 31

The key difference is that while only 6 source files were modified with 8 major function changes, the compiled binaries show 36 modified functions due to:

  • Template instantiations
  • Inlined functions
  • C++ standard library functions affected by binary layout changes
  • Constructor/destructor variants
  • Compiler-generated helper functions

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 56aaa36 to 21147c2 Compare March 13, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 13 times, most recently from f9aec49 to e6c519b Compare March 22, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 8527fd7 to 135dbe7 Compare March 28, 2026 02:17