No summary available at this time. Visit Loci Inspector to review detailed analysis.
Force-pushed cd152fa to ab12294 (Compare)
Overview

Analysis of 115,472 functions across 15 binaries reveals minimal performance impact from two commits adding Step3.5-Flash architecture support and fixing normalization weight indexing. Only 36 functions were modified (0.03%), 47 were added, and 11 were removed; 115,378 (99.92%) are unchanged.

Power Consumption Changes

Critical inference paths (token processing, matrix operations, attention, KV cache) show zero changes.

Function Analysis

- llama_hparams constructor (build.bin.libllama.so): response time increased 493.71 ns → 539.83 ns (+9.34%, +46.12 ns absolute); throughput time identical. Adds 5 new fields (~8.2 KB) for Step3.5-Flash per-layer configuration.
- get_rope_freq_base (build.bin.libllama.so): response time increased 107.45 ns → 133.58 ns (+24.32%, +26.13 ns); throughput time increased 30.65 ns → 56.39 ns (+84.02%, +25.75 ns). Adds a per-layer RoPE frequency lookup with a new conditional branch.
- llama_sampler_top_p_backend_apply (build.bin.libllama.so): response time increased 808.24 ns → 852.20 ns (+5.44%, +43.96 ns); throughput time increased 294.39 ns → 304.55 ns (+3.45%, +10.16 ns). No source code changes detected; the regression is attributed to binary layout changes from the Step3.5-Flash code additions affecting the instruction cache. This touches GPU-accelerated sampling graph construction, not GPU kernel execution.
- STL functions: multiple C++ standard library functions showed mixed results.
- Improvements: other analyzed functions (llama_supports_mlock, _M_apply, _M_realloc_insert) showed negligible changes in non-critical initialization and cleanup paths.

Additional Findings

All GGML GPU backend libraries (libggml-cpu.so, libggml-base.so, libggml.so) show 0% power consumption change, confirming zero impact on GPU kernels and tensor operations. The Step3.5-Flash architecture additions integrate per-layer RoPE configuration, selective RoPE scaling via bitmask, and SwiGLU activation clamping without modifying performance-critical matrix multiplication, attention computation, or quantization kernels. The changes are isolated to the model architecture layer, preserving inference hot-path performance while enabling heterogeneous per-layer configurations for Step3.5 models.

🔎 Full breakdown: Loci Inspector.
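The per-layer RoPE frequency lookup described above can be pictured with a small sketch. This is illustrative only, assuming a per-layer vector with a fallback to a global value; the struct and field names here are hypothetical, not llama.cpp's actual `llama_hparams` layout:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a per-layer RoPE base-frequency getter.
// The extra bounds/override check is the kind of "new conditional branch"
// that would account for a small per-call latency increase.
struct hparams_sketch {
    float rope_freq_base_global = 10000.0f;
    std::vector<float> rope_freq_base_per_layer; // empty for models without overrides

    float get_rope_freq_base(uint32_t il) const {
        // New branch: prefer a positive per-layer override when present.
        if (il < rope_freq_base_per_layer.size() && rope_freq_base_per_layer[il] > 0.0f) {
            return rope_freq_base_per_layer[il];
        }
        return rope_freq_base_global;
    }
};
```

Layers without an override (or out-of-range indices) fall back to the global frequency, so non-Step3.5 models pay only the cost of the added check.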
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@loci-dev how many functions were changed, added, or deleted in the project source files?
@loci-dev how many functions were changed, added, or deleted in the compiled binaries?
Based on the performance review analysis of the compiled binaries, here are the function change statistics:

Function Changes Across All Binaries

Total functions analyzed: 115,472 across 15 binaries

Breakdown:
- Modified: 36 (0.03%)
- Added: 47
- Removed: 11
- Unchanged: 115,378 (99.92%)
Summary

This represents a very minimal change footprint: only 94 functions in total were affected (modified, added, or removed) out of more than 115,000. The vast majority of the codebase (99.92%) remained unchanged, consistent with the focused nature of the changes: adding Step3.5-Flash architecture support and fixing a normalization weight indexing issue. The small number of modifications (36 functions) and the relatively balanced additions (47) versus removals (11) indicate a targeted enhancement rather than a broad refactoring, which aligns with the minimal performance impact observed across the binaries.
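The reported counts are internally consistent, which a quick arithmetic check confirms (numbers taken from the analysis above; nothing here is new data):

```cpp
#include <cassert>

// Sanity-check the reported function counts:
// 36 modified + 47 added + 11 removed = 94 affected functions,
// and unchanged + affected should equal the 115,472 functions analyzed
// (the 11 removed functions existed in the old binaries).
constexpr int modified  = 36;
constexpr int added     = 47;
constexpr int removed   = 11;
constexpr int unchanged = 115378;
constexpr int total     = 115472;
constexpr int affected  = modified + added + removed;
```

Note that 115,378 / 115,472 ≈ 99.92%, matching the "unchanged" percentage quoted in the overview.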
Based on my analysis, here's the complete breakdown:

Compiled Binaries (Performance Analysis)

Across all 15 analyzed binaries:
- Modified: 36 functions
- Added: 47 functions
- Removed: 11 functions
- Unchanged: 115,378 functions
Source Code Changes

Files changed: 6 files modified
Functions/components modified: 8 major functions
New Components Added:
Summary:
The key difference is that while only 6 source files were modified, containing 8 major function changes, the compiled binaries show 36 modified functions due to:
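One common mechanism behind this kind of fan-out, offered here as a general illustration rather than a claim about this PR's specifics, is template instantiation: a single source-level template compiles into a distinct symbol for every type it is instantiated with.

```cpp
#include <cassert>
#include <string>

// One template function in source ...
template <typename T>
T twice(T x) { return x + x; }

// ... but twice<int>, twice<double>, and twice<std::string> are three
// separate functions in the compiled binary, so editing the template
// body would show up as three "modified functions" in a binary-level diff.
```

Inlining and compiler-generated helpers (constructors, destructors, lambda bodies) can multiply the count further in the same way.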
Force-pushed 56aaa36 to 21147c2 (Compare)
Force-pushed f9aec49 to e6c519b (Compare)
Force-pushed 8527fd7 to 135dbe7 (Compare)
Note
Source pull request: ggml-org/llama.cpp#19283
This PR adds support for the Step3.5-Flash model architecture.
github:
huggingface: