
Conversation


@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16985

This adds the following extra functions:

ggml_conv_2d_circular
ggml_conv_2d_dw_circular
ggml_conv_2d_dw_direct_circular
ggml_conv_transpose_2d_p0_circular
ggml_conv_2d_direct_circular
ggml_pad_circular
ggml_pad_ext_circular

These have signatures equivalent to the non-circular versions (I considered modifying the existing ones, but didn't want to break existing code). Instead of padding with zeros, they act "on a torus" and wrap x and y around.
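
For illustration, a minimal usage sketch, assuming the circular variant mirrors ggml_conv_2d's signature (context, kernel, input, stride, padding, dilation); the wrapper name here is hypothetical:

```cpp
#include "ggml.h"

// Sketch: the circular variant as a drop-in for ggml_conv_2d, assuming
// it shares ggml_conv_2d's signature (kernel, input, stride s0/s1,
// padding p0/p1, dilation d0/d1).
static struct ggml_tensor * conv_seamless(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,
        struct ggml_tensor  * input) {
    // Taps that fall outside the image wrap around in x and y instead
    // of reading zero padding, so the output tiles seamlessly.
    return ggml_conv_2d_circular(ctx, kernel, input,
                                 /*s0=*/1, /*s1=*/1,
                                 /*p0=*/1, /*p1=*/1,
                                 /*d0=*/1, /*d1=*/1);
}
```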

I implemented this for CUDA, CPU, and Vulkan, as those are the primary backends people use in KoboldCpp/Stable Diffusion Cpp to generate images. For other backends, it'll fall back to the non-circular behavior.

This can be used to make seamless textures; see leejet/stable-diffusion.cpp#914 for an example and the changes needed on the image-generation side. For some models (Stable Diffusion), simply calling the circular functions is sufficient; for others (Qwen Image), you also need to modify the RoPE embeddings slightly so they loop cleanly.

I ran the CI tests and added tests for these, but I'm happy to answer any questions or modify things as needed.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Changes

Primary Impact: Convolution Operations

Function: ggml_compute_forward_conv_2d_dw_cwhn in build.bin.libggml-cpu.so

  • Response Time: 2332 ns (+64% from 1423 ns baseline)
  • Throughput: 2271 ns (+60% from 1423 ns baseline)
  • Bottleneck: 1836 ns (+29% from 1423 ns baseline)
  • Control Flow Changes: Function modified with circular tiling support, introducing dual execution paths and coordinate wrapping function calls

Secondary Impact: Memory Management

Function: emplace_back (vector operations) in build.bin.libggml-base.so

  • Bottleneck: 88 ns (+70% from 52 ns baseline)
  • Throughput: 342 ns (+12% from 306 ns baseline)
  • Response Time: 25698 ns (minimal change from 25661 ns baseline)

KPI Impact Analysis

1. Tokens Per Second Impact

Reference Baseline: 7% reduction in tokens/second when llama_decode increases by 2 ms

Direct Impact Functions:

  • Convolution Operations: The 909 ns increase in ggml_compute_forward_conv_2d_dw_cwhn affects tensor operations within inference pipeline
  • No Direct Impact: Core inference functions (llama_decode, llama_encode, llama_tokenize) show no performance changes in the analyzed data

Assessment: Minimal direct impact on tokens/second as primary inference functions remain unchanged. Convolution degradation affects specific model architectures using depthwise convolutions.

2. Power Consumption Impact

Binary-Level Changes:

  • build.bin.libggml-cpu.so: +0.019% increase (150595 nJ vs 150567 nJ)
  • build.bin.libggml-base.so: +0.055% increase (87810 nJ vs 87761 nJ)
  • build.bin.libllama.so: -0.001% decrease (280662 nJ vs 280664 nJ)
  • All other binaries: No measurable changes

Total System Impact: <0.1% across all binaries, indicating minimal power consumption changes.

3. Quantization Efficiency

No Impact Detected: Analysis shows no changes to quantization-related functions:

  • llama_model_quantize() - No performance changes detected
  • Quantization format handling functions remain unchanged
  • GGML quantization operations show no degradation

4. Memory Usage Impact

Affected Functions:

  • emplace_back operations: +70% bottleneck increase suggests memory allocation overhead
  • Stack frame expansion: Convolution function shows 76% increase in stack allocation (704 vs 400 bytes)

Memory Management Functions: No changes detected in core memory functions:

  • llama_memory_clear(), llama_memory_seq_rm(), llama_memory_seq_cp() show no performance impact

5. Batch Processing Impact

No Direct Impact: Core batch processing functions show no performance changes:

  • llama_batch_init(), llama_batch_get_one(), llama_batch_free() remain unchanged
  • llama_decode() with batch processing shows no degradation

Root Cause Analysis

Convolution Performance Degradation

Primary Causes:

  • Code Duplication: Complete loop structure duplicated for circular vs non-circular execution paths
  • Function Call Overhead: 4 calls to ggml_wrap_coord() per kernel iteration (15 ns each)
  • Stack Memory Pressure: 304 bytes additional stack allocation per function call
  • Branch Complexity: Additional conditional logic impacts CPU branch prediction

Memory Allocation Bottleneck

Contributing Factors:

  • Vector Expansion: emplace_back operations show increased internal bottleneck time
  • Constructor Complexity: GGUF key-value object construction overhead increased

Action Items for Performance Optimization

Immediate Code-Level Optimizations

  1. Inline Coordinate Wrapping: Replace ggml_wrap_coord() function calls with inline modulo operations to eliminate the 60 ns per-iteration overhead (see the sketch after this list)
  2. Conditional Compilation: Use preprocessor directives to separate circular/non-circular code paths, reducing runtime branching
  3. Stack Frame Optimization: Reduce 304-byte stack overhead through better variable management and register utilization
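
As a sketch of item 1 (assuming ggml_wrap_coord() computes a standard torus wrap), the per-tap call could be replaced by a branch-free inline modulo:

```cpp
#include <cstdint>

// Sketch of the proposed inlining. Assumes ggml_wrap_coord() performs
// a standard torus wrap; this branch-free form maps any int64_t
// coordinate into [0, n) for n > 0, including negative coordinates.
static inline int64_t wrap_coord(int64_t c, int64_t n) {
    return ((c % n) + n) % n;
}
```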

Build System Optimizations

  1. Template Specialization: Create separate template specializations for circular vs non-circular operations to eliminate runtime path selection (sketched after this list)
  2. Link-Time Optimization: Enable LTO to optimize across compilation units and reduce function call overhead
  3. Profile-Guided Optimization: Use PGO to optimize branch prediction for the dual-path architecture
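
A sketch of item 1, using a hypothetical sampler rather than the actual ggml-cpu kernel, showing how a bool template parameter hoists the circular/zero-padded decision to compile time:

```cpp
#include <cstdint>

// Sketch: the padding-mode branch is resolved at compile time, so each
// instantiation contains a single execution path. The sampler below is
// a hypothetical stand-in, not the actual ggml-cpu kernel.
template <bool CIRCULAR>
static float sample(const float * src, int64_t x, int64_t y,
                    int64_t w, int64_t h) {
    if constexpr (CIRCULAR) {
        x = ((x % w) + w) % w;   // wrap onto the torus
        y = ((y % h) + h) % h;
    } else if (x < 0 || x >= w || y < 0 || y >= h) {
        return 0.0f;             // zero padding
    }
    return src[y * w + x];
}
// Call sites pick sample<true> or sample<false> once, outside the hot loop.
```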

Memory Management Improvements

  1. Vector Pre-allocation: Pre-allocate vector capacity in GGUF parsing to reduce emplace_back reallocation overhead (sketched after this list)
  2. Memory Pool Usage: Implement memory pools for frequently allocated GGUF objects
  3. Stack-to-Register Migration: Move frequently accessed variables from stack to registers in convolution loops
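
A sketch of item 1; gguf_kv here is a simplified stand-in, and read_key()/read_value() are hypothetical readers:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch: size the vector once before the parse loop so emplace_back
// never reallocates mid-parse.
struct gguf_kv {
    std::string key;
    std::string value;
};

std::string read_key();    // hypothetical stream readers
std::string read_value();

void parse_kv_pairs(std::vector<gguf_kv> & kv, uint64_t n_kv) {
    kv.reserve(kv.size() + n_kv);  // single up-front allocation
    for (uint64_t i = 0; i < n_kv; ++i) {
        kv.emplace_back(gguf_kv{read_key(), read_value()});
    }
}
```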

Performance Impact Assessment

Overall System Impact: The changes introduce localized performance degradation in convolution operations without affecting core inference pipeline functions. The 64% increase in convolution response time primarily impacts models using depthwise convolutions, while standard transformer inference remains unaffected.

Critical Path Analysis: Core LLaMA.cpp inference functions (llama_decode, llama_tokenize, memory management) show no performance regression, maintaining inference throughput for standard language model operations.

@DajanaV DajanaV force-pushed the main branch 20 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 87bfdb3 to a14857a on November 11, 2025 19:07