
Conversation


@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16985

This adds the following extra functions:

ggml_conv_2d_circular
ggml_conv_2d_dw_circular
ggml_conv_2d_dw_direct_circular
ggml_conv_transpose_2d_p0_circular
ggml_conv_2d_direct_circular
ggml_pad_circular
ggml_pad_ext_circular

These have signatures equivalent to the non-circular versions (I considered modifying the existing ones, but didn't want to break existing code). Instead of padding with zeros, they act "on a torus" and wrap x and y around.
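
For illustration, a minimal usage sketch, assuming the circular variant mirrors ggml_conv_2d's signature (context, kernel, input, stride, padding, dilation); the wrapper name here is hypothetical:

```cpp
#include "ggml.h"

// Sketch: the circular variant as a drop-in for ggml_conv_2d, assuming
// it shares ggml_conv_2d's signature (kernel, input, stride s0/s1,
// padding p0/p1, dilation d0/d1).
static struct ggml_tensor * conv_seamless(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,
        struct ggml_tensor  * input) {
    // Taps that fall outside the image wrap around in x and y instead
    // of reading zero padding, so the output tiles seamlessly.
    return ggml_conv_2d_circular(ctx, kernel, input,
                                 /*s0=*/1, /*s1=*/1,
                                 /*p0=*/1, /*p1=*/1,
                                 /*d0=*/1, /*d1=*/1);
}
```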

I implemented this for CUDA, CPU, and Vulkan, as those are the primary backends people use in KoboldCpp/Stable Diffusion Cpp to generate images. For other backends, it'll fall back to the non-circular behavior.

This can be used to make seamless textures; see leejet/stable-diffusion.cpp#914 for an example and the changes needed on the image-generation side. For some models (Stable Diffusion), simply calling the circular functions is sufficient; for others (Qwen Image), you also need to modify the RoPE embeddings slightly so they loop cleanly.

I ran the CI tests and added tests for these, but I'm happy to answer any questions or modify things as needed.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Changes

Primary Impact: Convolution Operations

Function: ggml_compute_forward_conv_2d_dw_cwhn in build.bin.libggml-cpu.so

  • Response Time: 2332 ns (+64% from 1423 ns baseline)
  • Throughput: 2271 ns (+60% from 1423 ns baseline)
  • Bottleneck: 1836 ns (+29% from 1423 ns baseline)
  • Control Flow Changes: Function modified with circular tiling support, introducing dual execution paths and coordinate wrapping function calls

Secondary Impact: Memory Management

Function: emplace_back (vector operations) in build.bin.libggml-base.so

  • Bottleneck: 88 ns (+70% from 52 ns baseline)
  • Throughput: 342 ns (+12% from 306 ns baseline)
  • Response Time: 25698 ns (minimal change from 25661 ns baseline)

KPI Impact Analysis

1. Tokens Per Second Impact

Reference Baseline: 7% reduction in tokens/second when llama_decode increases by 2 ms

Direct Impact Functions:

  • Convolution Operations: The 909 ns increase in ggml_compute_forward_conv_2d_dw_cwhn affects tensor operations within inference pipeline
  • No Direct Impact: Core inference functions (llama_decode, llama_encode, llama_tokenize) show no performance changes in the analyzed data

Assessment: Minimal direct impact on tokens/second as primary inference functions remain unchanged. Convolution degradation affects specific model architectures using depthwise convolutions.

2. Power Consumption Impact

Binary-Level Changes:

  • build.bin.libggml-cpu.so: +0.019% increase (150595 nJ vs 150567 nJ)
  • build.bin.libggml-base.so: +0.055% increase (87810 nJ vs 87761 nJ)
  • build.bin.libllama.so: -0.001% decrease (280662 nJ vs 280664 nJ)
  • All other binaries: No measurable changes

Total System Impact: <0.1% across all binaries, indicating minimal power consumption changes.

3. Quantization Efficiency

No Impact Detected: Analysis shows no changes to quantization-related functions:

  • llama_model_quantize() - No performance changes detected
  • Quantization format handling functions remain unchanged
  • GGML quantization operations show no degradation

4. Memory Usage Impact

Affected Functions:

  • emplace_back operations: +70% bottleneck increase suggests memory allocation overhead
  • Stack frame expansion: Convolution function shows 76% increase in stack allocation (704 vs 400 bytes)

Memory Management Functions: No changes detected in core memory functions:

  • llama_memory_clear(), llama_memory_seq_rm(), llama_memory_seq_cp() show no performance impact

5. Batch Processing Impact

No Direct Impact: Core batch processing functions show no performance changes:

  • llama_batch_init(), llama_batch_get_one(), llama_batch_free() remain unchanged
  • llama_decode() with batch processing shows no degradation

Root Cause Analysis

Convolution Performance Degradation

Primary Causes:

  • Code Duplication: Complete loop structure duplicated for circular vs non-circular execution paths
  • Function Call Overhead: 4 calls to ggml_wrap_coord() per kernel iteration (15 ns each)
  • Stack Memory Pressure: 304 bytes additional stack allocation per function call
  • Branch Complexity: Additional conditional logic impacts CPU branch prediction

Memory Allocation Bottleneck

Contributing Factors:

  • Vector Expansion: emplace_back operations show increased internal bottleneck time
  • Constructor Complexity: GGUF key-value object construction overhead increased

Action Items for Performance Optimization

Immediate Code-Level Optimizations

  1. Inline Coordinate Wrapping: Replace ggml_wrap_coord() function calls with inline modulo operations to eliminate the 60 ns per-iteration overhead (see the sketch after this list)
  2. Conditional Compilation: Use preprocessor directives to separate circular/non-circular code paths, reducing runtime branching
  3. Stack Frame Optimization: Reduce 304-byte stack overhead through better variable management and register utilization
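
As a sketch of item 1 (assuming ggml_wrap_coord() computes a standard torus wrap), the per-tap call could be replaced by a branch-free inline modulo:

```cpp
#include <cstdint>

// Sketch of the proposed inlining. Assumes ggml_wrap_coord() performs
// a standard torus wrap; this branch-free form maps any int64_t
// coordinate into [0, n) for n > 0, including negative coordinates.
static inline int64_t wrap_coord(int64_t c, int64_t n) {
    return ((c % n) + n) % n;
}
```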

Build System Optimizations

  1. Template Specialization: Create separate template specializations for circular vs non-circular operations to eliminate runtime path selection (sketched after this list)
  2. Link-Time Optimization: Enable LTO to optimize across compilation units and reduce function call overhead
  3. Profile-Guided Optimization: Use PGO to optimize branch prediction for the dual-path architecture
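
A sketch of item 1, using a hypothetical sampler rather than the actual ggml-cpu kernel, showing how a bool template parameter hoists the circular/zero-padded decision to compile time:

```cpp
#include <cstdint>

// Sketch: the padding-mode branch is resolved at compile time, so each
// instantiation contains a single execution path. The sampler below is
// a hypothetical stand-in, not the actual ggml-cpu kernel.
template <bool CIRCULAR>
static float sample(const float * src, int64_t x, int64_t y,
                    int64_t w, int64_t h) {
    if constexpr (CIRCULAR) {
        x = ((x % w) + w) % w;   // wrap onto the torus
        y = ((y % h) + h) % h;
    } else if (x < 0 || x >= w || y < 0 || y >= h) {
        return 0.0f;             // zero padding
    }
    return src[y * w + x];
}
// Call sites pick sample<true> or sample<false> once, outside the hot loop.
```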

Memory Management Improvements

  1. Vector Pre-allocation: Pre-allocate vector capacity in GGUF parsing to reduce emplace_back reallocation overhead (sketched after this list)
  2. Memory Pool Usage: Implement memory pools for frequently allocated GGUF objects
  3. Stack-to-Register Migration: Move frequently accessed variables from stack to registers in convolution loops
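
A sketch of item 1; gguf_kv here is a simplified stand-in, and read_key()/read_value() are hypothetical readers:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch: size the vector once before the parse loop so emplace_back
// never reallocates mid-parse.
struct gguf_kv {
    std::string key;
    std::string value;
};

std::string read_key();    // hypothetical stream readers
std::string read_value();

void parse_kv_pairs(std::vector<gguf_kv> & kv, uint64_t n_kv) {
    kv.reserve(kv.size() + n_kv);  // single up-front allocation
    for (uint64_t i = 0; i < n_kv; ++i) {
        kv.emplace_back(gguf_kv{read_key(), read_value()});
    }
}
```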

Performance Impact Assessment

Overall System Impact: The changes introduce localized performance degradation in convolution operations without affecting core inference pipeline functions. The 64% increase in convolution response time primarily impacts models using depthwise convolutions, while standard transformer inference remains unaffected.

Critical Path Analysis: Core LLaMA.cpp inference functions (llama_decode, llama_tokenize, memory management) show no performance regression, maintaining inference throughput for standard language model operations.

@DajanaV DajanaV force-pushed the main branch 20 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 87bfdb3 to a14857a on November 11, 2025 19:07