UPSTREAM PR #18151: ggml-hexagon: gelu optimization by loci-dev · Pull Request #609 · auroralabs-loci/llama.cpp

loci-dev · 2025-12-17T23:35:52Z

Following the discussion regarding the refactor idea here, I have re-evaluated the approach. After reviewing the current implementations of the activation, unary, and binary functions, I observed that the existing code relies heavily on L2 prefetching while under-utilizing the VTCM and DMA.

While L2 prefetching offers some benefits, it is limited by two main factors:

The L2 cache is significantly smaller than the VTCM.
The L2 cache is shared: it backs both the L1 instruction cache and the L1 data cache.

This PR optimizes the code (using GELU as the initial implementation) by shifting the workload onto the VTCM and DMA, thereby freeing the L2 cache to serve the L1 instruction and data caches.

Optimization Strategy

Instead of relying on L2 prefetching, this implementation employs a DMA ping-pong buffering approach:

Fetch: DMA moves data from DDR to VTCM (buffer A).
Compute: DSP processes data in VTCM (buffer B) while DMA fetches the next chunk into buffer A.
Write back: DMA moves processed data from VTCM back to DDR.

This allows for overlapping computation and memory transfer, resulting in significantly higher throughput.
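The loop structure can be sketched in plain C as follows. This is a host-side illustration under stated assumptions, not the code in this PR: `dma_copy` is a hypothetical synchronous stand-in for an asynchronous DMA transfer, the static arrays only model VTCM buffers, `CHUNK` is an arbitrary size, and `relu_scalar` is a placeholder for the real GELU kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256 /* elements per buffer; arbitrary size chosen for the sketch */

/* Hypothetical stand-in for a DMA transfer. On the DSP this would be an
 * asynchronous request; here it is a plain synchronous copy. */
static void dma_copy(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Placeholder compute kernel (ReLU). The real kernel would be GELU. */
static float relu_scalar(float x) {
    return x > 0.0f ? x : 0.0f;
}

/* Applies the kernel to n floats using two alternating input buffers. */
static void unary_pingpong(float *dst, const float *src, size_t n) {
    static float vtcm_in[2][CHUNK]; /* models ping-pong input buffers in VTCM */
    static float vtcm_out[CHUNK];   /* models the output staging buffer */
    size_t nchunks = (n + CHUNK - 1) / CHUNK;

    assert(dst != NULL && src != NULL);
    if (nchunks == 0) {
        return;
    }

    /* Prime the pipeline: fetch the first chunk into buffer 0. */
    dma_copy(vtcm_in[0], src, n < CHUNK ? n : CHUNK);

    for (size_t c = 0; c < nchunks; c++) {
        size_t off = c * CHUNK;
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        int cur = (int)(c & 1);

        /* Fetch: start loading the next chunk into the other buffer. */
        if (c + 1 < nchunks) {
            size_t noff = off + CHUNK;
            size_t nlen = (n - noff < CHUNK) ? (n - noff) : CHUNK;
            dma_copy(vtcm_in[cur ^ 1], src + noff, nlen);
        }

        /* Compute: process the chunk that is already resident. */
        for (size_t i = 0; i < len; i++) {
            vtcm_out[i] = relu_scalar(vtcm_in[cur][i]);
        }

        /* Write back: move the results to the destination. */
        dma_copy(dst + off, vtcm_out, len);
    }
}
```

Because `dma_copy` is synchronous here, nothing actually overlaps in this sketch; it only shows the order of operations. On real hardware the fetch would be issued without blocking and waited on just before the buffer is consumed, and the output would likely need its own pair of buffers so that write-back can overlap with the next compute step.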

Performance Benchmarks

On the two shapes tested, the new path reduces latency by roughly 24% and 30% respectively. Below is a comparison between the existing implementation and the new DMA/VTCM approach:

Dimensions	Old Implementation (L2 Prefetch)	New Implementation (DMA/VTCM)
4096 x 4096	~5000 µs	~3800 µs
4096 x 4304	~5300 µs	~3700 µs
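For reference, the relative gains implied by the table can be computed as below. The helper names are hypothetical, and since the timings above are approximate, the results should be read as rough figures.

```c
#include <assert.h>

/* Percentage by which latency dropped going from old_us to new_us. */
static double latency_reduction_pct(double old_us, double new_us) {
    assert(old_us > 0.0);
    return (old_us - new_us) / old_us * 100.0;
}

/* Speedup factor of the new implementation over the old one. */
static double speedup(double old_us, double new_us) {
    assert(new_us > 0.0);
    return old_us / new_us;
}
```

With the table's values, `latency_reduction_pct(5000, 3800)` is about 24% (a speedup of about 1.32x) and `latency_reduction_pct(5300, 3700)` is about 30% (about 1.43x).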

NOTE: I used GELU as an example, but this approach extends readily to other operations.

Unaligned Load Resolution:
This approach inherently solves the unaligned load issues encountered previously. Since data is fetched from DDR via DMA, the DMA engine ensures that data is stored into aligned addresses within the VTCM, even if the source data in DDR is unaligned.
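A minimal host-side sketch of the idea, with several assumptions: `memcpy` stands in for the DMA transfer, `vtcm_stage` is an ordinary static array that only models a VTCM buffer, `STAGE_BYTES` is an arbitrary size, and the helper names are hypothetical. The 128-byte figure is the HVX vector width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 128    /* HVX vector width in bytes */
#define STAGE_BYTES 1024 /* arbitrary staging size for the sketch */

/* Models a vector-aligned staging buffer in VTCM. */
static _Alignas(VEC_BYTES) uint8_t vtcm_stage[STAGE_BYTES];

/* Returns nonzero if p sits on a vector boundary. */
static int is_vec_aligned(const void *p) {
    return ((uintptr_t)p % VEC_BYTES) == 0;
}

/* Copies n bytes from a possibly unaligned source into the aligned staging
 * buffer. memcpy stands in for the DMA engine. Returns NULL if n does not fit. */
static const uint8_t *stage_from_ddr(const void *src, size_t n) {
    assert(src != NULL);
    if (n > STAGE_BYTES) {
        return NULL;
    }
    memcpy(vtcm_stage, src, n);
    return vtcm_stage;
}
```

Whatever the alignment of `src`, the returned pointer is vector-aligned, so the compute loop can use aligned vector loads on the staged data.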

@max-krasnyansky

loci-dev had a problem deploying to PROD__AL_DEMO December 17, 2025 23:35 — with GitHub Actions Failure

loci-dev force-pushed the main branch 28 times, most recently from f002844 to 25154fc Compare December 21, 2025 21:07

feat: working gelu with src0 put on vtcm

59c8869

loci-dev force-pushed the main branch 30 times, most recently from 799071d to dba3ea5 Compare December 27, 2025 20:09

loci-dev wants to merge 9 commits into main from upstream-PR18151-branch_joeldushouyu-hexagon-gelu-optimization
