
Commit d63964d

Add loci-analysis workflow from overlay
1 parent bd69921 commit d63964d

2 files changed

Lines changed: 110 additions & 0 deletions

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
name: LOCI Analysis
on:
  push:
    branches:
      - loci/main-*
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  loci:
    if: vars.UPSTREAM_REPO != ''
    runs-on: ubuntu-latest

    env:
      LOCI_PROJECT: 'Llama CPP'
      LOCI_API_KEY: '${{ secrets.LOCI_API_KEY }}'
      LOCI_BACKEND_URL: '${{ vars.LOCI_BACKEND_URL }}'
      GH_TOKEN: ${{ secrets.MIRROR_REPOS_WRITE_PAT }}

    environment: ${{ vars.LOCI_ENV || 'PROD__AL_DEMO' }}

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: ${{ (github.event_name == 'pull_request' && github.event.pull_request.head.sha) || github.sha }}

      - name: Compute target
        id: target
        if: github.event_name == 'push'
        run: |
          branch="${{ github.ref_name }}"
          sha="${branch#loci/main-}"
          echo "value=main@${sha}" >> "$GITHUB_OUTPUT"

      - name: Compute base
        id: base
        if: github.event_name == 'pull_request'
        run: |
          git remote add upstream "https://github.com/${{ vars.UPSTREAM_REPO }}.git" 2>/dev/null || true
          git fetch upstream
          upstream_default=$(gh api "repos/${{ vars.UPSTREAM_REPO }}" --jq .default_branch)
          merge_base=$(git merge-base HEAD "upstream/${upstream_default}")
          short_sha="${merge_base:0:7}"
          echo "value=main@${short_sha}" >> "$GITHUB_OUTPUT"

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            cmake \
            build-essential \
            gcc-aarch64-linux-gnu \
            g++-aarch64-linux-gnu \
            libcurl4-openssl-dev

      - name: Create build directory and configure with CMake
        run: |
          mkdir build
          cd build
          cmake .. \
            -DCMAKE_SYSTEM_NAME=Linux \
            -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
            -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
            -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
            -DCMAKE_OSX_SYSROOT= \
            -DCMAKE_OSX_DEPLOYMENT_TARGET= \
            -DBUILD_SHARED_LIBS=ON \
            -DLLAMA_BUILD_TESTS=OFF \
            -DLLAMA_BUILD_EXAMPLES=OFF \
            -DLLAMA_BUILD_SERVER=ON \
            -DLLAMA_BUILD_COMMON=ON \
            -DLLAMA_BUILD_TOOLS=ON \
            -DLLAMA_CURL=OFF \
            -DCMAKE_BUILD_TYPE=Debug \
            -DCMAKE_C_FLAGS="-march=armv8-a -Wl,-Bsymbolic" \
            -DCMAKE_CXX_FLAGS="-march=armv8-a -Wl,-Bsymbolic"

      - name: Build project
        run: |
          cd build
          cmake --build . -j4

      - name: LOCI Upload
        uses: auroralabs-loci/loci-action@v1
        with:
          mode: upload
          binaries: |
            build/bin/libggml.so*
            build/bin/libllama.so*
            build/bin/libggml-cpu.so*
            build/bin/libggml-base.so*
            build/bin/libmtmd.so*
            build/bin/llama-bench
            build/bin/llama-cvector-generator
            build/bin/llama-gemma3-cli
            build/bin/llama-gguf-split
            build/bin/llama-llava-cli
            build/bin/llama-minicpmv-cli
            build/bin/llama-quantize
            build/bin/llama-qwen2vl-cli
            build/bin/llama-run
            build/bin/llama-tokenize
            build/bin/llama-tts
          project: '${{ env.LOCI_PROJECT }}'
          target: ${{ steps.target.outputs.value || '' }}
          base: ${{ steps.base.outputs.value || '' }}
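The two "Compute" steps derive their `main@<sha>` values with plain Bash parameter expansion: a `#`-prefix strip for the push case and a `:0:7` substring for the pull-request case. A minimal sketch outside of Actions, using a hypothetical branch name and merge-base SHA, behaves like this:

```shell
# Push event: strip the "loci/main-" prefix from the branch name
# (branch name here is hypothetical).
branch="loci/main-49bfdde"
sha="${branch#loci/main-}"
echo "target=main@${sha}"   # -> target=main@49bfdde

# Pull-request event: shorten a full merge-base SHA to 7 characters,
# as "${merge_base:0:7}" does in the workflow (SHA is hypothetical).
merge_base="49bfdde1d9aa514d5aca8f7670636fd6af8791e6"
short_sha="${merge_base:0:7}"
echo "base=main@${short_sha}"   # -> base=main@49bfdde
```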

pulls.ndjson

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
{"pull_number":"20644","title":"ggml-cuda: Add NVFP4 dp4a kernel","body":"This PR brings in the initial plumbing for basic CUDA support for NVFP4 - it includes one NVFP4xQ8_1 dp4a kernel. MMA or Blackwell kernels are not included here and were kept out for a separate PR. \r\n\r\n`vec_dot_mma` is is linked up the dp4a kernel so it still runs dp4a even when BLACKWELL_MMA_AVAILABLE.\r\nThere is a branch for NVFP4 in `mmq_write_back_mma` to push it back to the dp4a layout which will be removed when the MMA kernel is wired in.\r\n\r\nIt was tuned to bring as much performance as possible for DP4A. Comparisons below. \r\n**CPU vs DP4A**\r\n\r\n| Model | CPU (pp64) | DP4A (pp64) | Speedup | CPU (tg16) | DP4A (tg16) | Speedup |\r\n|---|---:|---:|---:|---:|---:|---:|\r\n| Qwen3.5-0.8B | 59.63 | 3557.08 | **59.65x** | 32.33 | 329.60 | **10.19x** |\r\n| Qwen3.5-27B | 1.27 | 594.14 | **467.83x** | 1.08 | 52.65 | **48.75x** |\r\n\r\n**pp512 / tg128**\r\n\r\n| Model | pp512 t/s | tg128 t/s |\r\n|---|---:|---:|\r\n| Qwen3-4B | 9031.22 | 271.69 |\r\n| Qwen3-8B | 4909.27 | 175.93 |\r\n| Qwen3.5-27B| 1482.89 | 63.47 |\r\n| Qwen3.5-0.8B | 25596.92 | 388.12 |\r\n| Qwen3.5-0.8B-Q4_K_M | 38339.57 | 521.69 |\r\n\r\nAI assistance was used in refactoring and writing some of this code. Each line has been scrutinized and this was hand-edited to be as neat and as minimal as possible. Test-backend-ops passes; CPU<>GPU parity was tested with a separate tool and is exact across multiple tile sizes, and kld/ppl was verified on several models and is as expected (tested on GPU only, CPU too slow).","pull_head_sha":"1d9aa514d5aca8f7670636fd6af8791e6810e99c","loci_pr_branch":"loci/pr-20644-nvfp4-dp4a","short_merge_base":"49bfdde","loci_main_branch":"loci/main-49bfdde","use_loci_base":0}
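
Each line of `pulls.ndjson` is a standalone JSON object, so consumers need nothing beyond the standard `json` module to read it. A minimal sketch, with a sample record reduced to a few of the field names used above:

```python
import json

def read_ndjson(text):
    """Parse NDJSON: one independent JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Reduced sample record (field names taken from the file above).
sample = ('{"pull_number":"20644","short_merge_base":"49bfdde",'
          '"loci_main_branch":"loci/main-49bfdde"}')
records = read_ndjson(sample)
print(records[0]["loci_main_branch"])  # -> loci/main-49bfdde
```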
