# Add OpenVINO backend (#15307)

### Overview

This PR introduces an [OpenVINO backend](https://docs.openvino.ai/2025/index.html) for `llama.cpp`, enabling hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs**. The backend uses OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, and enables performance improvements via OpenVINO's graph compilation and kernel fusion.

* llama.cpp with OpenVINO backend: [Build Instructions](https://github.com/ravi9/llama.cpp/blob/dev_backend_openvino/docs/build.md#openvino)

### Key Features

* **New backend implementation**
  * Added OpenVINO backend in `ggml/src/ggml-openvino`.
  * Implemented translations for core GGML operations.
* **Supported precisions**
  * FP16/BF16 GGUF models are supported.
  * Q4_0, Q4_1, Q4_K_M, and Q6_K models are partially supported (see notes below).
* **Supported devices**
  * Intel CPUs
  * Intel integrated and discrete GPUs
  * Intel NPUs (requires **UD32+ driver**)

**For NPU: prompt processing is currently slow, so a smaller context size is recommended for better performance, e.g. `-c 512`.**

**For llama-bench: `-fa 1` is required.**

### Tested Models

The following models are validated for functionality. Accuracy and performance testing is work in progress.

* [`Llama-3.2-1B-Instruct-GGUF`](https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF)
* [`Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
* [`microsoft/Phi-3-mini-4k-instruct-gguf`](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)
* [`Qwen/Qwen2.5-1.5B-Instruct-GGUF`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF)
* [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B)
* [`openbmb/MiniCPM-1B-sft-bf16`](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)
* [`tencent/Hunyuan-7B-Instruct`](https://huggingface.co/tencent/Hunyuan-7B-Instruct)
* [`mistralai/Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

### Work in Progress

* Performance and memory optimizations
* Broader quantization coverage
* Support for additional model architectures
* Extensive accuracy testing

### Notes on quantization support

#### CPU
* **Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.**
* Q6_K tensors (6-bit, group size 16, symmetric) are converted to int8, group size 16, symmetric.
* Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8, group size 32, asymmetric.

#### GPU
* **Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.**
* Q6_K tensors (6-bit, group size 16, symmetric) are requantized to int8, group size 32, symmetric.
* Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8, group size 32, asymmetric.

#### NPU
* **The main quantization scheme for the models supported in this PR is Q4_0.**
* Q4_0 and Q4_1 tensors are requantized to int4, group size 128, symmetric.
* Q6_K tensors are dequantized to FP16.

Other notes:
* Both Q4_0 and Q4_1 models use Q6_K for the token-embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as the token embedding).
* Q4_0 models will produce some Q4_1 tensors if an imatrix is provided during quantization with the llama-quantize utility.
* Q4_K_M models additionally contain Q6_K tensors, and Q5_K tensors (only in Phi-3 among the models validated in this PR).

NOTE: Optimum-intel converts the FP16/BF16 token-embedding tensor and the weight tensor in the last matmul to int8, asymmetric, channel-wise ([config code](https://github.com/huggingface/optimum-intel/blob/b60e4d4866509a1aeea2b7a3f26f2a70bc464354/optimum/commands/export/openvino.py#L183-L191)).

---
Head commit: `db976265ce4da1c2bc3cf7bb45fc7ec4d1d02c29` · merge base: `4d828bd`
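To make the group-wise conversions above concrete, here is a minimal, self-contained sketch of symmetric group-wise quantization of the kind the notes describe (e.g. int8 with group size 16): each group of values stores small signed integer codes plus one floating-point scale, with no zero point because the scheme is symmetric. This is an illustration only, not the backend's actual requantization code; the function names and the toy tensor are invented for the example.

```python
# Illustrative sketch (NOT the backend's actual code): symmetric
# group-wise quantization. Each group of `group_size` values is stored
# as signed integer codes plus a single scale; symmetric means the
# zero point is 0, so dequantization is just code * scale.

def quantize_sym(values, group_size=16, bits=8):
    """Quantize a flat list of floats; returns (codes, per-group scales)."""
    qmax = (1 << (bits - 1)) - 1          # 127 for int8, 7 for int4
    codes, scales = [], []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        amax = max(abs(v) for v in group) or 1.0   # guard all-zero groups
        scale = amax / qmax
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

def dequantize_sym(codes, scales, group_size=16):
    """Reconstruct approximate floats from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.05 * k - 0.4 for k in range(32)]       # toy weight tensor
codes, scales = quantize_sym(weights, group_size=16)
restored = dequantize_sym(codes, scales, group_size=16)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per value is bounded by half a quantization step (scale / 2), which is why larger group sizes (as in the NPU's int4 group size 128 scheme) trade a little accuracy for fewer stored scales.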
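For orientation, a typical build-and-run flow might look like the sketch below. The CMake option name, binary paths, and model filename here are assumptions (the option follows the naming pattern of other ggml backends such as `GGML_CUDA`); the linked Build Instructions in the PR are authoritative.

```shell
# Hypothetical sketch -- see the PR's Build Instructions for the real flags.
# -DGGML_OPENVINO=ON is assumed by analogy with other ggml backend options.
cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release -j

# On NPU, use a small context size, since prompt processing is currently slow
# there (model filename is a placeholder):
./build/bin/llama-cli -m Llama-3.2-1B-Instruct.Q4_0.gguf -c 512 -p "Hello"

# llama-bench requires flash attention to be enabled:
./build/bin/llama-bench -m Llama-3.2-1B-Instruct.Q4_0.gguf -fa 1
```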