# Add OpenVINO backend (#15307)

### Overview

This PR introduces an [OpenVINO backend](https://docs.openvino.ai/2025/index.html) for `llama.cpp`, enabling hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs**. The backend uses OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, and enables performance improvements via OpenVINO's graph compilation and kernel fusion.

* llama.cpp with OpenVINO backend: [Build Instructions](https://github.com/ravi9/llama.cpp/blob/dev_backend_openvino/docs/build.md#openvino)

### Key Features

* **New backend implementation**
  * Added OpenVINO backend in `ggml/src/ggml-openvino`.
  * Implemented translations for core GGML operations.
* **Supported precisions**
  * FP16/BF16 GGUF models are supported.
  * Q4_0, Q4_1, Q4_K_M, and Q6_K models are partially supported (see notes below).
* **Supported devices**
  * Intel CPUs
  * Intel integrated and discrete GPUs
  * Intel NPUs (requires **UD32+ driver**)

**For NPU: prompt processing is currently slow, so a smaller context size is recommended for better performance, e.g. `-c 512`.**

**For llama-bench: `-fa 1` is required.**

### Tested Models

The following models are validated for functionality. Accuracy and performance testing is work in progress.

* [`Llama-3.2-1B-Instruct-GGUF`](https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF)
* [`Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
* [`microsoft/Phi-3-mini-4k-instruct-gguf`](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)
* [`Qwen/Qwen2.5-1.5B-Instruct-GGUF`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF)
* [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B)
* [`openbmb/MiniCPM-1B-sft-bf16`](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)
* [`tencent/Hunyuan-7B-Instruct`](https://huggingface.co/tencent/Hunyuan-7B-Instruct)
* [`mistralai/Mistral-7B-Instruct-v0.3`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

### Work in Progress

* Performance and memory optimizations
* Broader quantization coverage
* Support for additional model architectures
* Extensive accuracy testing

### Notes on quantization support

#### CPU
* **Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.**
* Q6_K tensors (6-bit, group size 16, symmetric) are converted to int8, group size 16, symmetric.
* Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8, group size 32, asymmetric.

#### GPU
* **Q4_0, Q4_1, Q4_K_M, and Q6_K models are supported.**
* Q6_K tensors (6-bit, group size 16, symmetric) are requantized to int8, group size 32, symmetric.
* Q5_K tensors (5-bit, group size 32, asymmetric) are converted to int8, group size 32, asymmetric.

#### NPU
* **The main quantization scheme for the models supported in this PR is Q4_0.**
* Q4_0 and Q4_1 tensors are requantized to int4, group size 128, symmetric.
* Q6_K tensors are dequantized to FP16.

Other notes:
* Both Q4_0 and Q4_1 models use Q6_K for the token-embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as the token embedding).
* Q4_0 models will produce some Q4_1 tensors if an imatrix is provided during quantization with the llama-quantize utility.
* Q4_K_M models additionally contain Q6_K tensors, and Q5_K tensors (only in Phi-3 among the models validated in this PR).

NOTE: Optimum-intel converts the FP16/BF16 token-embedding tensor and the weight tensor in the last matmul to int8, asymmetric, channel-wise ([config code](https://github.com/huggingface/optimum-intel/blob/b60e4d4866509a1aeea2b7a3f26f2a70bc464354/optimum/commands/export/openvino.py#L183-L191)).

---
Head commit: `db976265ce4da1c2bc3cf7bb45fc7ec4d1d02c29` · merge base: `4d828bd`
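To make the group-wise conversions above concrete, here is a minimal, self-contained sketch of symmetric group-wise quantization of the kind the notes describe (e.g. int8 with group size 16): each group of values stores small signed integer codes plus one floating-point scale, with no zero point because the scheme is symmetric. This is an illustration only, not the backend's actual requantization code; the function names and the toy tensor are invented for the example.

```python
# Illustrative sketch (NOT the backend's actual code): symmetric
# group-wise quantization. Each group of `group_size` values is stored
# as signed integer codes plus a single scale; symmetric means the
# zero point is 0, so dequantization is just code * scale.

def quantize_sym(values, group_size=16, bits=8):
    """Quantize a flat list of floats; returns (codes, per-group scales)."""
    qmax = (1 << (bits - 1)) - 1          # 127 for int8, 7 for int4
    codes, scales = [], []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        amax = max(abs(v) for v in group) or 1.0   # guard all-zero groups
        scale = amax / qmax
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

def dequantize_sym(codes, scales, group_size=16):
    """Reconstruct approximate floats from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.05 * k - 0.4 for k in range(32)]       # toy weight tensor
codes, scales = quantize_sym(weights, group_size=16)
restored = dequantize_sym(codes, scales, group_size=16)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per value is bounded by half a quantization step (scale / 2), which is why larger group sizes (as in the NPU's int4 group size 128 scheme) trade a little accuracy for fewer stored scales.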
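For orientation, a typical build-and-run flow might look like the sketch below. The CMake option name, binary paths, and model filename here are assumptions (the option follows the naming pattern of other ggml backends such as `GGML_CUDA`); the linked Build Instructions in the PR are authoritative.

```shell
# Hypothetical sketch -- see the PR's Build Instructions for the real flags.
# -DGGML_OPENVINO=ON is assumed by analogy with other ggml backend options.
cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release -j

# On NPU, use a small context size, since prompt processing is currently slow
# there (model filename is a placeholder):
./build/bin/llama-cli -m Llama-3.2-1B-Instruct.Q4_0.gguf -c 512 -p "Hello"

# llama-bench requires flash attention to be enabled:
./build/bin/llama-bench -m Llama-3.2-1B-Instruct.Q4_0.gguf -fa 1
```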