UPSTREAM PR #17318: ggml-cpu: extend support for RVV floating-point kernels (#318)
Conversation
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
Performance Analysis Summary (PR #318)

Overview: PR #318 introduces RISC-V Vector (RVV) optimized implementations for FP16 and BF16 floating-point operations in the GGML CPU backend. The changes add vectorized kernels for six functions previously using scalar fallbacks on RISC-V platforms. Analysis shows zero measurable performance impact on the x86_64 baseline build, as these optimizations are RISC-V-specific and conditionally compiled.
Force-pushed from 92ef8cd to 7dd50b8.
Performance Analysis Summary (PR #318)

Overview: PR #318 introduces RISC-V Vector (RVV) optimizations for six low-level floating-point kernels in the GGML backend. The changes add vectorized implementations for FP16/BF16 conversion and arithmetic operations, targeting RISC-V platforms with specific vector extensions.

Code Changes Analysis: The modifications implement hardware-accelerated vector operations through RVV intrinsics.
All functions follow a consistent pattern: pre-calculate aligned boundaries, process the main loop with 2x unrolling, and handle leftover elements separately.

Key Findings

Performance-Critical Functions: The performance improvements seen in the analysis (90-99% in chat parsing, 98% in graph building) are not caused by these RVV changes. The analyzed binaries are ARM64-based, while PR #318 targets RISC-V platforms exclusively through conditional compilation; on non-RISC-V platforms, these changes have zero impact.
None of the modified functions are llama_decode, llama_encode, or llama_tokenize; they are low-level GGML primitives called indirectly through the computation graph.

Tokens Per Second Impact: On the analyzed ARM64 platform there is zero impact, since conditional compilation ensures the RVV code paths are neither compiled nor executed. On RISC-V platforms (theoretical), the modified functions sit several layers below the inference entry points, so for a model like smollm:135m on RISC-V hardware any gains would be indirect.
No direct changes are made to tokenization functions; llama_tokenize is unaffected.

Power Consumption Analysis: The power consumption improvements observed (10-23% reduction across binaries, including libllama.so at 16.6%, libmtmd.so at 23%, and llama-run at 22%) are not attributable to PR #318; these measurements reflect ARM64 binaries where the RVV code is not active. On RISC-V platforms, vectorized operations typically reduce power consumption by 10-20% for compute-bound operations, due to fewer instructions and better hardware utilization.
Impacted binaries (RISC-V only): all binaries linking libggml-cpu.so that perform tensor operations.

Platform Specificity: This PR is an architecture-specific optimization with no cross-platform effects, gated entirely by conditional compilation.
Force-pushed from 96dc574 to 9a74048.

Force-pushed from 738bfbf to f01b714.
Mirrored from ggml-org/llama.cpp#17318
This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in PR #15075, adding new kernels.
Summary
Added a BF16 RVV flag to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension.

Newly Added Kernels
Testing
Kernels were functionally tested on QEMU at VLENs of 128, 256, 512, and 1024 bits, across a range of input sizes.
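Such a QEMU run can be reproduced roughly as follows. This is an assumed invocation, not taken from the PR: the CPU property names (`v`, `vlen`, `elen`, `vext_spec`) are QEMU's RISC-V user-mode options, availability of the Zvfh/Zvfbfwma properties depends on the QEMU version, and the test binary path is illustrative.

```shell
# Emulate RV64 with the vector extension at VLEN=256 and run a test
# binary under user-mode QEMU; repeat with vlen=128/512/1024 to cover
# the VLENs mentioned above.
qemu-riscv64 -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 \
    ./bin/test-backend-ops
```

Sweeping `vlen` across runs is what checks that the vsetvl-based loops are VLEN-agnostic rather than tuned to one register width.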