[cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack #25948
Conversation
Code Review
This pull request introduces performance optimizations for unquantized linear operations on AArch64 by integrating oneDNN with the Arm Compute Library (ACL). The approach is sound and the implementation is mostly correct, aligning well with the goals outlined in the description. However, I've identified a critical compilation error due to a misplaced preprocessor directive and an extraneous backup file that should be removed. Addressing these issues will ensure the correctness and cleanliness of the codebase.
@bigPYJ1151 could you please have a look at this?

@fadara01 Sorry for the late reply. I will check it later as I am out of office these days.

@bigPYJ1151 thanks for your response. We're raising a separate PR soon to do this, but we're not blocked on that for this PR.

Also, note that if you do not build oneDNN with ACL, you will just go down the current default path for linears (PyTorch), thanks to the `is_onednn_acl_supported` check.
bigPYJ1151 left a comment:
Thanks for the effort. Better to remove `is_onednn_acl_supported` and build oneDNN with ACL by default for ARM in follow-up PRs.
```python
other_cmake_args = os.environ.get("CMAKE_ARGS")
if other_cmake_args:
    cmake_args += other_cmake_args.split()
```
Why is this change required? It looks like no arg is added to CMAKE_ARGS in this PR.
To build oneDNN with ACL you currently need to set `CMAKE_ARGS="-DVLLM_BUILD_ACL=ON"`.
I agree that how one would build vLLM's oneDNN with ACL is not well documented.
A new PR is coming soon to build oneDNN with ACL by default.
Will do, thanks a lot for taking the time to review this :)
A follow-up commit referenced this pull request: PR vllm-project#25948 accelerated unquantized linears for AArch64 through oneDNN/ACL and weight prepack, but it relied on the user building Arm Compute Library (ACL) themselves and setting `VLLM_BUILD_ACL` when building vLLM. Most users don't know about this, and as a result they miss out on the optimizations delivered by vllm-project#25948. That follow-up builds ACL as oneDNN's backend by default on AArch64 and allows for weight prepack and dispatch to oneDNN/ACL out of the box. Co-authored-by: Michael Yang <[email protected]> Signed-off-by: Fadi Arafeh <[email protected]>
Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack
#24150 introduced weight prepack and a direct oneDNN path for linear ops.
This path is not currently active on AArch64, i.e. linears are dispatched to PyTorch, and as a result we have to pack weights each time a torch linear op is executed.
This PR enables weight prepack and dispatches non-quantized linear ops to oneDNN only if oneDNN was built with the Arm Compute Library (ACL) as its backend. If oneDNN was built without ACL, we follow the current path where linears go through PyTorch, as this is still much faster than oneDNN without ACL.
I had to make the following changes to the current oneDNN matmul path to make it compatible with ACL (a sketch of the bias handling follows this list):
- oneDNN/ACL matmul does not support runtime dimensions -> pass a default M=128 and input stride=K when creating the matmul primitive descriptor.
- oneDNN/ACL matmul does not support passing a bias -> c = matmul(a, b) + bias is handled as c = bias; c += matmul(a, b) by attaching a fused sum post-op to the matmul primitive.
- oneDNN/ACL matmul does not support non-contiguous source tensors -> we make sure that source tensors are contiguous.
- The oneDNN/ACL matmul API allows the weight format to change when the input dimensions change, so we now check at execute time whether we need to pack again. Note that the ACL weight format does not tend to change in practice, so this won't be a performance issue; we had to add this check because the API allows such format changes.
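To illustrate the bias trick, here is a minimal, hypothetical oneDNN (C++) sketch of creating a matmul primitive with a fused sum post-op. The fixed M=128 mirrors the PR's workaround for the missing runtime-dimension support; the f32 data type, the sizes of `K`/`N`, and the function name are illustrative assumptions, not the PR's actual code:

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Assumed problem sizes for illustration; the PR fixes M=128 because
// oneDNN/ACL matmul cannot use runtime (DNNL_RUNTIME_DIM_VAL) dimensions.
const memory::dim M = 128, K = 4096, N = 4096;

matmul make_acl_matmul(const engine &eng) {
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    // format_tag::any lets oneDNN/ACL choose a packed (prepacked) weight layout.
    memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // ACL matmul cannot take a bias argument, so c = matmul(a, b) + bias is
    // emulated as: pre-fill c with the broadcast bias, then let the matmul
    // accumulate into it (c += matmul(a, b)) via a fused sum post-op.
    post_ops po;
    po.append_sum(1.0f);
    primitive_attr attr;
    attr.set_post_ops(po);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    return matmul(pd);
}
```

At execute time, the destination buffer is filled with the bias rows before the primitive runs, and the weight layout queried from the primitive descriptor (`pd.weights_desc()`) can be compared against the cached packed weights to decide whether a repack is needed.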
This PR also ensures that the existing CMake arg for building oneDNN with the ACL backend (`VLLM_BUILD_ACL`) is not discarded by setup.py.
To use this fast path you need to clone and build ACL:
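The exact build commands were omitted here; the following is one plausible recipe, where the repository URL is the upstream Arm Compute Library and the scons flags (in particular `fixed_format_kernels=1`, which oneDNN's ACL integration relies on) are assumptions to verify against the ACL and oneDNN docs for your toolchain:

```
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary
scons -j"$(nproc)" Werror=0 debug=0 neon=1 os=linux arch=armv8-a \
    build=native multi_isa=1 fixed_format_kernels=1 openmp=1
export ACL_ROOT_DIR="$(pwd)"   # oneDNN's CMake build locates ACL via this variable
```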
then build vllm with ACL as oneDNN's backend:
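The original command was likewise omitted; combining the `CMAKE_ARGS` setting mentioned in the review thread with a standard from-source CPU build, the invocation presumably looks something like this (the pip flags are illustrative, not taken from this PR):

```
export ACL_ROOT_DIR=/path/to/ComputeLibrary
VLLM_TARGET_DEVICE=cpu CMAKE_ARGS="-DVLLM_BUILD_ACL=ON" \
    pip install -e . --no-build-isolation
```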
Performance:
On 16 Neoverse-V2 cores, this PR yields ~78% higher throughput (with oneDNN built with the ACL backend) than the current default path in the throughput benchmark for meta-llama/Llama-3.1-8B-Instruct, executed as follows:
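```
LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1" \
VLLM_TARGET_DEVICE=cpu VLLM_CPU_KVCACHE_SPACE=32 taskset -c 0-15 \
    vllm bench throughput --num-prompts 64 --seed 0 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json \
    --max-model-len 4096 --model meta-llama/Llama-3.1-8B-Instruct
```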
Future PRs will look into building oneDNN with the ACL backend by default where appropriate.
Test Plan
`tests/kernels/test_onednn.py` exercises the oneDNN path for linear ops.
Test Result
All tests pass with my changes when oneDNN is built both with and without the ACL backend (on Neoverse-V2).