
Conversation


bigPYJ1151 (Member) commented Sep 3, 2025

Purpose

  • Enable weight prepack for CPU unquantized linear

main (baseline)

============ Serving Benchmark Result ============
Successful requests:                     64        
Maximum request concurrency:             16        
Benchmark duration (s):                  240.59    
Total input tokens:                      65472     
Total generated tokens:                  65536     
Request throughput (req/s):              0.27      
Output token throughput (tok/s):         272.39    
Total Token throughput (tok/s):          544.52    
---------------Time to First Token----------------
Mean TTFT (ms):                          3250.03   
Median TTFT (ms):                        3793.30   
P99 TTFT (ms):                           4976.67   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          55.60     
Median TPOT (ms):                        55.08     
P99 TPOT (ms):                           57.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.60     
Median ITL (ms):                         53.65     
P99 ITL (ms):                            58.88     
==================================================

this PR

============ Serving Benchmark Result ============
Successful requests:                     64        
Maximum request concurrency:             16        
Benchmark duration (s):                  199.05    
Total input tokens:                      65472     
Total generated tokens:                  65536     
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         329.24    
Total Token throughput (tok/s):          658.17    
---------------Time to First Token----------------
Mean TTFT (ms):                          3185.62   
Median TTFT (ms):                        2737.65   
P99 TTFT (ms):                           4514.55   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.51     
Median TPOT (ms):                        46.10     
P99 TPOT (ms):                           46.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.51     
Median ITL (ms):                         44.16     
P99 ITL (ms):                            47.66     
==================================================

Test Plan

Unit tests

Test Result




gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the CPU unquantized linear operations to enable weight prepacking using oneDNN, which yields significant performance improvements as shown in the benchmarks. The changes are well-structured and introduce a DNNLScratchPadManager to handle memory for oneDNN primitives.

My review focuses on memory management within the new DNNLScratchPadManager. I've identified critical memory leaks that need to be addressed. Specifically, the manager leaks memory upon reallocation and lacks a destructor to free its buffer upon program exit.

Comment on lines +29 to +35
void DNNLScratchPadManager::realloc(size_t new_size) {
  new_size = round(new_size);
  if (new_size > size_) {
    ptr_ = std::aligned_alloc(64, new_size);
    size_ = new_size;
  }
}

critical

This realloc implementation has a memory leak. When a larger buffer is needed, it allocates a new one but doesn't free the old buffer pointed to by ptr_.

Additionally, std::aligned_alloc can return nullptr on allocation failure, which is not handled and could lead to crashes.

Please ensure the old buffer is freed and that allocation failures are handled. It's safer to allocate the new buffer first before freeing the old one.

Suggested change
void DNNLScratchPadManager::realloc(size_t new_size) {
  new_size = round(new_size);
  if (new_size > size_) {
    void* new_ptr = std::aligned_alloc(64, new_size);
    if (!new_ptr) {
      throw std::bad_alloc();
    }
    if (ptr_) {
      std::free(ptr_);
    }
    ptr_ = new_ptr;
    size_ = new_size;
  }
}

static DNNLScratchPadManager* get_dnnl_scratchpad_manager();

DNNLScratchPadManager();


critical

The DNNLScratchPadManager class allocates memory via std::aligned_alloc and manages it with the raw pointer ptr_, but it's missing a destructor. This will cause a memory leak when the static manager instance in get_dnnl_scratchpad_manager is destroyed at program exit.

To fix this, please add a destructor that frees the allocated memory.

Suggested change
~DNNLScratchPadManager() {
  if (ptr_) {
    std::free(ptr_);
  }
}
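
One way to address both findings at once is to let a smart pointer own the buffer so it is released both on reallocation and at destruction. The following is a non-authoritative sketch, not the PR's actual fix; the FreeDeleter helper, the get() accessor, and the 64-byte rounding in round() are assumptions for illustration only.

```cpp
#include <cstdlib>
#include <memory>
#include <new>

// Sketch only: the scratchpad buffer is owned by a unique_ptr whose deleter
// calls std::free, so the previous buffer is released on reallocation and the
// final one at destruction, with no explicit destructor needed.
struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};

class DNNLScratchPadManager {
 public:
  void realloc(size_t new_size) {
    new_size = round(new_size);  // std::aligned_alloc requires a multiple of the alignment
    if (new_size > size_) {
      void* new_ptr = std::aligned_alloc(64, new_size);
      if (!new_ptr) {
        throw std::bad_alloc();
      }
      ptr_.reset(new_ptr);  // frees the previous buffer, if any
      size_ = new_size;
    }
  }

  void* get() const { return ptr_.get(); }

 private:
  // Assumption: round up to a 64-byte allocation granularity.
  static size_t round(size_t s) { return (s + 63) & ~static_cast<size_t>(63); }

  size_t size_ = 0;
  std::unique_ptr<void, FreeDeleter> ptr_;
};
```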

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) September 4, 2025 03:27
@bigPYJ1151 bigPYJ1151 disabled auto-merge September 4, 2025 04:34
@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 4, 2025
@bigPYJ1151 bigPYJ1151 merged commit 57b1ce9 into vllm-project:main Sep 4, 2025
80 of 81 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
fadara01 added a commit to fadara01/vllm that referenced this pull request Sep 30, 2025
…ACL and weight prepack

vllm-project#24150 introduced weight prepack and a direct oneDNN path for linear ops. This path is not currently active for AArch64, i.e. linears are dispatched to PyTorch and, as a result, we have to pack weights each time a torch.linear op is executed.

This PR enables weight prepack and dispatches non-quantized linear ops to oneDNN only if oneDNN was built with Arm Compute Library (ACL) as its backend.
If oneDNN was built without ACL, we keep the current behavior where linears go through the PyTorch path, as this is still much faster than oneDNN without ACL.

I had to make the following changes to the current oneDNN matmul path to make it compatible with ACL (an illustrative sketch of the bias handling follows this list):
- oneDNN/ACL matmul does not support runtime dimensions -> pass a default M=128 and input stride=K when creating the matmul primitive descriptor
- oneDNN/ACL matmul does not support passing a bias -> c=matmul(a,b)+bias is handled as c=bias; c+=matmul(a,b) by attaching a fused sum post-op to the matmul primitive
- oneDNN/ACL matmul does not support non-contiguous source tensors -> we make sure that source tensors are contiguous
- the oneDNN/ACL matmul API allows the weight format to change when the input dimensions change, so we now check at execute time whether we need to pack again. Note that the ACL weight format does not tend to change in practice, so this won't be a performance issue; the check is only needed because the API allows such weight format changes.
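
The following is a minimal, non-authoritative sketch of that bias workaround using the oneDNN v3 C++ API. It is not the code from this PR; the function name, f32 data type, and memory descriptors are assumptions for illustration. The bias is first broadcast into the destination buffer, and a fused sum post-op makes the matmul accumulate into it.

```cpp
#include <oneapi/dnnl/dnnl.hpp>

// Sketch: build a matmul primitive whose fused "sum" post-op turns
//   c = matmul(a, b) + bias   into   c = bias; c += matmul(a, b)
dnnl::matmul make_matmul_with_sum(const dnnl::engine& eng,
                                  const dnnl::memory::desc& src_md,
                                  const dnnl::memory::desc& wei_md,
                                  const dnnl::memory::desc& dst_md) {
  dnnl::post_ops ops;
  ops.append_sum(1.0f);  // dst := 1.0 * dst + matmul(src, weights)
  dnnl::primitive_attr attr;
  attr.set_post_ops(ops);
  auto pd = dnnl::matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
  return dnnl::matmul(pd);
}

// Usage: write the row-broadcast bias into dst_mem, then execute:
//   prim.execute(strm, {{DNNL_ARG_SRC, src_mem},
//                       {DNNL_ARG_WEIGHTS, wei_mem},
//                       {DNNL_ARG_DST, dst_mem}});
```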

This PR also ensures that the existing cmake arg for building oneDNN with the ACL backend (VLLM_BUILD_ACL) is not discarded by setup.py.

Test Plan:
tests/kernels/test_onednn.py exercises the oneDNN path for linear ops. All tests pass with my changes when oneDNN is built with or without the ACL backend.

Performance:
On 16 Neoverse-V2 cores, this PR yields ~78% higher throughput (with oneDNN built with the ACL backend) than the current default path in the throughput benchmark for meta-llama/Llama-3.1-8B-Instruct, executed as follows:
```
LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1" VLLM_TARGET_DEVICE=cpu VLLM_CPU_KVCACHE_SPACE=32 taskset -c 0-15 vllm bench throughput --num-prompts 64 --seed 0 \
       --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json --max-model-len 4096 --model meta-llama/Llama-3.1-8B-Instruct
```
Future PRs will look into building oneDNN with the ACL backend by default where appropriate.