cpu: x64: fix perf issue for f32 conv #4107

Merged
xuxinzen merged 1 commit into main from xzeng/fixup_f32_perf
Oct 13, 2025

Contributor

@xuxinzen xuxinzen commented Oct 10, 2025

Partially fixes MFDNN-14037, MFDNN-14054

This patch intends to fix the performance issue in f32 conv.
brgconv_f32_perf.xlsx

@xuxinzen xuxinzen requested a review from a team as a code owner October 10, 2025 15:58
@xuxinzen xuxinzen added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Oct 10, 2025
Comment thread src/cpu/x64/brgemm/jit_brgemm_kernel.cpp Outdated
// save its content.
const dim_t max_prefetch_offset = B_offset(ld_block2 - 1, rd_loop - 1)
+ static_cast<dim_t>(brg.LDB) * brg.rd_block * brg.typesize_B;
if (max_prefetch_offset > INT_MAX) reg_aux_C.save();
Contributor

@densamoilov densamoilov Oct 10, 2025

If the regression is related to introducing safe prefetching for prefetch offsets greater than INT_MAX, then the main contributor to the degradation could be register spilling.

Have you tried moving this spill to a higher level, for example to ld_loop_body, or even higher?

auto ld_loop_body = [&](dim_t vpad, bool last_bdb) {
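
For context on the `INT_MAX` check in the hunk under review: an x86-64 memory operand encodes its displacement as a signed 32-bit value, so a byte offset above `INT_MAX` cannot be used directly as a displacement, which is presumably why the kernel needs an extra register in that case. The sketch below mirrors the check as standalone C++. It is only an illustration: `brg_sketch_t`, `B_offset_sketch`, and `needs_safe_prefetch` are hypothetical names, and the offset formula is an assumed simplification, not the real `B_offset` from `jit_brgemm_kernel.cpp`.

```cpp
#include <climits>
#include <cstdint>

using dim_t = int64_t;

// Hypothetical subset of the brgemm descriptor fields used by the check.
struct brg_sketch_t {
    dim_t LDB;        // leading dimension of B, in elements
    dim_t rd_block;   // reduction block size
    dim_t ld_block;   // load block size, in elements (assumed meaning)
    dim_t typesize_B; // bytes per element of B
};

// ASSUMPTION: a simplified row-major byte offset into B. The real B_offset
// in the kernel is not shown in this thread and may differ.
inline dim_t B_offset_sketch(const brg_sketch_t &brg, dim_t ld, dim_t rd) {
    return brg.typesize_B * (rd * brg.LDB + ld * brg.ld_block);
}

// Returns true when the largest prefetch offset would not fit into a signed
// 32-bit displacement, i.e. when the guarded prefetch path would be needed.
// All arithmetic is done in 64 bits so the comparison itself cannot overflow.
inline bool needs_safe_prefetch(
        const brg_sketch_t &brg, dim_t ld_block2, dim_t rd_loop) {
    const dim_t max_prefetch_offset
            = B_offset_sketch(brg, ld_block2 - 1, rd_loop - 1)
            + brg.LDB * brg.rd_block * brg.typesize_B;
    return max_prefetch_offset > INT_MAX;
}
```

Under these assumptions, a small f32 problem (for example `LDB = 64`) stays far below the limit, while a very large leading dimension (for example `LDB = 2^26` with `rd_block = 16`) pushes the offset past `INT_MAX`. Evaluating the condition once, at code-generation time, is what makes it possible to keep the fast path free of any spill.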

Contributor Author

Yes, I tried moving the spill to a higher level, but the perf gap was still there.

Spill in ldb_loop():
export OMP_NUM_THREADS=56; export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND=close; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
oneDNN implementation: brg_conv_fwd:avx512_core
run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,100.7,4.60156,4164.45,4.69376,4082.65
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_conv_fwd:avx512_core : 1 (100%)                      |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.60156 avg(ms):4.69376
Disable safe prefetching when offset <= INT_MAX:
export OMP_NUM_THREADS=56; export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND=close; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
oneDNN implementation: brg_conv_fwd:avx512_core
run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,3.57324,4.14893,4618.78,4.25633,4502.23
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_conv_fwd:avx512_core : 1 (100%)                      |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.14893 avg(ms):4.25633
total: 3.13s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.06s (2%); execute: 0.01s (0%);

Contributor

I see. Btw, if you collected the performance data before the fix, the numbers could be skewed, because we skipped some prefetch blocks.

@xuxinzen xuxinzen force-pushed the xzeng/fixup_f32_perf branch from 28ab7d6 to fd8cde8 Compare October 10, 2025 20:41
@xuxinzen
Contributor Author

make test

@xuxinzen xuxinzen merged commit 7aed66d into main Oct 13, 2025
21 of 22 checks passed
@xuxinzen xuxinzen deleted the xzeng/fixup_f32_perf branch October 13, 2025 21:48
3 participants