cpu: x64: fix perf issue for f32 conv by xuxinzen · Pull Request #4107 · uxlfoundation/oneDNN

xuxinzen · 2025-10-10T15:58:36Z

Partially fixes MFDNN-14037, MFDNN-14054

This patch intends to fix the performance issue in f32 conv.
brgconv_f32_perf.xlsx

densamoilov · 2025-10-10T16:26:34Z

    // save its content.
    const dim_t max_prefetch_offset = B_offset(ld_block2 - 1, rd_loop - 1)
            + static_cast<dim_t>(brg.LDB) * brg.rd_block * brg.typesize_B;
    if (max_prefetch_offset > INT_MAX) reg_aux_C.save();


If the regression is related to introducing safe prefetching for prefetch offsets greater than INT_MAX, then the main contributor to the degradation could be register spilling.

Have you tried moving this spill to a higher level, for example to ld_loop_body, or even higher?

oneDNN/src/cpu/x64/brgemm/jit_brgemm_kernel.cpp, line 2606 in 48872b1:

    auto ld_loop_body = [&](dim_t vpad, bool last_bdb) {
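For readers following the thread, the INT_MAX check in the quoted snippet can be sketched as a standalone function. This is only an illustration under stated assumptions: `brg_sketch_t`, its `ld_block` field, the body of `B_offset`, and `needs_safe_prefetch` are hypothetical stand-ins and not the kernel's actual definitions. The underlying constraint is that x86-64 memory operands encode displacements as signed 32-bit values, so a byte offset above INT_MAX cannot be used directly as a displacement.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical stand-in for the brgemm descriptor fields used by the check.
struct brg_sketch_t {
    int64_t LDB;        // leading dimension of B, in elements
    int64_t rd_block;   // reduction-dimension block size
    int64_t typesize_B; // bytes per element of B
    int64_t ld_block;   // elements per load block (assumed field)
};

// Hypothetical B_offset: byte offset of the B block at (ld, rd).
// The real kernel's layout may differ; this is an assumed row-major form.
inline int64_t B_offset(const brg_sketch_t &brg, int64_t ld, int64_t rd) {
    return brg.typesize_B * (rd * brg.LDB + ld * brg.ld_block);
}

// Mirrors the quoted condition: the largest prefetch offset is the offset of
// the last block plus one further rd_block worth of rows of B.
inline bool needs_safe_prefetch(
        const brg_sketch_t &brg, int64_t ld_block2, int64_t rd_loop) {
    const int64_t max_prefetch_offset
            = B_offset(brg, ld_block2 - 1, rd_loop - 1)
            + brg.LDB * brg.rd_block * brg.typesize_B;
    // Above INT_MAX the offset no longer fits a signed 32-bit displacement.
    return max_prefetch_offset > INT_MAX;
}
```

With small shapes the function returns false and no spill would be needed; only very large leading dimensions push the offset past INT_MAX, which is why a check evaluated once at a higher level is a natural thing to try.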

Yes, I tried moving the spill to a higher level, but the perf gap was still there.

Spill in ldb_loop():

    export OMP_NUM_THREADS=56 ;export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND close ; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
    create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
    oneDNN implementation: brg_conv_fwd:avx512_core
    run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
    Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
    perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,100.7,4.60156,4164.45,4.69376,4082.65
    ============================================================
    = Implementation statistics (--summary=no-impl to disable) =
    ============================================================
    | brg_conv_fwd:avx512_core : 1 (100%) |
    ============================================================
    tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
    total perf: min(ms):4.60156 avg(ms):4.69376

Disable safe prefetching when offset <= INT_MAX:

    export OMP_NUM_THREADS=56 ;export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND close ; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
    create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
    oneDNN implementation: brg_conv_fwd:avx512_core
    run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
    Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
    perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,3.57324,4.14893,4618.78,4.25633,4502.23
    ============================================================
    = Implementation statistics (--summary=no-impl to disable) =
    ============================================================
    | brg_conv_fwd:avx512_core : 1 (100%) |
    ============================================================
    tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
    total perf: min(ms):4.14893 avg(ms):4.25633
    total: 3.13s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.06s (2%); execute: 0.01s (0%);
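To put a number on the gap between the two runs above, the times can be compared directly. The helpers below are a small sketch, not part of benchdnn; the function names are made up for illustration, and the Gflops formula (Gops divided by time in seconds) is an assumption that happens to match the figures in the logs.

```cpp
// Relative slowdown of the slower run versus the faster run, in percent.
inline double slowdown_pct(double slow_ms, double fast_ms) {
    return (slow_ms - fast_ms) / fast_ms * 100.0;
}

// Throughput from an operation count in Gops and a time in milliseconds.
// Assumed relation: Gflops = Gops / seconds.
inline double gflops(double gops, double time_ms) {
    return gops / (time_ms * 1e-3);
}
```

Plugging in the logged values, `slowdown_pct(4.60156, 4.14893)` gives roughly 11% on minimum time and `slowdown_pct(4.69376, 4.25633)` roughly 10% on average time, so moving the spill to ldb_loop() still leaves about a tenth of the performance on the table for this shape.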

I see. Btw, if you collected the performance data before the fix then it could affect the data because we skipped some prefetch blocks.

xuxinzen · 2025-10-10T20:41:49Z

make test

xuxinzen requested a review from a team as a code owner October 10, 2025 15:58

xuxinzen added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Oct 10, 2025

This was referenced Oct 10, 2025

Backport: cpu: x64: fix perf issue for f32 conv #4108

Merged

Backport: cpu: x64: fix perf issue for f32 conv #4109

Merged

densamoilov reviewed Oct 10, 2025

Comment thread src/cpu/x64/brgemm/jit_brgemm_kernel.cpp Outdated

densamoilov reviewed Oct 10, 2025

cpu: x64: fix perf issue for f32 conv

fd8cde8

xuxinzen force-pushed the xzeng/fixup_f32_perf branch from 28ab7d6 to fd8cde8 Compare October 10, 2025 20:41

tczeszun approved these changes Oct 13, 2025

densamoilov approved these changes Oct 13, 2025

xuxinzen merged commit 7aed66d into main Oct 13, 2025
21 of 22 checks passed

xuxinzen deleted the xzeng/fixup_f32_perf branch October 13, 2025 21:48

dzarukin mentioned this pull request Oct 19, 2025

Backport: cpu: x64: fix perf issue for f32 conv (#4108) #4165

Merged

xuxinzen merged 1 commit into main from xzeng/fixup_f32_perf

xuxinzen commented Oct 10, 2025 •

edited

Loading

Uh oh!

Uh oh!

densamoilov Oct 10, 2025 •

edited

Loading

Uh oh!

xuxinzen Oct 10, 2025

Uh oh!

densamoilov Oct 13, 2025

Uh oh!

xuxinzen commented Oct 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xuxinzen commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

densamoilov Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xuxinzen Oct 10, 2025

Choose a reason for hiding this comment

Uh oh!

densamoilov Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

xuxinzen commented Oct 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xuxinzen commented Oct 10, 2025 •

edited

Loading

densamoilov Oct 10, 2025 •

edited

Loading