Merged
This was referenced Oct 10, 2025
densamoilov reviewed Oct 10, 2025
// save its content.
const dim_t max_prefetch_offset = B_offset(ld_block2 - 1, rd_loop - 1)
        + static_cast<dim_t>(brg.LDB) * brg.rd_block * brg.typesize_B;
if (max_prefetch_offset > INT_MAX) reg_aux_C.save();
Contributor
If the regression comes from introducing safe prefetching for prefetch offsets greater than INT_MAX, then the main source of the degradation could be register spilling.
Have you tried moving this spill to a higher level, for example into ld_loop_body, or even higher?
oneDNN/src/cpu/x64/brgemm/jit_brgemm_kernel.cpp (line 2606 in 48872b1)
Contributor, Author
Yes, I tried moving the spill to a higher level, but the performance gap was still there.
Spill moved to ldb_loop():
export OMP_NUM_THREADS=56; export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND=close; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
oneDNN implementation: brg_conv_fwd:avx512_core
run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,100.7,4.60156,4164.45,4.69376,4082.65
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_conv_fwd:avx512_core : 1 (100%) |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.60156 avg(ms):4.69376
Safe prefetching disabled when the offset is <= INT_MAX:
export OMP_NUM_THREADS=56; export KMP_AFFINITY=granularity=fine,compact; export OMP_PROC_BIND=close; numactl -m 0 -N 0 ./tests/benchdnn/benchdnn -v5 --conv --dir=FWD_I --mode=p mb21_ic256oc128_ih40oh40kh3sh1dh0ph1_iw40ow40kw3sw1dw0pw1
create: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
oneDNN implementation: brg_conv_fwd:avx512_core
run: --mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brg_conv_fwd:avx512_core,,--mode=P --conv --dir=FWD_I mb21ic256ih40oc128oh40kh3ph1,19.163,3.57324,4.14893,4618.78,4.25633,4502.23
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_conv_fwd:avx512_core : 1 (100%) |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):4.14893 avg(ms):4.25633
total: 3.13s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.06s (2%); execute: 0.01s (0%);
Contributor
I see. By the way, if you collected the performance data before the fix, the numbers may be affected, because some prefetch blocks were skipped.
Branch updated from 28ab7d6 to fd8cde8
Contributor, Author
make test
tczeszun approved these changes Oct 13, 2025
densamoilov approved these changes Oct 13, 2025
Partially fixes MFDNN-14037, MFDNN-14054
This patch aims to fix the performance regression in f32 convolution.
brgconv_f32_perf.xlsx