
Conversation

christiangnrd (Member) commented Sep 5, 2025

Do not squash.

My attempt to fix the performance regression I mention in #2869 (comment). I recalculate the optimal number of threads and, if the new thread count requires more blocks, resize partial accordingly. Not quite sure how this could be tested.
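
For context, here's a minimal, self-contained sketch of the @cuda launch=false / launch_configuration pattern the reduction code relies on to pick threads and blocks; the vadd! kernel and array sizes are made up for illustration, this is not the PR code.

```julia
using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(Float32, 2^20); b = CUDA.rand(Float32, 2^20); c = similar(a)

# Compile without launching, then query the occupancy API for a good configuration;
# this is the same mechanism mapreducedim! uses to (re)compute threads and blocks.
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```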

Benchmarks, in order: before, this commit, and after the removal of the block optimization (current). They're in this order because I wanted to make it easier to see that there's a slight performance regression compared to the old version, but if the new code is correct, that's still an improvement over the current version.
[screenshot: benchmark comparison]

github-actions bot (Contributor) commented Sep 5, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..87ee2882a 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     # If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
     if length(Rother) >= serial_mapreduce_threshold(dev)
         args = (f, op, init, Rreduce, Rother, R, A)
-        kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+        kernel = @cuda launch = false serial_mapreduce_kernel(args...)
         kernel_config = launch_configuration(kernel.fun)
         threads = kernel_config.threads
         blocks = cld(length(Rother), threads)
@@ -240,14 +240,16 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     reduce_blocks = if other_blocks >= kernel_config.blocks
         1
     else
-        min(cld(length(Rreduce), reduce_threads),       # how many we need at most
-            cld(kernel_config.blocks, other_blocks))    # maximize occupancy
+        min(
+            cld(length(Rreduce), reduce_threads),       # how many we need at most
+            cld(kernel_config.blocks, other_blocks)
+        )    # maximize occupancy
     end
 
     # determine the launch configuration
     threads = reduce_threads
     shmem = reduce_shmem
-    blocks = reduce_blocks*other_blocks
+    blocks = reduce_blocks * other_blocks
 
     # perform the actual reduction
     if reduce_blocks == 1
@@ -257,24 +259,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
         # TODO: provide a version that atomically reduces from different blocks
 
         # temporary empty array whose type will match the final partial array
-	    partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+        partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
 
         # NOTE: we can't use the previously-compiled kernel, or its launch configuration,
         #       since the type of `partial` might not match the original output container
         #       (e.g. if that was a view).
-        partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
-        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+        partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
         partial_reduce_threads = compute_threads(partial_kernel_config.threads)
         partial_reduce_shmem = compute_shmem(partial_reduce_threads)
         partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
             1
         else
-            min(cld(length(Rreduce), partial_reduce_threads),
-                cld(partial_kernel_config.blocks, other_blocks))
+            min(
+                cld(length(Rreduce), partial_reduce_threads),
+                cld(partial_kernel_config.blocks, other_blocks)
+            )
         end
         partial_threads = partial_reduce_threads
         partial_shmem = partial_reduce_shmem
-        partial_blocks = partial_reduce_blocks*other_blocks
+        partial_blocks = partial_reduce_blocks * other_blocks
 
         partial = similar(R, (size(R)..., partial_blocks))
         if init === nothing
@@ -282,8 +286,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
             partial .= R
         end
 
-        partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
-                       threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+        partial_kernel(
+            f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+            threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+        )
 
         GPUArrays.mapreducedim!(identity, op, R, partial; init)
     end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
 @testset "large map reduce" begin
   dev = device()
 
-  big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+    big_size = CUDA.serial_mapreduce_threshold(dev) + 5
   a = rand(Float32, big_size, 31)
   c = CuArray(a)
 

github-actions bot (Contributor) left a comment


CUDA.jl Benchmarks

Benchmark suite Current: df9c738 Previous: e734d7a Ratio
latency/precompile 43629992505.5 ns 43398695331.5 ns 1.01
latency/ttfp 7158179129 ns 7160470719 ns 1.00
latency/import 3736142904 ns 3655684127 ns 1.02
integration/volumerhs 9626688.5 ns 9617572.5 ns 1.00
integration/byval/slices=1 146912 ns 147149 ns 1.00
integration/byval/slices=3 425798 ns 426071 ns 1.00
integration/byval/reference 144995 ns 145099 ns 1.00
integration/byval/slices=2 286414 ns 286606 ns 1.00
integration/cudadevrt 103460 ns 103567 ns 1.00
kernel/indexing 14314 ns 14309 ns 1.00
kernel/indexing_checked 15136 ns 14963.5 ns 1.01
kernel/occupancy 697.9533333333334 ns 667.6981132075472 ns 1.05
kernel/launch 2270 ns 2240.222222222222 ns 1.01
kernel/rand 15489 ns 15887 ns 0.97
array/reverse/1d 20096 ns 19850 ns 1.01
array/reverse/2d 24203.5 ns 25206 ns 0.96
array/reverse/1d_inplace 10975 ns 10666 ns 1.03
array/reverse/2d_inplace 13446 ns 12171 ns 1.10
array/copy 21019 ns 20937 ns 1.00
array/iteration/findall/int 158211 ns 157262 ns 1.01
array/iteration/findall/bool 139951 ns 139199 ns 1.01
array/iteration/findfirst/int 159122 ns 2147909.5 ns 0.0740822646391759
array/iteration/findfirst/bool 160046 ns 2129592 ns 0.0751533627098524
array/iteration/scalar 71237 ns 71676 ns 0.99
array/iteration/logical 215505 ns 235260.5 ns 0.92
array/iteration/findmin/1d 48299 ns 239425 ns 0.20
array/iteration/findmin/2d 96340.5 ns 96120 ns 1.00
array/reductions/reduce/Int64/1d 43686 ns 147373 ns 0.30
array/reductions/reduce/Int64/dims=1 44634.5 ns 44089 ns 1.01
array/reductions/reduce/Int64/dims=2 61714 ns 61492 ns 1.00
array/reductions/reduce/Int64/dims=1L 88989 ns 89053 ns 1.00
array/reductions/reduce/Int64/dims=2L 89880 ns 661490 ns 0.14
array/reductions/reduce/Float32/1d 36837 ns 104099.5 ns 0.35
array/reductions/reduce/Float32/dims=1 42569.5 ns 40848 ns 1.04
array/reductions/reduce/Float32/dims=2 60046 ns 59344 ns 1.01
array/reductions/reduce/Float32/dims=1L 52416 ns 52528 ns 1.00
array/reductions/reduce/Float32/dims=2L 73914 ns 547809 ns 0.13
array/reductions/mapreduce/Int64/1d 43669 ns 149021 ns 0.29
array/reductions/mapreduce/Int64/dims=1 45028.5 ns 44144 ns 1.02
array/reductions/mapreduce/Int64/dims=2 61975 ns 61760 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 88929 ns 89012 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 90163 ns 685323 ns 0.13
array/reductions/mapreduce/Float32/1d 36726 ns 104623 ns 0.35
array/reductions/mapreduce/Float32/dims=1 43663.5 ns 41094 ns 1.06
array/reductions/mapreduce/Float32/dims=2 60623 ns 59797 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 52803 ns 52788.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 74876 ns 549999.5 ns 0.14
array/broadcast 19959 ns 20109 ns 0.99
array/copyto!/gpu_to_gpu 12710 ns 12888 ns 0.99
array/copyto!/cpu_to_gpu 213957 ns 213909 ns 1.00
array/copyto!/gpu_to_cpu 281665 ns 283768.5 ns 0.99
array/accumulate/Int64/1d 125404 ns 124986 ns 1.00
array/accumulate/Int64/dims=1 83948 ns 83520 ns 1.01
array/accumulate/Int64/dims=2 157987.5 ns 158891 ns 0.99
array/accumulate/Int64/dims=1L 1710745.5 ns 1720572.5 ns 0.99
array/accumulate/Int64/dims=2L 966802.5 ns 967934 ns 1.00
array/accumulate/Float32/1d 110031 ns 109201 ns 1.01
array/accumulate/Float32/dims=1 80813 ns 80412 ns 1.00
array/accumulate/Float32/dims=2 147985.5 ns 147560.5 ns 1.00
array/accumulate/Float32/dims=1L 1618965 ns 1618433 ns 1.00
array/accumulate/Float32/dims=2L 699190 ns 698180 ns 1.00
array/construct 1315.4 ns 1627.3 ns 0.81
array/random/randn/Float32 46581.5 ns 45289.5 ns 1.03
array/random/randn!/Float32 24952 ns 25002 ns 1.00
array/random/rand!/Int64 27396 ns 27438 ns 1.00
array/random/rand!/Float32 8662.5 ns 8965 ns 0.97
array/random/rand/Int64 30091 ns 29907 ns 1.01
array/random/rand/Float32 13218 ns 13209 ns 1.00
array/permutedims/4d 60318 ns 60267 ns 1.00
array/permutedims/2d 54285 ns 53766.5 ns 1.01
array/permutedims/3d 55144 ns 54703 ns 1.01
array/sorting/1d 2758569 ns 2755967.5 ns 1.00
array/sorting/by 3354299.5 ns 3342542 ns 1.00
array/sorting/2d 1084839 ns 1079574 ns 1.00
cuda/synchronization/stream/auto 1047 ns 1015.3 ns 1.03
cuda/synchronization/stream/nonblocking 7695.6 ns 7605.6 ns 1.01
cuda/synchronization/stream/blocking 802.4516129032259 ns 809.5368421052632 ns 0.99
cuda/synchronization/context/auto 1179.9 ns 1144.9 ns 1.03
cuda/synchronization/context/nonblocking 8305.5 ns 8738.6 ns 0.95
cuda/synchronization/context/blocking 911.25 ns 884.6666666666666 ns 1.03

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) commented Sep 9, 2025

Ugh, I wanted to avoid having to duplicate the entire launch configuration computation. The resize! seems potentially problematic as well (also because it's kinda useless); could we first go with a zero-size array that doesn't require an allocation?

I'm also surprised this affected performance that much. A secondary kernel launch should always be more expensive than simply having some threads loop. I guess it's a case where the new launch configuration is very one-sided (e.g. with only a single block because the reduction space is much larger than the preserved space)?
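
For reference, the block count under discussion is computed roughly as below; the numbers are hypothetical, and the real expression is in the src/mapreduce.jl hunk shown above.

```julia
# Hypothetical values; the actual expression lives in src/mapreduce.jl (see diff above).
Rreduce_len   = 1_000_000   # length of the reduced dimension(s)
other_blocks  = 4           # one block per preserved ("other") output index
config_blocks = 1024        # block count suggested by launch_configuration
threads       = 512

reduce_blocks = other_blocks >= config_blocks ? 1 :
    min(cld(Rreduce_len, threads),        # how many blocks the reduction needs at most
        cld(config_blocks, other_blocks)) # capped so occupancy stays reasonable

blocks = reduce_blocks * other_blocks
```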

christiangnrd (Member, Author) commented Sep 10, 2025

> Ugh, I wanted to avoid having to duplicate the entire launch configuration computation. The resize! seems potentially problematic as well (also because it's kinda useless); could we first go with a zero-size array that doesn't require an allocation?

The resize! is there because the results would probably be incorrect if the new number of threads changed the required number of blocks; partial would then need to have the right size. The latest commits are my attempt to implement the zero-size array, but it seems to perform slightly worse than the current implementation.
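
For reference, a minimal sketch of the zero-size placeholder idea (illustrative shapes and block count, not the PR code): the empty array only pins down the element and container type for kernel compilation, and the real buffer is allocated once the block count is known.

```julia
using CUDA

R = CUDA.zeros(Float32, 32, 1)   # stand-in for the reduction output

# Zero-element array with one extra dimension: same eltype and container type as
# the eventual partial buffer, but no meaningful device allocation.
partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
@assert length(partial) == 0

# Once the launch configuration is known, allocate the real buffer.
partial_blocks = 4   # hypothetical; in the PR this comes from the recomputed config
partial = similar(R, (size(R)..., partial_blocks))
```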

maleadt (Member) commented Sep 10, 2025

> The resize! is there because the results would probably be incorrect if the new number of threads changed the required number of blocks; partial would then need to have the right size.

I know; I was unclear. I meant to say that it unconditionally renders the initial allocation useless, while still necessitating a memory copy from the old (but uninitialized) memory. That significantly increases the cost in terms of API calls.
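
To illustrate the cost being described, a 1-D sketch with made-up sizes (not the actual reduction buffer): resize! preserves existing contents, so growing a buffer that was never written still pays for a device-side copy on top of the new allocation.

```julia
using CUDA

partial = CuArray{Float32}(undef, 256)   # initial allocation; contents never written
resize!(partial, 1024)                   # new allocation plus a copy of the old
                                         # 256 elements, even though they were garbage
```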

It's surprising that the zero-size approach is slower. The kernels cannot be different, and the zero-size allocation should be free...

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.49%. Comparing base (b9f7c41) to head (c3c2885).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
src/mapreduce.jl | 96.15% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2880       +/-   ##
===========================================
+ Coverage   12.15%   88.49%   +76.34%     
===========================================
  Files         147      150        +3     
  Lines       12930    13169      +239     
===========================================
+ Hits         1571    11654    +10083     
+ Misses      11359     1515     -9844     


maleadt merged commit c929405 into JuliaGPU:master on Sep 10, 2025 (3 checks passed)
christiangnrd deleted the mred branch on September 10, 2025 at 21:32
kshyatt pushed a commit that referenced this pull request Sep 11, 2025
maleadt added a commit that referenced this pull request Sep 22, 2025
