
Conversation

christiangnrd (Member) commented Sep 5, 2025

Do not squash.

My attempt to fix the performance regression I mention in #2869 (comment). I recalculate the optimal number of threads and, if the new thread count requires more blocks, resize partial accordingly. Not quite sure how this could be tested.
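
For context, here's a minimal, self-contained sketch of the @cuda launch=false / launch_configuration pattern the reduction code relies on to pick threads and blocks; the vadd! kernel and array sizes are made up for illustration, this is not the PR code.

```julia
using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(Float32, 2^20); b = CUDA.rand(Float32, 2^20); c = similar(a)

# Compile without launching, then query the occupancy API for a good configuration;
# this is the same mechanism mapreducedim! uses to (re)compute threads and blocks.
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```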

Benchmarks, in order: before, this commit, and after the removal of the block optimization (current). They're in this order because I wanted to make it easier to see that there's a slight performance regression compared to the old version, but if the new code is correct, that's still an improvement over the current version.
[screenshot: benchmark comparison]

github-actions bot (Contributor) commented Sep 5, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..87ee2882a 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     # If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
     if length(Rother) >= serial_mapreduce_threshold(dev)
         args = (f, op, init, Rreduce, Rother, R, A)
-        kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+        kernel = @cuda launch = false serial_mapreduce_kernel(args...)
         kernel_config = launch_configuration(kernel.fun)
         threads = kernel_config.threads
         blocks = cld(length(Rother), threads)
@@ -240,14 +240,16 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     reduce_blocks = if other_blocks >= kernel_config.blocks
         1
     else
-        min(cld(length(Rreduce), reduce_threads),       # how many we need at most
-            cld(kernel_config.blocks, other_blocks))    # maximize occupancy
+        min(
+            cld(length(Rreduce), reduce_threads),       # how many we need at most
+            cld(kernel_config.blocks, other_blocks)
+        )    # maximize occupancy
     end
 
     # determine the launch configuration
     threads = reduce_threads
     shmem = reduce_shmem
-    blocks = reduce_blocks*other_blocks
+    blocks = reduce_blocks * other_blocks
 
     # perform the actual reduction
     if reduce_blocks == 1
@@ -257,24 +259,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
         # TODO: provide a version that atomically reduces from different blocks
 
         # temporary empty array whose type will match the final partial array
-	    partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+        partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
 
         # NOTE: we can't use the previously-compiled kernel, or its launch configuration,
         #       since the type of `partial` might not match the original output container
         #       (e.g. if that was a view).
-        partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
-        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+        partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
         partial_reduce_threads = compute_threads(partial_kernel_config.threads)
         partial_reduce_shmem = compute_shmem(partial_reduce_threads)
         partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
             1
         else
-            min(cld(length(Rreduce), partial_reduce_threads),
-                cld(partial_kernel_config.blocks, other_blocks))
+            min(
+                cld(length(Rreduce), partial_reduce_threads),
+                cld(partial_kernel_config.blocks, other_blocks)
+            )
         end
         partial_threads = partial_reduce_threads
         partial_shmem = partial_reduce_shmem
-        partial_blocks = partial_reduce_blocks*other_blocks
+        partial_blocks = partial_reduce_blocks * other_blocks
 
         partial = similar(R, (size(R)..., partial_blocks))
         if init === nothing
@@ -282,8 +286,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
             partial .= R
         end
 
-        partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
-                       threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+        partial_kernel(
+            f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+            threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+        )
 
         GPUArrays.mapreducedim!(identity, op, R, partial; init)
     end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
 @testset "large map reduce" begin
   dev = device()
 
-  big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+    big_size = CUDA.serial_mapreduce_threshold(dev) + 5
   a = rand(Float32, big_size, 31)
   c = CuArray(a)
 

github-actions bot (Contributor) left a comment


CUDA.jl Benchmarks

Benchmark suite Current: df9c738 Previous: e734d7a Ratio
latency/precompile 43629992505.5 ns 43398695331.5 ns 1.01
latency/ttfp 7158179129 ns 7160470719 ns 1.00
latency/import 3736142904 ns 3655684127 ns 1.02
integration/volumerhs 9626688.5 ns 9617572.5 ns 1.00
integration/byval/slices=1 146912 ns 147149 ns 1.00
integration/byval/slices=3 425798 ns 426071 ns 1.00
integration/byval/reference 144995 ns 145099 ns 1.00
integration/byval/slices=2 286414 ns 286606 ns 1.00
integration/cudadevrt 103460 ns 103567 ns 1.00
kernel/indexing 14314 ns 14309 ns 1.00
kernel/indexing_checked 15136 ns 14963.5 ns 1.01
kernel/occupancy 697.9533333333334 ns 667.6981132075472 ns 1.05
kernel/launch 2270 ns 2240.222222222222 ns 1.01
kernel/rand 15489 ns 15887 ns 0.97
array/reverse/1d 20096 ns 19850 ns 1.01
array/reverse/2d 24203.5 ns 25206 ns 0.96
array/reverse/1d_inplace 10975 ns 10666 ns 1.03
array/reverse/2d_inplace 13446 ns 12171 ns 1.10
array/copy 21019 ns 20937 ns 1.00
array/iteration/findall/int 158211 ns 157262 ns 1.01
array/iteration/findall/bool 139951 ns 139199 ns 1.01
array/iteration/findfirst/int 159122 ns 2147909.5 ns 0.0740822646391759
array/iteration/findfirst/bool 160046 ns 2129592 ns 0.0751533627098524
array/iteration/scalar 71237 ns 71676 ns 0.99
array/iteration/logical 215505 ns 235260.5 ns 0.92
array/iteration/findmin/1d 48299 ns 239425 ns 0.20
array/iteration/findmin/2d 96340.5 ns 96120 ns 1.00
array/reductions/reduce/Int64/1d 43686 ns 147373 ns 0.30
array/reductions/reduce/Int64/dims=1 44634.5 ns 44089 ns 1.01
array/reductions/reduce/Int64/dims=2 61714 ns 61492 ns 1.00
array/reductions/reduce/Int64/dims=1L 88989 ns 89053 ns 1.00
array/reductions/reduce/Int64/dims=2L 89880 ns 661490 ns 0.14
array/reductions/reduce/Float32/1d 36837 ns 104099.5 ns 0.35
array/reductions/reduce/Float32/dims=1 42569.5 ns 40848 ns 1.04
array/reductions/reduce/Float32/dims=2 60046 ns 59344 ns 1.01
array/reductions/reduce/Float32/dims=1L 52416 ns 52528 ns 1.00
array/reductions/reduce/Float32/dims=2L 73914 ns 547809 ns 0.13
array/reductions/mapreduce/Int64/1d 43669 ns 149021 ns 0.29
array/reductions/mapreduce/Int64/dims=1 45028.5 ns 44144 ns 1.02
array/reductions/mapreduce/Int64/dims=2 61975 ns 61760 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 88929 ns 89012 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 90163 ns 685323 ns 0.13
array/reductions/mapreduce/Float32/1d 36726 ns 104623 ns 0.35
array/reductions/mapreduce/Float32/dims=1 43663.5 ns 41094 ns 1.06
array/reductions/mapreduce/Float32/dims=2 60623 ns 59797 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 52803 ns 52788.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 74876 ns 549999.5 ns 0.14
array/broadcast 19959 ns 20109 ns 0.99
array/copyto!/gpu_to_gpu 12710 ns 12888 ns 0.99
array/copyto!/cpu_to_gpu 213957 ns 213909 ns 1.00
array/copyto!/gpu_to_cpu 281665 ns 283768.5 ns 0.99
array/accumulate/Int64/1d 125404 ns 124986 ns 1.00
array/accumulate/Int64/dims=1 83948 ns 83520 ns 1.01
array/accumulate/Int64/dims=2 157987.5 ns 158891 ns 0.99
array/accumulate/Int64/dims=1L 1710745.5 ns 1720572.5 ns 0.99
array/accumulate/Int64/dims=2L 966802.5 ns 967934 ns 1.00
array/accumulate/Float32/1d 110031 ns 109201 ns 1.01
array/accumulate/Float32/dims=1 80813 ns 80412 ns 1.00
array/accumulate/Float32/dims=2 147985.5 ns 147560.5 ns 1.00
array/accumulate/Float32/dims=1L 1618965 ns 1618433 ns 1.00
array/accumulate/Float32/dims=2L 699190 ns 698180 ns 1.00
array/construct 1315.4 ns 1627.3 ns 0.81
array/random/randn/Float32 46581.5 ns 45289.5 ns 1.03
array/random/randn!/Float32 24952 ns 25002 ns 1.00
array/random/rand!/Int64 27396 ns 27438 ns 1.00
array/random/rand!/Float32 8662.5 ns 8965 ns 0.97
array/random/rand/Int64 30091 ns 29907 ns 1.01
array/random/rand/Float32 13218 ns 13209 ns 1.00
array/permutedims/4d 60318 ns 60267 ns 1.00
array/permutedims/2d 54285 ns 53766.5 ns 1.01
array/permutedims/3d 55144 ns 54703 ns 1.01
array/sorting/1d 2758569 ns 2755967.5 ns 1.00
array/sorting/by 3354299.5 ns 3342542 ns 1.00
array/sorting/2d 1084839 ns 1079574 ns 1.00
cuda/synchronization/stream/auto 1047 ns 1015.3 ns 1.03
cuda/synchronization/stream/nonblocking 7695.6 ns 7605.6 ns 1.01
cuda/synchronization/stream/blocking 802.4516129032259 ns 809.5368421052632 ns 0.99
cuda/synchronization/context/auto 1179.9 ns 1144.9 ns 1.03
cuda/synchronization/context/nonblocking 8305.5 ns 8738.6 ns 0.95
cuda/synchronization/context/blocking 911.25 ns 884.6666666666666 ns 1.03

This comment was automatically generated by workflow using github-action-benchmark.

maleadt (Member) commented Sep 9, 2025

Ugh, I wanted to avoid having to duplicate the entire launch configuration computation. The resize! seems potentially problematic as well (also because it's kinda useless); could we first go with a zero-size array that doesn't require an allocation?

I'm also surprised this affected performance that much. A secondary kernel launch should always be more expensive than simply having some threads loop. I guess it's a case where the new launch configuration is very one-sided (e.g. with only a single block because the reduction space is much larger than the preserved space)?
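
For reference, the block count under discussion is computed roughly as below; the numbers are hypothetical, and the real expression is in the src/mapreduce.jl hunk shown above.

```julia
# Hypothetical values; the actual expression lives in src/mapreduce.jl (see diff above).
Rreduce_len   = 1_000_000   # length of the reduced dimension(s)
other_blocks  = 4           # one block per preserved ("other") output index
config_blocks = 1024        # block count suggested by launch_configuration
threads       = 512

reduce_blocks = other_blocks >= config_blocks ? 1 :
    min(cld(Rreduce_len, threads),        # how many blocks the reduction needs at most
        cld(config_blocks, other_blocks)) # capped so occupancy stays reasonable

blocks = reduce_blocks * other_blocks
```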

christiangnrd (Member, Author) commented Sep 10, 2025

> Ugh, I wanted to avoid having to duplicate the entire launch configuration computation. The resize! seems potentially problematic as well (also because it's kinda useless); could we first go with a zero-size array that doesn't require an allocation?

The resize! is there because the results would probably be incorrect if the new number of threads changed the required number of blocks; partial would then need to have the right size. The latest commits are my attempt to implement the zero-size array, but it seems to perform slightly worse than the current implementation.
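
For reference, a minimal sketch of the zero-size placeholder idea (illustrative shapes and block count, not the PR code): the empty array only pins down the element and container type for kernel compilation, and the real buffer is allocated once the block count is known.

```julia
using CUDA

R = CUDA.zeros(Float32, 32, 1)   # stand-in for the reduction output

# Zero-element array with one extra dimension: same eltype and container type as
# the eventual partial buffer, but no meaningful device allocation.
partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
@assert length(partial) == 0

# Once the launch configuration is known, allocate the real buffer.
partial_blocks = 4   # hypothetical; in the PR this comes from the recomputed config
partial = similar(R, (size(R)..., partial_blocks))
```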

maleadt (Member) commented Sep 10, 2025

> The resize! is there because the results would probably be incorrect if the new number of threads changed the required number of blocks; partial would then need to have the right size.

I know; I was unclear. I meant to say that it unconditionally renders the initial allocation useless, while still necessitating a memory copy from the old (but uninitialized) memory. That significantly increases the cost in terms of API calls.
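
To illustrate the cost being described, a 1-D sketch with made-up sizes (not the actual reduction buffer): resize! preserves existing contents, so growing a buffer that was never written still pays for a device-side copy on top of the new allocation.

```julia
using CUDA

partial = CuArray{Float32}(undef, 256)   # initial allocation; contents never written
resize!(partial, 1024)                   # new allocation plus a copy of the old
                                         # 256 elements, even though they were garbage
```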

It's surprising that the zero-size approach is slower. The kernels cannot be different, and the zero-size allocation should be free...

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.49%. Comparing base (b9f7c41) to head (c3c2885).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
src/mapreduce.jl | 96.15% | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2880       +/-   ##
===========================================
+ Coverage   12.15%   88.49%   +76.34%     
===========================================
  Files         147      150        +3     
  Lines       12930    13169      +239     
===========================================
+ Hits         1571    11654    +10083     
+ Misses      11359     1515     -9844     


maleadt merged commit c929405 into JuliaGPU:master on Sep 10, 2025 (3 checks passed)
christiangnrd deleted the mred branch on September 10, 2025 at 21:32
kshyatt pushed a commit that referenced this pull request Sep 11, 2025
maleadt added a commit that referenced this pull request Sep 22, 2025
