mapreduce: reinstate and fix block optimization #2880
Conversation
Your PR requires formatting changes to meet the project's style guidelines. The suggested changes are shown below.

diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..87ee2882a 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
# If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
if length(Rother) >= serial_mapreduce_threshold(dev)
args = (f, op, init, Rreduce, Rother, R, A)
- kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+ kernel = @cuda launch = false serial_mapreduce_kernel(args...)
kernel_config = launch_configuration(kernel.fun)
threads = kernel_config.threads
blocks = cld(length(Rother), threads)
@@ -240,14 +240,16 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
reduce_blocks = if other_blocks >= kernel_config.blocks
1
else
- min(cld(length(Rreduce), reduce_threads), # how many we need at most
- cld(kernel_config.blocks, other_blocks)) # maximize occupancy
+ min(
+ cld(length(Rreduce), reduce_threads), # how many we need at most
+ cld(kernel_config.blocks, other_blocks)
+ ) # maximize occupancy
end
# determine the launch configuration
threads = reduce_threads
shmem = reduce_shmem
- blocks = reduce_blocks*other_blocks
+ blocks = reduce_blocks * other_blocks
# perform the actual reduction
if reduce_blocks == 1
@@ -257,24 +259,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
# TODO: provide a version that atomically reduces from different blocks
# temporary empty array whose type will match the final partial array
- partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+ partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
# NOTE: we can't use the previously-compiled kernel, or its launch configuration,
# since the type of `partial` might not match the original output container
# (e.g. if that was a view).
- partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
- partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+ partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+ partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
partial_reduce_threads = compute_threads(partial_kernel_config.threads)
partial_reduce_shmem = compute_shmem(partial_reduce_threads)
partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
1
else
- min(cld(length(Rreduce), partial_reduce_threads),
- cld(partial_kernel_config.blocks, other_blocks))
+ min(
+ cld(length(Rreduce), partial_reduce_threads),
+ cld(partial_kernel_config.blocks, other_blocks)
+ )
end
partial_threads = partial_reduce_threads
partial_shmem = partial_reduce_shmem
- partial_blocks = partial_reduce_blocks*other_blocks
+ partial_blocks = partial_reduce_blocks * other_blocks
partial = similar(R, (size(R)..., partial_blocks))
if init === nothing
@@ -282,8 +286,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
partial .= R
end
- partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
- threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+ partial_kernel(
+ f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+ threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+ )
GPUArrays.mapreducedim!(identity, op, R, partial; init)
end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
@testset "large map reduce" begin
dev = device()
- big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+ big_size = CUDA.serial_mapreduce_threshold(dev) + 5
a = rand(Float32, big_size, 31)
c = CuArray(a)
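For context, the `launch=false` + `launch_configuration` pattern that these hunks reformat is typically used as sketched below. This is a minimal example with a hypothetical serial reduction kernel, not the actual `serial_mapreduce_kernel`:

```julia
using CUDA

# Hypothetical serial fallback: one thread per output row, looping over the
# reduction dimension, so no second kernel launch is needed.
function serial_sum_dims2!(R, A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(A, 1)
        acc = zero(eltype(R))
        for j in 1:size(A, 2)
            @inbounds acc += A[i, j]
        end
        @inbounds R[i] = acc
    end
    return
end

A = CUDA.rand(Float32, 4096, 1000)
R = CUDA.zeros(Float32, 4096)

# Compile without launching, then query an occupancy-based configuration.
kernel = @cuda launch=false serial_sum_dims2!(R, A)
config = launch_configuration(kernel.fun)
threads = min(size(A, 1), config.threads)
blocks = cld(size(A, 1), threads)
kernel(R, A; threads, blocks)
```

The suggested diff only changes spacing (e.g. `launch = false` and keyword-argument spacing); the launch logic itself is untouched.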
CUDA.jl Benchmarks
| Benchmark suite | Current: df9c738 | Previous: e734d7a | Ratio |
|---|---|---|---|
| latency/precompile | 43629992505.5 ns | 43398695331.5 ns | 1.01 |
| latency/ttfp | 7158179129 ns | 7160470719 ns | 1.00 |
| latency/import | 3736142904 ns | 3655684127 ns | 1.02 |
| integration/volumerhs | 9626688.5 ns | 9617572.5 ns | 1.00 |
| integration/byval/slices=1 | 146912 ns | 147149 ns | 1.00 |
| integration/byval/slices=3 | 425798 ns | 426071 ns | 1.00 |
| integration/byval/reference | 144995 ns | 145099 ns | 1.00 |
| integration/byval/slices=2 | 286414 ns | 286606 ns | 1.00 |
| integration/cudadevrt | 103460 ns | 103567 ns | 1.00 |
| kernel/indexing | 14314 ns | 14309 ns | 1.00 |
| kernel/indexing_checked | 15136 ns | 14963.5 ns | 1.01 |
| kernel/occupancy | 697.9533333333334 ns | 667.6981132075472 ns | 1.05 |
| kernel/launch | 2270 ns | 2240.222222222222 ns | 1.01 |
| kernel/rand | 15489 ns | 15887 ns | 0.97 |
| array/reverse/1d | 20096 ns | 19850 ns | 1.01 |
| array/reverse/2d | 24203.5 ns | 25206 ns | 0.96 |
| array/reverse/1d_inplace | 10975 ns | 10666 ns | 1.03 |
| array/reverse/2d_inplace | 13446 ns | 12171 ns | 1.10 |
| array/copy | 21019 ns | 20937 ns | 1.00 |
| array/iteration/findall/int | 158211 ns | 157262 ns | 1.01 |
| array/iteration/findall/bool | 139951 ns | 139199 ns | 1.01 |
| array/iteration/findfirst/int | 159122 ns | 2147909.5 ns | 0.0740822646391759 |
| array/iteration/findfirst/bool | 160046 ns | 2129592 ns | 0.0751533627098524 |
| array/iteration/scalar | 71237 ns | 71676 ns | 0.99 |
| array/iteration/logical | 215505 ns | 235260.5 ns | 0.92 |
| array/iteration/findmin/1d | 48299 ns | 239425 ns | 0.20 |
| array/iteration/findmin/2d | 96340.5 ns | 96120 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43686 ns | 147373 ns | 0.30 |
| array/reductions/reduce/Int64/dims=1 | 44634.5 ns | 44089 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 61714 ns | 61492 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88989 ns | 89053 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 89880 ns | 661490 ns | 0.14 |
| array/reductions/reduce/Float32/1d | 36837 ns | 104099.5 ns | 0.35 |
| array/reductions/reduce/Float32/dims=1 | 42569.5 ns | 40848 ns | 1.04 |
| array/reductions/reduce/Float32/dims=2 | 60046 ns | 59344 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 52416 ns | 52528 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 73914 ns | 547809 ns | 0.13 |
| array/reductions/mapreduce/Int64/1d | 43669 ns | 149021 ns | 0.29 |
| array/reductions/mapreduce/Int64/dims=1 | 45028.5 ns | 44144 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=2 | 61975 ns | 61760 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 88929 ns | 89012 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 90163 ns | 685323 ns | 0.13 |
| array/reductions/mapreduce/Float32/1d | 36726 ns | 104623 ns | 0.35 |
| array/reductions/mapreduce/Float32/dims=1 | 43663.5 ns | 41094 ns | 1.06 |
| array/reductions/mapreduce/Float32/dims=2 | 60623 ns | 59797 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 52803 ns | 52788.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 74876 ns | 549999.5 ns | 0.14 |
| array/broadcast | 19959 ns | 20109 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 12710 ns | 12888 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 213957 ns | 213909 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 281665 ns | 283768.5 ns | 0.99 |
| array/accumulate/Int64/1d | 125404 ns | 124986 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83948 ns | 83520 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 157987.5 ns | 158891 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1710745.5 ns | 1720572.5 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 966802.5 ns | 967934 ns | 1.00 |
| array/accumulate/Float32/1d | 110031 ns | 109201 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 80813 ns | 80412 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147985.5 ns | 147560.5 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618965 ns | 1618433 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 699190 ns | 698180 ns | 1.00 |
| array/construct | 1315.4 ns | 1627.3 ns | 0.81 |
| array/random/randn/Float32 | 46581.5 ns | 45289.5 ns | 1.03 |
| array/random/randn!/Float32 | 24952 ns | 25002 ns | 1.00 |
| array/random/rand!/Int64 | 27396 ns | 27438 ns | 1.00 |
| array/random/rand!/Float32 | 8662.5 ns | 8965 ns | 0.97 |
| array/random/rand/Int64 | 30091 ns | 29907 ns | 1.01 |
| array/random/rand/Float32 | 13218 ns | 13209 ns | 1.00 |
| array/permutedims/4d | 60318 ns | 60267 ns | 1.00 |
| array/permutedims/2d | 54285 ns | 53766.5 ns | 1.01 |
| array/permutedims/3d | 55144 ns | 54703 ns | 1.01 |
| array/sorting/1d | 2758569 ns | 2755967.5 ns | 1.00 |
| array/sorting/by | 3354299.5 ns | 3342542 ns | 1.00 |
| array/sorting/2d | 1084839 ns | 1079574 ns | 1.00 |
| cuda/synchronization/stream/auto | 1047 ns | 1015.3 ns | 1.03 |
| cuda/synchronization/stream/nonblocking | 7695.6 ns | 7605.6 ns | 1.01 |
| cuda/synchronization/stream/blocking | 802.4516129032259 ns | 809.5368421052632 ns | 0.99 |
| cuda/synchronization/context/auto | 1179.9 ns | 1144.9 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 8305.5 ns | 8738.6 ns | 0.95 |
| cuda/synchronization/context/blocking | 911.25 ns | 884.6666666666666 ns | 1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
Ugh, I wanted to avoid having to duplicate the entire launch configuration computation. I'm also surprised this affected performance that much: a secondary kernel launch should always be more expensive than simply having some threads loop. I guess it's a case where the new launch configuration is very one-sided (e.g. with only a single block because the reduction space is much larger than the preserved space)?
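A rough worked example of the one-sided case described above, with made-up numbers and following the shape of the block computation in the diff:

```julia
# Made-up numbers: a single preserved slice versus a huge reduction dimension.
length_Rother  = 1           # preserved space collapses to one slice
length_Rreduce = 2^22        # reduction dimension
threads        = 512         # threads picked by the launch configuration
max_blocks     = 1024        # occupancy-suggested number of blocks

other_blocks  = length_Rother
reduce_blocks = other_blocks >= max_blocks ? 1 :
    min(cld(length_Rreduce, threads),    # 8192 blocks needed at most
        cld(max_blocks, other_blocks))   # 1024 blocks to saturate the GPU
blocks = reduce_blocks * other_blocks    # 1024 blocks in total
```

Here almost all of the parallelism has to come from the reduction dimension, which is where the choice between a serial loop and a second partial-reduction kernel matters most.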
The
I know; I was unclear. I meant to say that it unconditionally renders the initial allocation useless, while still necessitating a memory copy from the old (but uninitialized) memory. That significantly increases the cost in terms of API calls. It's surprising that the zero-size approach is slower. The kernels cannot be different, and the zero-size allocation should be free...
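For readers following along, the "zero-size approach" is the trick of compiling the partial kernel against an empty placeholder array so that the real `partial` only needs to be allocated once the launch configuration is known. A minimal sketch with a made-up kernel (not the actual `partial_mapreduce_grid`):

```julia
using CUDA

# Hypothetical grid-strided partial reduction; each thread accumulates a
# strided slice of A and atomically adds it to its block's slot.
function partial_reduce!(partial, A)
    acc = zero(eltype(partial))
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= length(A)
        @inbounds acc += A[i]
        i += stride
    end
    CUDA.@atomic partial[blockIdx().x] += acc  # naive per-thread atomic
    return
end

R = CUDA.zeros(Float32, 1)           # stand-in for the output container
A = CUDA.rand(Float32, 1024, 1024)

# 0-element placeholder with the same type as the eventual `partial`, used
# only to compile the kernel and query a launch configuration.
dummy = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
kernel = @cuda launch=false partial_reduce!(dummy, A)
config = launch_configuration(kernel.fun)

# Allocate the real `partial` only once the block count is known.
blocks = min(config.blocks, cld(length(A), config.threads))
partial = fill!(similar(R, (size(R)..., blocks)), 0)
kernel(partial, A; threads=config.threads, blocks)

result = sum(Array(partial))         # final combine on the host, for brevity
```

The placeholder is a 0-element allocation with the same element type and dimensionality as the real `partial`, so the compiled kernel can be reused for the full-size array.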
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #2880       +/-   ##
===========================================
+ Coverage   12.15%   88.49%   +76.34%
===========================================
  Files         147      150        +3
  Lines       12930    13169      +239
===========================================
+ Hits         1571    11654    +10083
+ Misses      11359     1515     -9844

☔ View full report in Codecov by Sentry.
Co-authored-by: Tim Besard <[email protected]>
Co-authored-by: Tim Besard <[email protected]>
Do not squash.
My attempt to fix the performance regression I mention in #2869 (comment). I recalculate the optimal number of threads and, if needed, `partial` gets resized when the new thread count requires more blocks. Not quite sure how this could be tested.

Benchmarks in order: before, this commit, after the removal of the block optimization (current). The benchmarks are in this order because I wanted to make it easier to see that there's a slight performance regression relative to the old version, but if the change is correct, that's still an improvement over the current version.
