Fix memory corruption in mapreduce #2907
CUDA.jl Benchmarks
| Benchmark suite | Current: 30c5352 | Previous: bb88163 | Ratio |
|---|---|---|---|
| latency/precompile | 57177092254 ns | 57317397112 ns | 1.00 |
| latency/ttfp | 8163795201.5 ns | 8147713210 ns | 1.00 |
| latency/import | 4523409627 ns | 4526239162 ns | 1.00 |
| integration/volumerhs | 9616953.5 ns | 9611180 ns | 1.00 |
| integration/byval/slices=1 | 146915 ns | 147188 ns | 1.00 |
| integration/byval/slices=3 | 425906 ns | 426028 ns | 1.00 |
| integration/byval/reference | 145016 ns | 145313 ns | 1.00 |
| integration/byval/slices=2 | 286368 ns | 286701.5 ns | 1.00 |
| integration/cudadevrt | 103608 ns | 103791 ns | 1.00 |
| kernel/indexing | 14214 ns | 14403 ns | 0.99 |
| kernel/indexing_checked | 15045 ns | 15312 ns | 0.98 |
| kernel/occupancy | 672.6582278481013 ns | 669.5886075949367 ns | 1.00 |
| kernel/launch | 2220.8888888888887 ns | 2207.5555555555557 ns | 1.01 |
| kernel/rand | 14822 ns | 15091 ns | 0.98 |
| array/reverse/1d | 20317 ns | 20158 ns | 1.01 |
| array/reverse/2dL_inplace | 66973.5 ns | 67142.5 ns | 1.00 |
| array/reverse/1dL | 70557 ns | 70245 ns | 1.00 |
| array/reverse/2d | 21952 ns | 22186 ns | 0.99 |
| array/reverse/1d_inplace | 9734 ns | 9826 ns | 0.99 |
| array/reverse/2d_inplace | 11155 ns | 13581 ns | 0.82 |
| array/reverse/2dL | 73918.5 ns | 74246 ns | 1.00 |
| array/reverse/1dL_inplace | 66887 ns | 66920 ns | 1.00 |
| array/copy | 20876 ns | 20927 ns | 1.00 |
| array/iteration/findall/int | 157196 ns | 157946 ns | 1.00 |
| array/iteration/findall/bool | 139865 ns | 138951 ns | 1.01 |
| array/iteration/findfirst/int | 161472 ns | 160970.5 ns | 1.00 |
| array/iteration/findfirst/bool | 162192 ns | 161691.5 ns | 1.00 |
| array/iteration/scalar | 72824 ns | 74056.5 ns | 0.98 |
| array/iteration/logical | 215807 ns | 216360.5 ns | 1.00 |
| array/iteration/findmin/1d | 50744 ns | 50590 ns | 1.00 |
| array/iteration/findmin/2d | 96457 ns | 96901 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43428 ns | 43650 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 55108 ns | 44628 ns | 1.23 |
| array/reductions/reduce/Int64/dims=2 | 61402 ns | 61474 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 89003 ns | 89082 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 87870 ns | 87873 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 36672 ns | 37502.5 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1 | 46136.5 ns | 42229.5 ns | 1.09 |
| array/reductions/reduce/Float32/dims=2 | 59577 ns | 60144 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52448 ns | 52690 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 71869 ns | 72227.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43769 ns | 43386 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 44642.5 ns | 49211.5 ns | 0.91 |
| array/reductions/mapreduce/Int64/dims=2 | 61685.5 ns | 61803 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 89029 ns | 89097 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 88149.5 ns | 88388 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 36674 ns | 38045 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=1 | 42398 ns | 42251.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 60172 ns | 60265 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52809 ns | 52870 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 72225.5 ns | 72214 ns | 1.00 |
| array/broadcast | 20138 ns | 20220 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11333 ns | 13230 ns | 0.86 |
| array/copyto!/cpu_to_gpu | 216090 ns | 215637 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 285576 ns | 283097 ns | 1.01 |
| array/accumulate/Int64/1d | 125094 ns | 124870 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83637 ns | 83478 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157733 ns | 157866 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709172.5 ns | 1708781.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966298.5 ns | 966771.5 ns | 1.00 |
| array/accumulate/Float32/1d | 109360 ns | 109240 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80190.5 ns | 80433 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147574 ns | 147663 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618271.5 ns | 1617944.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698216 ns | 698274 ns | 1.00 |
| array/construct | 1285.8 ns | 1301.7 ns | 0.99 |
| array/random/randn/Float32 | 45307.5 ns | 45481 ns | 1.00 |
| array/random/randn!/Float32 | 25085 ns | 25068 ns | 1.00 |
| array/random/rand!/Int64 | 27391 ns | 27506 ns | 1.00 |
| array/random/rand!/Float32 | 8819 ns | 8985.666666666666 ns | 0.98 |
| array/random/rand/Int64 | 29957 ns | 30362 ns | 0.99 |
| array/random/rand/Float32 | 13203 ns | 13368.5 ns | 0.99 |
| array/permutedims/4d | 60161 ns | 60223.5 ns | 1.00 |
| array/permutedims/2d | 53815 ns | 54018.5 ns | 1.00 |
| array/permutedims/3d | 54707 ns | 54770.5 ns | 1.00 |
| array/sorting/1d | 2757932 ns | 2758706 ns | 1.00 |
| array/sorting/by | 3344898.5 ns | 3345315.5 ns | 1.00 |
| array/sorting/2d | 1080629 ns | 1082259 ns | 1.00 |
| cuda/synchronization/stream/auto | 1061.2 ns | 1038.3 ns | 1.02 |
| cuda/synchronization/stream/nonblocking | 7540.4 ns | 8063.700000000001 ns | 0.94 |
| cuda/synchronization/stream/blocking | 827.8958333333334 ns | 818 ns | 1.01 |
| cuda/synchronization/context/auto | 1177.3 ns | 1188.9 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 7189.6 ns | 8084.1 ns | 0.89 |
| cuda/synchronization/context/blocking | 909.7777777777778 ns | 908.8913043478261 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Hmm, this is why I wanted to avoid duplicating the whole launch configuration determination. It wasn't clear to me what exactly we could reuse from the previous computations, and it's a lot of code to copy/paste. Let's hope this fixes the issue.
I'll test it on our cluster as soon as it's done with other computations; I don't want to interrupt them, and they use pretty much 100% of the GPU.
Yeah, this was a nasty one. I know your policy is to only do the work if someone asks, but I'm wondering whether this should also be backported and released as 5.8.5, since it fixes a fix that shipped in 5.8.4. With your go-ahead, once this is merged, I'll do all the backport work I can (everything but merging the backport PR), since this is my bug.
I've opened #2909 to hopefully resolve the 1.11 failures.
Yes, we should backport if possible.
As far as I can tell, the issue has been resolved; I couldn't reproduce it for a while. Thanks! Do you mind adding a couple of tests to reduce the probability of this happening in the future? I couldn't easily find a single test for
Plenty of tests here: https://github.com/JuliaGPU/GPUArrays.jl/blob/3be4a0978f643b2322c4574f1c7d48722ef43eed/test/testsuite/reductions.jl#L101-L114
@bvdmitri Can you check that this fixes your issue?
Closes #2903