Conversation

@christiangnrd
Member

@bvdmitri Can you check that this fixes your issue?

Closes #2903

@github-actions github-actions bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: 30c5352 | Previous: bb88163 | Ratio |
|---|---|---|---|
| latency/precompile | 57177092254 ns | 57317397112 ns | 1.00 |
| latency/ttfp | 8163795201.5 ns | 8147713210 ns | 1.00 |
| latency/import | 4523409627 ns | 4526239162 ns | 1.00 |
| integration/volumerhs | 9616953.5 ns | 9611180 ns | 1.00 |
| integration/byval/slices=1 | 146915 ns | 147188 ns | 1.00 |
| integration/byval/slices=3 | 425906 ns | 426028 ns | 1.00 |
| integration/byval/reference | 145016 ns | 145313 ns | 1.00 |
| integration/byval/slices=2 | 286368 ns | 286701.5 ns | 1.00 |
| integration/cudadevrt | 103608 ns | 103791 ns | 1.00 |
| kernel/indexing | 14214 ns | 14403 ns | 0.99 |
| kernel/indexing_checked | 15045 ns | 15312 ns | 0.98 |
| kernel/occupancy | 672.6582278481013 ns | 669.5886075949367 ns | 1.00 |
| kernel/launch | 2220.8888888888887 ns | 2207.5555555555557 ns | 1.01 |
| kernel/rand | 14822 ns | 15091 ns | 0.98 |
| array/reverse/1d | 20317 ns | 20158 ns | 1.01 |
| array/reverse/2dL_inplace | 66973.5 ns | 67142.5 ns | 1.00 |
| array/reverse/1dL | 70557 ns | 70245 ns | 1.00 |
| array/reverse/2d | 21952 ns | 22186 ns | 0.99 |
| array/reverse/1d_inplace | 9734 ns | 9826 ns | 0.99 |
| array/reverse/2d_inplace | 11155 ns | 13581 ns | 0.82 |
| array/reverse/2dL | 73918.5 ns | 74246 ns | 1.00 |
| array/reverse/1dL_inplace | 66887 ns | 66920 ns | 1.00 |
| array/copy | 20876 ns | 20927 ns | 1.00 |
| array/iteration/findall/int | 157196 ns | 157946 ns | 1.00 |
| array/iteration/findall/bool | 139865 ns | 138951 ns | 1.01 |
| array/iteration/findfirst/int | 161472 ns | 160970.5 ns | 1.00 |
| array/iteration/findfirst/bool | 162192 ns | 161691.5 ns | 1.00 |
| array/iteration/scalar | 72824 ns | 74056.5 ns | 0.98 |
| array/iteration/logical | 215807 ns | 216360.5 ns | 1.00 |
| array/iteration/findmin/1d | 50744 ns | 50590 ns | 1.00 |
| array/iteration/findmin/2d | 96457 ns | 96901 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43428 ns | 43650 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 55108 ns | 44628 ns | 1.23 |
| array/reductions/reduce/Int64/dims=2 | 61402 ns | 61474 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 89003 ns | 89082 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 87870 ns | 87873 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 36672 ns | 37502.5 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1 | 46136.5 ns | 42229.5 ns | 1.09 |
| array/reductions/reduce/Float32/dims=2 | 59577 ns | 60144 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52448 ns | 52690 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 71869 ns | 72227.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43769 ns | 43386 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 44642.5 ns | 49211.5 ns | 0.91 |
| array/reductions/mapreduce/Int64/dims=2 | 61685.5 ns | 61803 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 89029 ns | 89097 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 88149.5 ns | 88388 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 36674 ns | 38045 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=1 | 42398 ns | 42251.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 60172 ns | 60265 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 52809 ns | 52870 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 72225.5 ns | 72214 ns | 1.00 |
| array/broadcast | 20138 ns | 20220 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11333 ns | 13230 ns | 0.86 |
| array/copyto!/cpu_to_gpu | 216090 ns | 215637 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 285576 ns | 283097 ns | 1.01 |
| array/accumulate/Int64/1d | 125094 ns | 124870 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83637 ns | 83478 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157733 ns | 157866 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709172.5 ns | 1708781.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966298.5 ns | 966771.5 ns | 1.00 |
| array/accumulate/Float32/1d | 109360 ns | 109240 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80190.5 ns | 80433 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147574 ns | 147663 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618271.5 ns | 1617944.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698216 ns | 698274 ns | 1.00 |
| array/construct | 1285.8 ns | 1301.7 ns | 0.99 |
| array/random/randn/Float32 | 45307.5 ns | 45481 ns | 1.00 |
| array/random/randn!/Float32 | 25085 ns | 25068 ns | 1.00 |
| array/random/rand!/Int64 | 27391 ns | 27506 ns | 1.00 |
| array/random/rand!/Float32 | 8819 ns | 8985.666666666666 ns | 0.98 |
| array/random/rand/Int64 | 29957 ns | 30362 ns | 0.99 |
| array/random/rand/Float32 | 13203 ns | 13368.5 ns | 0.99 |
| array/permutedims/4d | 60161 ns | 60223.5 ns | 1.00 |
| array/permutedims/2d | 53815 ns | 54018.5 ns | 1.00 |
| array/permutedims/3d | 54707 ns | 54770.5 ns | 1.00 |
| array/sorting/1d | 2757932 ns | 2758706 ns | 1.00 |
| array/sorting/by | 3344898.5 ns | 3345315.5 ns | 1.00 |
| array/sorting/2d | 1080629 ns | 1082259 ns | 1.00 |
| cuda/synchronization/stream/auto | 1061.2 ns | 1038.3 ns | 1.02 |
| cuda/synchronization/stream/nonblocking | 7540.4 ns | 8063.700000000001 ns | 0.94 |
| cuda/synchronization/stream/blocking | 827.8958333333334 ns | 818 ns | 1.01 |
| cuda/synchronization/context/auto | 1177.3 ns | 1188.9 ns | 0.99 |
| cuda/synchronization/context/nonblocking | 7189.6 ns | 8084.1 ns | 0.89 |
| cuda/synchronization/context/blocking | 909.7777777777778 ns | 908.8913043478261 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt
Member

maleadt commented Oct 2, 2025

Hmm, this is why I wanted to avoid duplicating the whole launch configuration determination. It wasn't clear to me what exactly we could re-use from the previous computations, and it's a lot of code to copy/paste.

Let's hope this fixes the issue.
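For context, the "launch configuration determination" here is CUDA.jl's documented occupancy-API idiom: compile with `@cuda launch=false`, ask `launch_configuration` for a good thread count, and derive the block count from it. A minimal sketch of that pattern follows; the `vadd!` kernel and array sizes are purely illustrative, not the mapreduce code this PR touches:

```julia
using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(Float32, 1_000_000)
b = CUDA.rand(Float32, 1_000_000)
c = similar(a)

# Compile without launching, then let the occupancy API pick the
# configuration instead of hard-coding threads/blocks.
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```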

@bvdmitri

bvdmitri commented Oct 2, 2025

I will test it on our cluster as soon as it's done with other computations; I don't want to interrupt it while it's using pretty much 100% of the GPU.

@christiangnrd
Member Author

> this is why I wanted to avoid duplicating the whole launch configuration determination

Yeah this was a nasty one.

I know your policy is to only do the work if someone asks, but I'm wondering if this should also be backported and released as 5.8.5, since it's a fix for a fix in 5.8.4. With your go-ahead, once this is merged, I'll do all the backport work I can since this is my bug (everything but merging the backport PR).

I've opened #2909 to hopefully resolve the 1.11 failures.

@vchuravy
Member

vchuravy commented Oct 2, 2025

Yes we should backport if possible.

@bvdmitri

bvdmitri commented Oct 3, 2025

As far as I can tell the issue has been resolved; I haven't been able to reproduce it for a while. Thanks! Do you mind adding a couple of tests to reduce the probability of it happening again in the future? I couldn't easily find a single test for sum(...; dims = 1), so even a basic test (not sure how to test this particular bug, to be honest) would be a nice improvement. Something like the sketch below is what I have in mind; the shapes are arbitrary, and I don't know which sizes actually trigger the corruption:
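
```julia
using CUDA, Test

# Basic CPU-vs-GPU agreement check for dims=1 reductions; the shapes
# here are arbitrary, not ones known to trigger the original bug.
for (m, n) in ((10, 10), (1000, 1000), (1, 4096))
    a = rand(Float32, m, n)
    @test Array(sum(CuArray(a); dims=1)) ≈ sum(a; dims=1)
end
```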

@maleadt
Member

maleadt commented Oct 3, 2025

> I couldn't easily find a single test for sum(...; dims = 1)

Plenty of tests here: https://github.com/JuliaGPU/GPUArrays.jl/blob/3be4a0978f643b2322c4574f1c7d48722ef43eed/test/testsuite/reductions.jl#L101-L114
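
Those tests follow a `compare`-style pattern: run the same reduction on a plain `Array` and on the GPU array type, then check the results agree. A self-contained approximation is shown below; the `compare` helper here only mirrors the idea, while the real helper lives in the GPUArrays test suite linked above:

```julia
using CUDA, Test

# Stand-in for the testsuite's compare helper: apply `f` on a plain
# Array and on the GPU array type `AT`, then compare the results.
compare(f, AT, x) = Array(f(AT(x))) ≈ f(x)

@test compare(A -> sum(A; dims=1), CuArray, rand(Float32, 128, 64))
@test compare(A -> sum(A; dims=2), CuArray, rand(Float32, 128, 64))
```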

@maleadt maleadt added the bugfix This gets something working again. label Oct 3, 2025
@maleadt maleadt merged commit 3d278b6 into JuliaGPU:master Oct 3, 2025
2 of 3 checks passed
@christiangnrd christiangnrd deleted the patch-2 branch October 3, 2025 10:33
maleadt pushed a commit to christiangnrd/CUDA.jl that referenced this pull request Oct 3, 2025
maleadt pushed a commit that referenced this pull request Oct 7, 2025

Labels

bugfix This gets something working again.

Development

Successfully merging this pull request may close these issues.

Memory corruption in sum(...; dims = 1)

4 participants