
Conversation

@simeonschaub
Member

The current implementation has some performance issues as `kron_kernel!`
gets boxed, I believe due to self-recursion. The checks for whether to
transpose or conjugate also happen at runtime in each iteration, which
seems unnecessary.

This works around the issue described in
JuliaGPU/AMDGPU.jl#766 (comment),
though it would be good to look into further, since the old code is still
semantically correct.

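For context, a minimal sketch of the boxing pattern the description refers to (hypothetical code, not the actual GPUArrays kernel): a local function that calls itself is captured in a `Core.Box`, so calls to it go through dynamic dispatch.

```julia
# Hypothetical reproducer of the boxing pattern, not the GPUArrays code:
# a self-recursive local closure gets captured in a Core.Box.
function boxed_demo(n)
    fact(k) = k <= 1 ? 1 : k * fact(k - 1)  # closure captures itself
    return fact(n)
end

# `@code_warntype boxed_demo(5)` typically lists `fact` as a `Core.Box`,
# which is the kind of boxing described above.
```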
@github-actions
Contributor

github-actions bot commented Jun 17, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index bdfab65..2072ed7 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -796,7 +796,7 @@ for wrapa in trans_adj_wrappers, wrapb in trans_adj_wrappers
         backend = KernelAbstractions.get_backend(C)
         kernel = kron_kernel!(backend)
 
-        kernel(C, A, B, ndrange=(size(C, 1), size(C, 2)))
+        kernel(C, A, B, ndrange = (size(C, 1), size(C, 2)))
 
         return C
     end

Member

@maleadt left a comment


Yeah, I don't think we want to simply "work around" the issue without knowing what's up. If this doesn't fail on any other platform, and the @device_code output is identical, I don't see how splitting the kernels into separately-named ones could help.

The run-time check is fair, but I wonder if it wouldn't be better to express that in terms of the argument types, so that (ideally) the kernel can be entirely moved outside of the @eval block. No need to eval many versions of the kernel if we can rely on the compiler to optimize things away.
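As a rough illustration of that suggestion (a sketch assuming the wrappers in question are the LinearAlgebra `Transpose`/`Adjoint` types, not the merged code): indexing a wrapper already applies the flip and conjugation, so a single generic kernel can rely on the compiler specializing `getindex` per wrapper type instead of branching at run time.

```julia
using LinearAlgebra

A = [1 + 2im  3 + 4im;
     5 + 6im  7 + 8im]

# The wrapper's getindex does the work, so no per-iteration branch is needed:
transpose(A)[1, 2] == A[2, 1]        # true
adjoint(A)[1, 2] == conj(A[2, 1])    # true
```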

Significantly simplify indexing by not applying transposes and adjoints
manually and by using `fld1` and `mod1`. Also add some combinations
involving mixed vectors and matrices for generality
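For illustration, a hedged sketch of the kind of `fld1`/`mod1` indexing this commit message describes, written against KernelAbstractions (kernel name and launch code are illustrative, not the merged implementation):

```julia
using KernelAbstractions

@kernel function kron_sketch!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)
    p, q = size(B, 1), size(B, 2)
    # Entry (i, j) of kron(A, B) combines A[fld1(i, p), fld1(j, q)]
    # with B[mod1(i, p), mod1(j, q)]; the wrappers handle transpose/adjoint.
    @inbounds C[i, j] = A[fld1(i, p), fld1(j, q)] * B[mod1(i, p), mod1(j, q)]
end

# Launch over the full output, matching the ndrange used in the diff above:
# kron_sketch!(KernelAbstractions.get_backend(C))(C, A, B, ndrange = size(C))
```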
@maleadt
Member

maleadt commented Jun 18, 2025

Thanks, that definitely improves the implementation here. I'm still wary of treating this as a workaround for the AMDGPU.jl issue, though, as moving a kernel out shouldn't impact things (we can't use boxed values in GPU kernels, so that can't be the cause).

@simeonschaub
Member Author

simeonschaub commented Jun 18, 2025

I have opened an issue in AMDGPU.jl with a simple reproducer of the underlying issue: JuliaGPU/AMDGPU.jl#780. Should we go ahead with this and track that issue there?

@maleadt
Member

maleadt commented Jun 18, 2025

Yeah, that's perfect, thanks!

@maleadt merged commit 0093468 into JuliaGPU:master Jun 18, 2025
13 of 16 checks passed
@simeonschaub deleted the sds/kron branch June 18, 2025 14:07