improve kron implementation
#600
Conversation
The current implementation has some performance issues as `kron_kernel!` gets boxed, I believe due to self-recursion. The checks for whether to transpose or conjugate also happen at runtime in each iteration, which seems unnecessary. This works around the issue described in JuliaGPU/AMDGPU.jl#766 (comment), though it would be good to look into it further, since the old code is still semantically correct.
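For illustration, a hypothetical sketch of the per-iteration runtime-check pattern described above (names and flags invented here, not the PR's actual code): Boolean flags are re-tested for every output element even though they are fixed for the whole launch.

```julia
using KernelAbstractions

# Hypothetical illustration (not the original source): every thread re-tests
# conjA/conjB for every element it writes, even though the flags never change
# during a launch.
@kernel function kron_flags_kernel!(C, A, B, conjA::Bool, conjB::Bool)
    i, j = @index(Global, NTuple)
    m, n = size(B)
    @inbounds a = A[fld1(i, m), fld1(j, n)]
    @inbounds b = B[mod1(i, m), mod1(j, n)]
    conjA && (a = conj(a))   # branch evaluated per element
    conjB && (b = conj(b))
    @inbounds C[i, j] = a * b
end
```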
Your PR requires formatting changes to meet the project's style guidelines. The suggested changes:

```diff
diff --git a/src/host/linalg.jl b/src/host/linalg.jl
index bdfab65..2072ed7 100644
--- a/src/host/linalg.jl
+++ b/src/host/linalg.jl
@@ -796,7 +796,7 @@ for wrapa in trans_adj_wrappers, wrapb in trans_adj_wrappers
         backend = KernelAbstractions.get_backend(C)
         kernel = kron_kernel!(backend)
-        kernel(C, A, B, ndrange=(size(C, 1), size(C, 2)))
+        kernel(C, A, B, ndrange = (size(C, 1), size(C, 2)))
         return C
     end
```
maleadt left a comment
Yeah, I don't think we want to simply "work around" the issue without knowing what's up. If this doesn't fail on any other platform, and the @device_code output is identical, I don't see how splitting the kernels into separately-named ones could help.
The run-time check is fair, but I wonder if it wouldn't be better to express that in terms of the argument types, so that (ideally) the kernel can be entirely moved outside of the @eval block. No need to eval many versions of the kernel if we can rely on the compiler to optimize things away.
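To make that suggestion concrete, a minimal sketch (assuming a KernelAbstractions-style kernel; this is not the PR's actual code) could index `A` and `B` directly, so that when they arrive as `Transpose`/`Adjoint` wrappers the compiler specializes on those types and no per-element flag or `@eval`-generated variant is needed.

```julia
using KernelAbstractions, LinearAlgebra

# One generic kernel: the transpose/adjoint behaviour lives in the argument
# types (Transpose/Adjoint wrappers), so each combination compiles to its own
# specialized kernel with no runtime branching in the body.
@kernel function generic_kron_kernel!(C, A, B)
    i, j = @index(Global, NTuple)
    m, n = size(B)
    @inbounds C[i, j] = A[fld1(i, m), fld1(j, n)] * B[mod1(i, m), mod1(j, n)]
end

# Hypothetical usage: the wrapper types, not Bool flags, select the behaviour.
# generic_kron_kernel!(backend)(C, transpose(A), adjoint(B), ndrange = size(C))
```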
Significantly simplify indexing by not applying transposes and adjoints manually and by using `fld1` and `mod1`. Also add some combinations involving mixed vectors and matrices for generality.
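As a quick sanity check of the `fld1`/`mod1` index mapping the commit refers to (an assumed restatement of Kronecker-product indexing, not code from the PR), it can be verified on the CPU:

```julia
using LinearAlgebra, Test

# For A (p×q) and B (m×n), kron(A, B)[i, j] should equal
# A[fld1(i, m), fld1(j, n)] * B[mod1(i, m), mod1(j, n)].
A, B = rand(3, 2), rand(4, 5)
m, n = size(B)
K = kron(A, B)
@test all(K[i, j] ≈ A[fld1(i, m), fld1(j, n)] * B[mod1(i, m), mod1(j, n)]
          for i in axes(K, 1), j in axes(K, 2))
```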
Thanks, that definitely improves the implementation here. I'm still wary of treating this as a workaround for the AMDGPU.jl issue though, as moving a kernel out shouldn't impact things (we can't use boxed values in GPU kernels, so that can't be the cause).
I have opened an issue in AMDGPU.jl with a simple reproducer of the underlying issue: JuliaGPU/AMDGPU.jl#780. Should we go ahead with this and track that issue there?
Yeah, that's perfect, thanks!