I found a couple of issues while looking at the transpose tutorial.
First, the launch and kernel solutions could use block_unchecked policies. This will also allow the kernel implementation to skip the second sync threads call.
Second, it doesn't look like the launch solution actually uses shared memory as intended. Each value appears to be written by the same thread that read it. The point of shared memory here is to let one thread read a value and a different thread write it, so that memory accesses to both the input and output matrices are coalesced. Fixing this will require a teamSync call between the read and the write, which the launch solution currently lacks.
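
To illustrate the access pattern I mean, here is a minimal CPU-only sketch in plain C++. It is not the tutorial code and does not use RAJA. The names `TILE` and `tiled_transpose` are hypothetical, the local array stands in for the shared-memory tile, and the inner loops stand in for the threads of one team. The gap between the two phases is where the teamSync call would go on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width; stands in for the thread-block dimensions.
constexpr std::size_t TILE = 4;

// Transposes a row-major rows x cols matrix `a` into a row-major
// cols x rows result, one TILE x TILE tile at a time.
std::vector<int> tiled_transpose(const std::vector<int>& a,
                                 std::size_t rows, std::size_t cols)
{
  assert(a.size() == rows * cols);
  std::vector<int> at(rows * cols);

  // The two outer loops play the role of the team (block) indices.
  for (std::size_t by = 0; by < rows; by += TILE) {
    for (std::size_t bx = 0; bx < cols; bx += TILE) {
      int tile[TILE][TILE] = {};  // stands in for the shared-memory tile

      // Phase 1: "thread" (tx, ty) loads A(by + ty, bx + tx).
      // Consecutive tx values touch consecutive addresses of A.
      for (std::size_t ty = 0; ty < TILE; ++ty) {
        for (std::size_t tx = 0; tx < TILE; ++tx) {
          if (by + ty < rows && bx + tx < cols) {
            tile[ty][tx] = a[(by + ty) * cols + (bx + tx)];
          }
        }
      }

      // On a GPU a barrier is needed here, because phase 2 reads tile
      // entries that were loaded by other threads.

      // Phase 2: "thread" (tx, ty) stores the value loaded by "thread"
      // (ty, tx). Consecutive tx values touch consecutive addresses of At.
      for (std::size_t ty = 0; ty < TILE; ++ty) {
        for (std::size_t tx = 0; tx < TILE; ++tx) {
          if (bx + ty < cols && by + tx < rows) {
            at[(bx + ty) * rows + (by + tx)] = tile[tx][ty];
          }
        }
      }
    }
  }
  return at;
}
```

The important detail is the swapped index in phase 2 (`tile[tx][ty]` rather than `tile[ty][tx]`): it is what makes the writing thread differ from the reading thread, and it is also why the barrier between the phases cannot be omitted.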