🚀 The feature, motivation and pitch
Combine the inner and outer reductions into one kernel:
- Do a partial outer reduction while blocks loop over the outer domain performing the block inner reduction.
- Write the result of the partial outer reduction to gmem.
- Sync, then reload the partial results from gmem.
- Remap the parallelization pattern to finalize the outer reduction.
Used in ln_backward.
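To make the steps above concrete, here is a minimal CPU-side sketch in plain Python. It is not the proposed kernel or scheduler, only a simulation of the data flow under some assumptions: the inner reduction is a per-row sum, the outer reduction is a per-column sum, and blocks stride over rows. The names `fused_inner_outer_reduction`, `num_blocks`, and `gmem_partial` are hypothetical and chosen for illustration.

```python
def fused_inner_outer_reduction(x, num_blocks):
    """Simulate one fused pass over a 2D input given as a list of rows.

    Returns (row_sums, col_sums): the inner reduction (over columns, one
    value per row) and the outer reduction (over rows, one value per column).
    """
    n_rows = len(x)
    n_cols = len(x[0]) if n_rows else 0
    row_sums = [0.0] * n_rows
    # Stand-in for the global-memory workspace: one partial vector per block.
    gmem_partial = [[0.0] * n_cols for _ in range(num_blocks)]

    # Phase 1: each simulated block strides over the outer (row) domain.
    for block in range(num_blocks):
        # Per-block accumulator (registers or shared memory on a GPU).
        local_partial = [0.0] * n_cols
        for row in range(block, n_rows, num_blocks):
            # Block inner reduction: reduce this row over its columns.
            row_sums[row] = sum(x[row])
            # Partial outer reduction: accumulate the same row per column.
            for col in range(n_cols):
                local_partial[col] += x[row][col]
        # Write this block's partial outer result to "gmem".
        gmem_partial[block] = local_partial

    # A real kernel would need a grid-wide sync here before reloading
    # the partial results; the sequential loop above makes it implicit.

    # Phase 2: remapped parallelization. Work is now split over columns,
    # and each column is reduced across the per-block partials.
    col_sums = [
        sum(gmem_partial[block][col] for block in range(num_blocks))
        for col in range(n_cols)
    ]
    return row_sums, col_sums


x = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]
row_sums, col_sums = fused_inner_outer_reduction(x, num_blocks=2)
```

In this sketch the input is read once and feeds both reductions, which is the point of fusing them; the second phase only touches the much smaller `num_blocks` x columns workspace.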
Alternatives
No response
Additional context
No response