Conversation
|
Check memory usage too. Naive im2col can use a huge amount of memory (maybe not the CPU version?), so even if your code is slower it's worth adding such an in-place version, especially for training conv layers where memory footprint matters a lot. Isn't vec_dot_f16/f32 faster than OpenMP for computing the inner products? |
|
Another option is to use im2col+mm in a tiled fashion: in a loop, compute im2col into a temporary buffer for a fixed batch of output patches, then call mul_mat to compute part of the result. It has the advantage of using a fixed amount of memory (e.g. 16 MB) while still being able to piggy-back on all the investment going into gemm kernels (like LLAMAFILE). I'm sure the direct approach can be faster in theory, but it might be a tall order to get there. Not that I want to discourage you, I could be totally wrong and it's great to have options :) I've been using the tiled method on convolution-heavy models for a while, but have only implemented it for the contiguous-channels layout so far. Will try to add a regular version and do some comparisons. |
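The tiled idea above can be sketched roughly as follows. This is a minimal illustration, not the ggml implementation: all names, shapes, and the NHWC (contiguous-channels) layout are assumptions for the example, and the inner matmul is a naive loop standing in for a real gemm kernel.

```c
#include <string.h>
#include <assert.h>
#include <math.h>

/* Tiled im2col + matmul sketch (illustrative names, not ggml API).
 * Input is NHWC-like: [H][W][C_IN], kernel [C_OUT][KH][KW][C_IN],
 * output [OH][OW][C_OUT]. Instead of materializing the full im2col
 * matrix (OH*OW rows of K = KH*KW*C_IN columns), we fill a small
 * fixed-size buffer for TILE output positions at a time, then run a
 * matmul on just that slice. Memory use is bounded by TILE * K. */

enum { H = 5, W = 5, C_IN = 2, KH = 3, KW = 3, C_OUT = 4, TILE = 4 };
enum { OH = H - KH + 1, OW = W - KW + 1, K = KH * KW * C_IN };

static void conv2d_tiled(const float *in, const float *ker, float *out) {
    float buf[TILE * K];               /* fixed-size im2col scratch */
    int n_pos = OH * OW;
    for (int p0 = 0; p0 < n_pos; p0 += TILE) {
        int n = (p0 + TILE <= n_pos) ? TILE : n_pos - p0;
        /* im2col for this batch of output positions only */
        for (int t = 0; t < n; t++) {
            int oy = (p0 + t) / OW, ox = (p0 + t) % OW;
            float *row = buf + t * K;
            for (int ky = 0; ky < KH; ky++)
                for (int kx = 0; kx < KW; kx++)
                    memcpy(row + (ky * KW + kx) * C_IN,
                           in + ((oy + ky) * W + (ox + kx)) * C_IN,
                           C_IN * sizeof(float));
        }
        /* partial matmul: out rows p0..p0+n = buf (n x K) * ker^T */
        for (int t = 0; t < n; t++)
            for (int co = 0; co < C_OUT; co++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++)
                    acc += buf[t * K + k] * ker[co * K + k];
                out[(p0 + t) * C_OUT + co] = acc;
            }
    }
}
```

In a real backend the inner loop would be a call into the existing mul_mat/gemm path, which is the point: the scratch buffer stays at a fixed size while the optimized kernels do the heavy lifting.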
|
Yes, I agree it's quite difficult to match im2col + gemm performance without heavy optimisations, which leads to code that is not really maintainable. The tiled approach is interesting; if you have a contiguous-channels implementation lying around, I can try to implement the regular version and get some numbers. |
|
Implementation is currently here; it still needs to make sure to allocate some scratch space. Contiguous-channels has some advantages: a 1x1-kernel conv2d is a direct mul_mat, and the result doesn't need to be permuted.
Adding as a draft because at the moment it isn't always faster than doing im2col, though in some cases it is. Looking to optimize this solution, as it's currently completely unoptimized, but it might be useful for #14316
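To illustrate the 1x1 point mentioned above: with contiguous channels (NHWC), a 1x1-kernel conv2d reads each pixel's channel vector contiguously, so it reduces to a single matmul with no im2col and no output permutation. A tiny sketch, with illustrative names and shapes that are assumptions for the example:

```c
#include <assert.h>
#include <math.h>

enum { HW = 6, C_IN = 3, C_OUT = 2 };

/* 1x1 conv over an NHWC tensor viewed as a (HW x C_IN) matrix:
 * out (HW x C_OUT) = in (HW x C_IN) * ker^T (C_IN x C_OUT).
 * Each output pixel is just a dot product of its input channel
 * vector with each filter; no patch extraction is needed. */
static void conv1x1(const float *in, const float *ker, float *out) {
    for (int p = 0; p < HW; p++)
        for (int co = 0; co < C_OUT; co++) {
            float acc = 0.0f;
            for (int c = 0; c < C_IN; c++)
                acc += in[p * C_IN + c] * ker[co * C_IN + c];
            out[p * C_OUT + co] = acc;
        }
}
```

With the channels-last layout both operands are already in the order a gemm wants, which is why this case can be dispatched directly to mul_mat.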