We have optimized codegen for pack ops on f32 types, but not on int8. This is a tracking issue for the int8 case. I observed that some pack ops are not vectorized, because masking is only supported on a limited set of ops when shapes are dynamic. We should relax the condition to use isElementwise(), so that linalg.transpose ops can also be vectorized. I have an easy fix locally and will send it out for review.
With that change and better distribution logic, we can reduce total dispatch sizes by up to 43% for int8 models; see https://gist.github.com/iree-github-actions-bot/fa5becb880b9a6afc2d362883a585d5a
The next step is better pack codegen for non-f32 types. We need a pattern that handles packs whose innermost tile is a single element, so that they can leverage the 16x16 transpose lowering. Looking at the transpose permutation map and using the vector.bitcast op should help here.
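To make the single-element innermost tile case concrete, here is a minimal Python sketch (not the IREE implementation) of why such a pack behaves like a block transpose. It assumes tensor.pack-style semantics on a 2-D source with no padding and the default dimension order; the function names `pack` and `pack_via_transpose` are hypothetical and exist only for this illustration.

```python
def pack(src, tile_m, tile_k):
    """Reference pack: result[i][j][ii][jj] = src[i*tile_m + ii][j*tile_k + jj]."""
    m, k = len(src), len(src[0])
    # Assumption: shapes divide evenly, so no padding value is needed.
    assert m % tile_m == 0 and k % tile_k == 0
    return [
        [
            [
                [src[i * tile_m + ii][j * tile_k + jj] for jj in range(tile_k)]
                for ii in range(tile_m)
            ]
            for j in range(k // tile_k)
        ]
        for i in range(m // tile_m)
    ]


def pack_via_transpose(src, tile_m):
    """Same result as pack(src, tile_m, 1), built from block transposes."""
    out = []
    for i in range(0, len(src), tile_m):
        block = src[i:i + tile_m]                         # tile_m rows, K columns
        transposed = [list(col) for col in zip(*block)]   # K rows, tile_m columns
        # Append the trailing unit dimension for the 1-element inner tile.
        out.append([[[v] for v in row] for row in transposed])
    return out


# 32x32 source with values that fit in int8.
src = [[(r * 32 + c) % 128 for c in range(32)] for r in range(32)]
packed = pack(src, 16, 1)
same = packed == pack_via_transpose(src, 16)
```

In this sketch each 16-row block of the source is transposed independently, which is the shape of work a 16x16 transpose lowering targets. How the element type (int8 versus f32) and vector.bitcast fit in is left out here, since that depends on the actual lowering.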