[XPU][PHI Kernels] fuse matmul+bias+act for xpu #61350
    1,
    errors::InvalidArgument(
        "FusedGemm do not support batched fc now, but got batch size %d.",
        batch_size));
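There are several reasons for not supporting batched_fc here:
- Neither the GPU implementation nor the unit tests currently support batched_fc.
- fc_batched does not currently support fusing bias and activation.
- This kernel is currently only called from FusedLinear; because the weights are 2-D, batched_fc cannot occur.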
zhangyk0314 left a comment
LGTM
cqulilujia left a comment
LGTM
    PADDLE_ENFORCE_XDNN_SUCCESS(r, "gelu");
    XPUType* out_ptr = reinterpret_cast<XPUType*>(dev_ctx.template Alloc<T>(out));

    decltype(&xpu_fc_wrapper<XPUType, int16_t>) fc_api_list[5] = {
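The fc_api_list here looks identical to the one defined in the MatMulXPUFunction function in phi/kernels/xpu/xpu_api_wrapper.h?
Writing it this way is fine for now, but is there a better or more elegant way to reduce the duplicated code? Also, if one copy is updated later and the two are not kept in sync, I am not sure whether that could cause strange problems.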
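One possible way to avoid the duplication raised above is to define the table once in a shared header and have both call sites fetch it through an accessor. The following is only a minimal sketch under stated assumptions: `FcFn`, `FcWrapperStub`, `FcCalcType`, `SharedFcApiList`, and `RunFc` are hypothetical names invented for illustration, the stub signature is far simpler than the real `xpu_fc_wrapper`, and the table has three entries rather than the five in the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the real wrapper signature; the actual
// xpu_fc_wrapper in Paddle takes many more arguments.
using FcFn = int (*)(int m, int n, int k);

// Stub body for illustration only: returns the byte width of the compute
// type so a caller can see which instantiation was selected.
template <typename ComputeT>
int FcWrapperStub(int /*m*/, int /*n*/, int /*k*/) {
  return static_cast<int>(sizeof(ComputeT));
}

// Hypothetical calc-type enum; the names and the count are assumptions.
enum class FcCalcType : std::size_t { kInt16 = 0, kInt32 = 1, kFloat = 2 };

constexpr std::size_t kNumFcCalcTypes = 3;

// Single definition of the table. Both call sites would include this header
// and call SharedFcApiList() instead of declaring their own array.
inline const std::array<FcFn, kNumFcCalcTypes>& SharedFcApiList() {
  static const std::array<FcFn, kNumFcCalcTypes> table = {
      &FcWrapperStub<std::int16_t>,
      &FcWrapperStub<std::int32_t>,
      &FcWrapperStub<float>,
  };
  return table;
}

// Looks up the wrapper for the requested calc type and runs it.
// std::array::at throws std::out_of_range for an invalid index.
inline int RunFc(FcCalcType type, int m, int n, int k) {
  FcFn fn = SharedFcApiList().at(static_cast<std::size_t>(type));
  assert(fn != nullptr);
  return fn(m, n, k);
}
```

With a single accessor, adding or reordering entries happens in one place, so the two kernels cannot drift apart.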
My changes to xpu_api_wrapper.h are still waiting on an updated xhpc release. I will mark this for now and update this part as well when I update xpu_api_wrapper.h in my next PR.
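> My changes to xpu_api_wrapper.h are still waiting on an updated xhpc release. I will mark this for now and update this part as well when I update xpu_api_wrapper.h in my next PR.

I think selecting and running fc_fusion or fc_batched based on fccal_type could be extracted into a separate function.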
> I think selecting and running fc_fusion or fc_batched based on fccal_type could be extracted into a separate function.

MatMulXPUFunction in xpu_api_wrapper.h already implements this.
However, looking at that function, it only chooses between fc_fusion and fc_batched based on batch_size, so I am not sure whether it covers your case.
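The selection logic discussed in this thread can be sketched as a small standalone helper. This is an illustration only, not the code in MatMulXPUFunction: `FcPath`, `FcRequest`, and `SelectFcPath` are hypothetical names, and the rule that the batched path rejects bias or activation is taken from the earlier comment that fc_batched cannot fuse them.

```cpp
#include <stdexcept>

// Hypothetical labels for the two code paths named in the review.
enum class FcPath { kFusion, kBatched };

// Hypothetical description of one matmul request.
struct FcRequest {
  int batch_size;
  bool has_bias;
  bool has_activation;
};

// Chooses a path from batch_size, as the reviewer describes, and additionally
// rejects combinations that the batched path is said not to support.
inline FcPath SelectFcPath(const FcRequest& req) {
  if (req.batch_size < 1) {
    throw std::invalid_argument("batch_size must be at least 1");
  }
  if (req.batch_size == 1) {
    // The non-batched path can fuse bias and activation.
    return FcPath::kFusion;
  }
  if (req.has_bias || req.has_activation) {
    // Assumption from the thread: fc_batched cannot fuse bias or activation.
    throw std::invalid_argument(
        "batched fc does not support fused bias or activation");
  }
  return FcPath::kBatched;
}
```

Under these assumptions, a FusedLinear call with 2-D weights always has batch_size 1 and therefore always takes the fusion path, which matches the rationale for rejecting batched input in this kernel.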
PR types
Performance optimization
PR changes
OPs
Description
Change fused_gemm_epilogue to use a single, unified fc_fusion call.