@lj970926
Contributor

PR types

Performance optimization

PR changes

OPs

Description

Change fused_gemm_epilogue to use a single unified fc_fusion call.

1,
errors::InvalidArgument(
"FusedGemm do not support batched fc now, but got batch size %d.",
batch_size));
Contributor Author

@lj970926 lj970926 Jan 30, 2024

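Batched fc is deliberately left unsupported here, for the following reasons:

  1. Neither the GPU implementation nor the unit tests currently support batched_fc.
  2. fc_batched currently does not support fusing bias and activation.
  3. This kernel is currently only called from FusedLinear; since the weights are 2-D, batched_fc never occurs.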
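
To make the third point concrete, here is a minimal standalone sketch of the guard. It assumes, purely for illustration, that the batch size is the product of every dimension before the last two; `BatchSizeOf` and `EnforceNotBatched` are hypothetical names, and a plain C++ exception stands in for the Paddle enforce macro with `errors::InvalidArgument` used in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: treats every dimension before the last two as a batch
// dimension. This is an assumption for illustration, not Paddle's shape logic.
inline int64_t BatchSizeOf(const std::vector<int64_t>& dims) {
  assert(dims.size() >= 2 && "a matmul operand needs at least two dimensions");
  return std::accumulate(dims.begin(), dims.end() - 2, int64_t{1},
                         std::multiplies<int64_t>());
}

// Stand-in for the enforce check quoted above: rejects any batch size other
// than 1 with an invalid-argument error carrying the same message.
inline void EnforceNotBatched(int64_t batch_size) {
  if (batch_size != 1) {
    throw std::invalid_argument(
        "FusedGemm do not support batched fc now, but got batch size " +
        std::to_string(batch_size) + ".");
  }
}
```

Under that assumption a 2-D weight such as `{128, 64}` always yields a batch size of 1, so the guard only fires if a future caller passes weights with extra leading dimensions.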

Contributor

@zhangyk0314 zhangyk0314 left a comment

LGTM

Contributor

@cqulilujia cqulilujia left a comment

LGTM

PADDLE_ENFORCE_XDNN_SUCCESS(r, "gelu");
XPUType* out_ptr = reinterpret_cast<XPUType*>(dev_ctx.template Alloc<T>(out));

decltype(&xpu_fc_wrapper<XPUType, int16_t>) fc_api_list[5] = {
Contributor

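The fc_api_list here looks identical to the one defined in the MatMulXPUFunction function in phi/kernels/xpu/xpu_api_wrapper.h?
Writing it this way is fine for now, but is there a better or more elegant way to reduce the duplicated code? Also, if one copy is updated later and the two get out of sync, I wonder whether that could cause strange problems.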

Contributor

My changes to xpu_api_wrapper.h are still waiting for the xhpc update to be released. I'll mark this for now and update this spot as well when I update xpu_api_wrapper.h in my next PR.

Contributor Author

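> My changes to xpu_api_wrapper.h are still waiting for the xhpc update to be released. I'll mark this for now and update this spot as well when I update xpu_api_wrapper.h in my next PR.

I think selecting and running fc_fusion or fc_batched based on fccal_type could be extracted into a separate function.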

Contributor

> I think selecting and running fc_fusion or fc_batched based on fccal_type could be extracted into a separate function.

xpu_api_wrapper.h has a MatMulXPUFunction that implements exactly this.

Contributor

However, looking at that function, it only chooses between fc_fusion and fc_batched based on batch_size, so I'm not sure whether it covers your case.
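
One possible shape for the shared helper discussed in this thread is sketched below. Everything in it is hypothetical: `FcCalcType`, `FcTable`, `RunFc`, and the three stub functions are illustrative names rather than Paddle or XDNN APIs, and the table has three entries instead of the five in fc_api_list. It only shows the idea of defining the dispatch table once and indexing it by calculation type, so that two call sites cannot drift apart.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical calculation types; the real set lives in Paddle's XPU headers.
enum class FcCalcType : std::size_t { kInt16 = 0, kInt32, kFloat, kCount };

constexpr std::size_t kNumFcCalcTypes =
    static_cast<std::size_t>(FcCalcType::kCount);

// Common signature shared by every entry in the table.
using FcFn = std::string (*)(int m, int n, int k);

// Stubs standing in for the typed FC wrappers; each one only reports its name.
inline std::string FcInt16(int, int, int) { return "fc_int16"; }
inline std::string FcInt32(int, int, int) { return "fc_int32"; }
inline std::string FcFloat(int, int, int) { return "fc_float"; }

// The table is defined in exactly one place, so every caller sees the same
// entries in the same order.
inline const std::array<FcFn, kNumFcCalcTypes>& FcTable() {
  static const std::array<FcFn, kNumFcCalcTypes> table = {FcInt16, FcInt32,
                                                          FcFloat};
  return table;
}

// Selects the wrapper for the given calculation type and runs it.
inline std::string RunFc(FcCalcType type, int m, int n, int k) {
  const auto index = static_cast<std::size_t>(type);
  if (index >= kNumFcCalcTypes) {
    throw std::out_of_range("unknown FcCalcType");
  }
  const FcFn fn = FcTable()[index];
  assert(fn != nullptr);
  return fn(m, n, k);
}
```

With a helper of this kind, both the fused GEMM kernel and MatMulXPUFunction would call `RunFc` instead of each declaring its own array, and any batch-size branching could stay in the caller.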

@houj04 houj04 merged commit c6c9697 into PaddlePaddle:develop Feb 1, 2024
@houj04 houj04 added the XPU label Sep 4, 2024

5 participants