Improve PowKernel and PowGradKernel for GPU #74638

zrr1999 · 2025-08-15T07:36:57Z

PR Category

Operator Mechanism

PR Types

Performance

Description

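Changes

Previously, Paddle's GPU PowKernel was specialized only for factor 0 and 1. This PR adds specializations for 0.5, 2, 3, -0.5, -1, and -2 (aligned with PyTorch); -0.5, -1, and -2 are not specialized for integer types.
Previously, Paddle's GPU PowGradKernel was specialized only for factor 0. This PR adds specializations for 1, 1.5, 2, 3, 4, 0.5, and -1 (aligned with PyTorch); 0.5 and -1 are not specialized for integer types.
In PowGradKernel, the factor 1 specialization uses a copy, and the factor 2 specialization reuses an existing functor. The -1 and 0.5 specializations use the new CudaReciprocalGradDepXFunctor and CudaSqrtGradDepXFunctor, which depend on X rather than out. The 1.5, 3, and 4 specializations use the new CudaPow1p5GradFunctor, CudaCubeGradFunctor, and CudaPow4GradFunctor.
In PowKernel, the 0.5, 2, -0.5, and -1 specializations reuse existing functors. The 3 and -2 specializations use the new CudaCubeFunctor and CudaRsquareFunctor.
The non-specialized paths of PowKernel and PowGradKernel now match PyTorch's precision through a change in computation order.
MPTypeTrait adds a complex-to-complex promotion, aligned with PyTorch.
The exponent was stored as float, which loses precision. ~~This PR changes it to double~~ This PR changes it to T, aligned with PyTorch.
compute_pow_grad_dx and compute_pow_grad_dy in ElementwisePowGradKernel now match PyTorch's precision through changes in computation order and precision.
In the test_pow_op unit test, numpy treats float(np.random.uniform(1, 2, [])) as float32, although the value is actually float64. After Paddle is aligned with PyTorch, this input is treated as float64, so .astype(np.float32) is added to make the float64 and float32 representations of the value identical, which keeps it consistent with the numpy input.
In the implementation of ElementwiseInversePowFunctor<ComplexType>, the GPU logic is corrected from pow(a, b) to pow(b, a).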
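
To make the forward specialization list concrete, here is a minimal scalar sketch in Python. It is not the CUDA code from this PR: the function name `pow_specialized` and the `is_integer_dtype` flag are hypothetical, and each branch only stands in for the functor that the description says is reused or added.

```python
import math


def pow_specialized(x, factor, is_integer_dtype=False):
    """Scalar stand-in for the factor dispatch (hypothetical helper, not Paddle code)."""
    if factor == 0:
        return 1 if is_integer_dtype else 1.0
    if factor == 1:
        return x
    if factor == 2:
        return x * x                    # square
    if factor == 3:
        return x * x * x                # cube
    if factor == 0.5:
        return math.sqrt(x)             # sqrt
    if not is_integer_dtype:
        # Negative factors are specialized only for non-integer types.
        if factor == -0.5:
            return 1.0 / math.sqrt(x)   # rsqrt
        if factor == -1:
            return 1.0 / x              # reciprocal
        if factor == -2:
            return 1.0 / (x * x)        # reciprocal of the square
    # Everything else takes the generic path.
    return math.pow(x, factor)
```

For example, `pow_specialized(4.0, -0.5)` takes the rsqrt branch and returns 0.5, while `pow_specialized(2.0, 2.5)` falls through to the generic `math.pow` call.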

Specialization approach

PyTorch's backward implementation is as follows:

Tensor pow_backward_self(
    const Tensor& grad,
    const Tensor& self,
    const Tensor& exponent) {
  auto out = at::where(
      exponent == 0.0,
      at::scalar_tensor(0.0, grad.options()),
      grad * (exponent * self.pow(exponent - 1)).conj());
  return handle_r_to_c(self, std::move(out));
}
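
For readers less familiar with ATen, the same expression can be sketched for plain Python scalars. This is an illustration only: it works on single `float` or `complex` values instead of tensors, and it omits the `handle_r_to_c` step.

```python
def pow_backward_self(grad, x, exponent):
    """Scalar sketch of d(x**exponent)/dx times grad, following the expression above."""
    if exponent == 0:
        # Mirrors the at::where branch: the gradient is exactly zero.
        return 0.0
    # float, int, and complex all provide conjugate() in Python.
    return grad * (exponent * x ** (exponent - 1)).conjugate()
```

With `grad = 1.0`, `x = 3.0`, and `exponent = 2.0` this returns 6.0, the familiar derivative 2x. Note that the only power evaluated in the backward pass is `x ** (exponent - 1)`.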

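Comparing PyTorch and Paddle: PyTorch's backward and forward passes share the same operator library, so once pow is specialized in the forward pass, the backward pass needs no extra specialization. The Paddle implementation, however, has to take the following points into account:

Many PyTorch operators share one implementation for real and complex numbers, so some computations do not take the optimal path.
Because PyTorch uses a unified implementation (covering both real and complex numbers), its computation order is not the usual left-to-right order. For example, the optimal backward for 1/x would be -dout/(x*x), but PyTorch computes dout*(-1/(x*x)). Likewise, the backward for x^3, dout * three * x * x, has to be changed to dout * (three * (x * x)).
PyTorch's forward pass specializes 0, 0.5, 1, 2, 3, -0.5, -1, and -2. Since the backward evaluates self.pow(exponent - 1), it follows from the implementation above that the backward pass is effectively specialized for 1, 1.5, 2, 3, 4, 0.5, 0, and -1.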
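
The points above can be sketched in a few lines of Python. The names are hypothetical and the functions operate on scalars; the sketch only shows how the backward list follows from the forward list and what the PyTorch-matching grouping looks like.

```python
FORWARD_SPECIALIZED = [0, 0.5, 1, 2, 3, -0.5, -1, -2]

# The backward evaluates self.pow(exponent - 1), so it reaches a forward
# fast path exactly when exponent - 1 is in the forward list.
backward_specialized = [e + 1 for e in FORWARD_SPECIALIZED]


def cube_grad_sequential(dout, x):
    # Usual left-to-right order: ((dout * 3) * x) * x
    return dout * 3.0 * x * x


def cube_grad_torch_order(dout, x):
    # Grouping that follows grad * (exponent * self.pow(exponent - 1))
    return dout * (3.0 * (x * x))


def reciprocal_grad_torch_order(dout, x):
    # dout * (-1 / (x * x)) instead of -dout / (x * x)
    return dout * (-1.0 / (x * x))
```

Here `backward_specialized` evaluates to `[1, 1.5, 2, 3, 4, 0.5, 0, -1]`, matching the list in the description. The two cube variants are mathematically equal; in floating point the intermediate roundings happen at different steps, so their results may differ in the last bits, which is why the grouping matters for bitwise alignment.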

Remaining issues

After this PR is merged, the following cases are still not aligned:

paddle.pow(Tensor([2, 3, 4],"float32"), Tensor([],"float32"), )
paddle.pow(Tensor([20, 1],"float32"), Tensor([],"float32"), )
paddle.pow(Tensor([20000, 1],"float32"), Tensor([],"float32"), )
paddle.pow(Tensor([20600, 1],"float32"), Tensor([],"float32"), )
paddle.pow(Tensor([4, 3, 2],"float32"), Tensor([4, 3, 2],"float16"), )
paddle.pow(Tensor([4, 3, 2],"float64"), Tensor([4, 3, 2],"float16"), )
paddle.pow(Tensor([4, 3, 2],"float64"), Tensor([4, 3, 2],"float32"), )
paddle.pow(Tensor([5, 9, 7],"float64"), Tensor([7],"float64"), )
paddle.pow(Tensor([],"float32"), Tensor([209],"float32"), )
paddle.pow(Tensor([4, 3, 2],"float64"), Tensor([4, 3, 2],"float32"), )
paddle.pow(Tensor([4, 3, 2],"float64"), Tensor([4, 3, 2],"float16"), )

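The other issues are summarized below:

Specialization currently covers only pow(tensor, scalar); pow(tensor, tensor) still needs to be added.
When the first input is a constant or a 0-d tensor, PyTorch uses a separate kernel. Paddle cannot align with it at present and needs a new kernel.
When a and b have different shapes, broadcasting may introduce nondeterminism.
When a and b have different dtypes and a has higher precision than b: analysis of PyTorch's implementation (grad * (exponent * self.pow(exponent - 1)).conj()) shows that PyTorch's backward reuses the forward operator and has no type promotion of its own, so exponent - 1 is computed in the exponent's lower precision and introduces an error. Type promotion does take place when self.pow(exponent - 1) is evaluated, but by then the precision has already been lost. Paddle promotes the types up front, so the equivalent of exponent - 1 is computed at the higher precision from the start; the subsequent computation can be aligned with PyTorch.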
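
The dtype-mismatch effect can be reproduced with the standard library alone. The sketch below is an illustration of the mechanism, not PyTorch or Paddle code: it uses the half-precision `e` format of Python's `struct` module to stand in for a float16 exponent, and the helper name `to_float16` is hypothetical.

```python
import struct


def to_float16(value):
    """Round a Python float (float64) to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


exponent = to_float16(2.0 ** -13)  # exactly representable in float16

# Subtract in the exponent's own low precision, promote afterwards
# (the order the note above attributes to PyTorch's backward).
minus_one_low = to_float16(exponent - 1.0)

# Promote first, then subtract in float64
# (the order the note above attributes to Paddle).
minus_one_high = exponent - 1.0

base = 3.0
print(minus_one_low, minus_one_high)
print(base ** minus_one_low, base ** minus_one_high)
```

In float16 the difference rounds to exactly -1.0, whereas the promoted subtraction keeps -(1 - 2^-13). The later power is evaluated at high precision in both cases, but it starts from different exponents, so the results differ.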

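Other cases that cannot be aligned with PyTorch:

For tensor^scalar where the tensor is int and the scalar is float, torch returns f32, while Paddle returns the same dtype as the tensor.

Other issues:

The following items are unrelated to the specific changes in this PR:

The random input generation issue will be fixed in [Accuracy diff No.78、142、143] Improve get_numpy_tensor for rpow and pow PFCCLab/PaddleAPITest#528.
The complex implementations of PowGradDX and PowGradDY may have problems.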

Pcard-67164

paddle-bot · 2025-08-15T07:37:05Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

wanghuancoder

LGTM

zrr1999 added 3 commits August 14, 2025 19:54

format compute_pow

91f0193

refactor BaseCudaPowFunctor

2288558

improve PowKernel

f6d091a

zrr1999 changed the title ~~Acc/pow~~ Improve PowKernel and PowGradKernel for GPU Aug 15, 2025

zrr1999 added 5 commits August 15, 2025 16:31

add test

b638563

add test

5bbcce5

align torch

d8d14cb

add test

f8874af

rm unused functor

134fabc

zrr1999 mentioned this pull request Aug 18, 2025

[Accuracy diff No.78、142、143] Improve get_numpy_tensor for rpow and pow PFCCLab/PaddleAPITest#528

Merged

zrr1999 added 11 commits August 19, 2025 11:21

fix windows

9c3b823

fix one ele tensor

5642758

fix pow

d82fde8

exponent use f64

1234aa6

fix

fbd96ee

add amp

7a584f2

fix cudapowfunctor

7e8a321

add complex test

806ef4f

fix ElementwiseInversePowFunctor

fbdf325

fix ele_pow acc

ac7773a

fix test

296b4ae

wanghuancoder approved these changes Aug 22, 2025

wanghuancoder merged commit 5cb6b67 into PaddlePaddle:develop Aug 25, 2025
177 of 190 checks passed

wanghuancoder mentioned this pull request Oct 24, 2025

[Cherry-pick Fleety_12] Bigtensor and api precision #76023

Closed

zhengshengning mentioned this pull request Oct 24, 2025

[Cherry-pick Fleety_12] Bigtensor and api precision #76028

Merged
