[PHI] Fix paddle.cumsum calculation speed #74442
Merged
lshpku merged 8 commits into PaddlePaddle:develop on Aug 12, 2025
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Force-pushed from 05a1ccc to f164405
BlockPrefixCallbackOp paddle.cumsum calculation speed
wanghuancoder approved these changes Aug 12, 2025
lshpku approved these changes Aug 12, 2025
maxiaolong001 pushed a commit to maxiaolong001/Paddle that referenced this pull request Aug 12, 2025
* fix ThrustCumsumKernel
* refine
* refine ThrustCumsumKernel
* fix
* update ThrustCumsumKernel
* fix logcumsumexp in ThrustCumsumKernel
PR Category
Operator Mechanism
PR Types
Performance
Description
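Fixes the performance regression in some models caused by the precision fix in #74081: https://console.cloud.baidu-int.com/devops/icafe/issue/DLTP-92332/show
The fix consists of:
* ThrustCumsumKernel fast path
* Add fp16 and bf16 type support to ThrustCumsumKernel
Earlier testing misjudged the precision of the Thrust library. In new tests on the edge case of a very large 1D tensor (that is, a single giant row), Thrust performs perfectly, whereas BlockScanKernel runs with grid_size == 1, degenerates to serial execution, and becomes significantly slower.
For element counts from 200 thousand to 2 billion, the precision (relative to torch) and speed of the paddle.cumsum API were compared between the BlockScanKernel branch and the ThrustCumsumKernel branch.
The results show that for 1D tensors, the Thrust library is significantly better than the current BlockScanKernel implementation in both precision and speed. The current BlockScanKernel implementation is designed mainly for multi-row data, with each Block processing a different data row in parallel.
Pcard-85711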
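To make the grid_size == 1 case concrete, here is a minimal Python sketch of how a one-block-per-row scan layout sees different input shapes. It is an illustration under stated assumptions, not code from this PR: the helper names scan_launch_shape and pick_cumsum_path are hypothetical, and the rule "use Thrust when there is exactly one row" is only a guess at the kind of dispatch condition described above.

```python
from math import prod


def scan_launch_shape(shape, axis):
    """Return (num_rows, scan_len) for a cumulative sum along `axis`.

    Assumption: the kernel assigns one thread block per independent row,
    so the grid size equals num_rows.
    """
    axis = axis % len(shape)  # normalize negative axes
    scan_len = shape[axis]
    # Every dimension other than the scanned one contributes independent rows.
    num_rows = prod(d for i, d in enumerate(shape) if i != axis)
    return num_rows, scan_len


def pick_cumsum_path(shape, axis):
    """Hypothetical dispatch rule (not taken from the PR diff)."""
    num_rows, _ = scan_launch_shape(shape, axis)
    # A single row would leave a row-per-block kernel with one active block.
    return "thrust" if num_rows == 1 else "block_scan"


if __name__ == "__main__":
    for shape, axis in [((2_000_000_000,), 0), ((1024, 4096), -1)]:
        rows, length = scan_launch_shape(shape, axis)
        print(shape, "rows:", rows, "scan_len:", length,
              "path:", pick_cumsum_path(shape, axis))
```

Under this model a 1D tensor of 2 billion elements yields a single row, so a row-per-block kernel has only one block to launch, while a (1024, 4096) tensor scanned along its last axis yields 1024 independent rows that can run in parallel.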