Fix paddle.linalg.vector_norm for big tensor #74197
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
wanghuancoder left a comment:
LGTM
/re-run all-failed
/re-run all-failed
Sorry to inform you that 8ebf85f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
/re-run all-failed
Force-pushed from 8ebf85f to d1f6fee
/re-run all-failed
2 similar comments:
/re-run all-failed
/re-run all-failed
/re-run all-failed
Force-pushed from 310aead to 7d48cb8
/re-run all-failed
1 similar comment:
/re-run all-failed
wanghuancoder left a comment:
LGTM
Review comment on the following includes:
  #include "paddle/phi/kernels/funcs/reduce_function.h"
  #include "paddle/phi/kernels/gpu/reduce.h"
  #include "paddle/fluid/framework/tensor_util.h"
Files under the phi directory must not include files from the fluid directory. Is this include redundant?
Removed.
Force-pushed from 38cc403 to be1019d
PR Category: Execute Infrastructure
PR Types: Bug fixes
Description
Fix multiple issues in the p_norm kernel and related APIs

I. Overview

This PR targets four problems in the p_norm kernel:
1. The backward gradient of the infinity norm (p = inf) is distributed differently from PyTorch when several elements tie for the maximum absolute value.
2. Under FP16, accumulating intermediate results can overflow to inf.
3. A risk of int32 overflow, which can cause error 700 or precision errors (after rebasing, this was found to have been fixed already).
4. The backward logic of the negative norm (p = -1) is not fully aligned with PyTorch.

The cause and fix for each problem are detailed below.
II. Problem details and fixes

1. Gradient distribution for the infinity norm (p = inf)

Problem:
When computing the backward gradient of the infinity norm, PaddlePaddle and PyTorch disagree whenever several elements of the input tensor share the maximum absolute value: PaddlePaddle assigned a gradient of 1.0 to every element whose absolute value equals the maximum, whereas PyTorch splits the 1.0 evenly across all of those elements.

Fix:
In p_norm_grad_kernel.cu, the implementation now follows the amax kernel: it counts the elements whose absolute value equals the maximum and splits the gradient evenly among them during backpropagation, matching PyTorch's behavior.

2. FP16 overflow in the L2 norm (p = 2)
Problem:
When computing the L2 norm in FP16, the accumulation in ReduceAnyKernel and ReduceHigherDimKernel uses an FP32 accumulator reduce_var to preserve precision. After the reduction, however, the result is converted back to FP16 via Ty result = static_cast<Ty>(reduce_var); (Ty is half here). If reduce_var exceeds the maximum representable FP16 value (65504), result overflows to inf.

Fix:
Because the Reduce* kernels are shared infrastructure, the fix is applied at the call site to avoid affecting other modules. In p_norm_kernel.cu, the template arguments of the Reduce* kernel calls are changed to force the return type Ty to FP32, avoiding the overflowing FP32-to-FP16 conversion; the square root is then taken on the FP32 result before the final cast, sidestepping downstream issues.

3. Integer overflow risk across several norms
Problem:
In reduce_grad_functions.h, some variables used for index computation or counting were declared as int (32-bit). When processing very large tensors, these variables can overflow, leading to error 700 or precision errors in the results.

Fix:
The overflow-prone int variables in reduce_grad_functions.h are changed to int64_t, ensuring correct computation on large-scale data.

4. Backward logic for the negative norm (p = -1)
Problem:
When p = -1, the backward computation of p_norm was not fully aligned with PyTorch, causing precision differences in certain scenarios.

Fix:
In the PNormGradFunctor in p_norm_grad_kernel.cu, the backward formula in the p < 0 branch is corrected to align with PyTorch's implementation.