Fix paddle.linalg.vector_norm for big tensor #74197

LCStayingdullCircuit · 2025-07-23T10:17:58Z

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

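Fix several issues in the `p_norm` kernel and related APIs

I. Overview

This fix addresses problems in the p_norm kernel in the following four areas:

- Infinity norm (p=inf): the backward gradient distribution strategy is inconsistent with PyTorch.
- L2 norm (p=2): under FP16 precision, the accumulated intermediate result can overflow to inf.
- Various norms: the computation carries a risk of int32 overflow, which can lead to error 700 or incorrect results. (After rebasing, this turned out to have already been changed.)
- Negative norm (p=-1): the backward computation is not aligned with PyTorch, which causes precision issues.

The cause of each problem and its fix are described in detail below.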

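II. Problem Details and Fixes

1. Backward gradient distribution strategy for the infinity norm (p=inf)

Problem:
When computing the backward gradient of the infinity norm, PaddlePaddle and PyTorch distribute the gradient differently if the input tensor contains several elements that tie for the largest absolute value.
- PaddlePaddle (before the fix): assigns a gradient of 1.0 to every element with the largest absolute value.
- PyTorch (alignment target): splits the gradient of 1.0 evenly among all elements with the largest absolute value.
Fix:
p_norm_grad_kernel.cu now follows the amax kernel implementation. That kernel counts the elements with the largest absolute value and splits the gradient evenly among them during the backward pass, which matches PyTorch's behavior.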
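The tie-handling difference described above can be illustrated with a small pure-Python sketch. It is not the kernel code from this PR: the function name `inf_norm_grad`, the `average_ties` flag, and the multiplication by the sign of each element are assumptions made for illustration only.

```python
def inf_norm_grad(x, grad_out=1.0, average_ties=True):
    """Gradient of max(|x_i|) with respect to each x_i, for a flat list of floats.

    average_ties=True splits the upstream gradient evenly among tied maxima;
    average_ties=False gives every tied maximum the full upstream gradient.
    (Hypothetical helper, not part of Paddle or PyTorch.)
    """
    max_abs = max(abs(v) for v in x)
    is_max = [abs(v) == max_abs for v in x]
    count = sum(is_max)  # number of elements tied for the maximum
    share = grad_out / count if average_ties else grad_out

    def sign(v):
        return (v > 0) - (v < 0)

    # Non-maximal elements receive zero gradient.
    return [share * sign(v) if m else 0.0 for v, m in zip(x, is_max)]


averaged = inf_norm_grad([3.0, -3.0, 1.0])
unaveraged = inf_norm_grad([3.0, -3.0, 1.0], average_ties=False)
print(averaged)    # two tied maxima, each gets half of the gradient
print(unaveraged)  # each tied maximum gets the full gradient
```

With averaging, the absolute values of the gradient entries always sum to the upstream gradient, regardless of how many elements tie.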

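2. Overflow of the L2 norm (p=2) under FP16 precision

Problem:
When the L2 norm is computed with the FP16 data type, the accumulation in ReduceAnyKernel and ReduceHigherDimKernel uses an FP32 accumulator, reduce_var, to preserve precision. After the accumulation finishes, however, the result is converted back to FP16 through Ty result = static_cast<Ty>(reduce_var); (Ty is half in this case). If the value of reduce_var exceeds the largest value representable in FP16 (65504), result overflows to inf.
Fix:
Because the Reduce* kernels are general-purpose, the fix is applied at the call site to avoid affecting other modules. In p_norm_kernel.cu, the template arguments used when calling the Reduce* kernels were changed so that the return type Ty is forced to FP32, which avoids the overflowing FP32-to-FP16 conversion. The square root and the final cast are then applied directly to that result, which avoids the downstream problems.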
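The effect of the conversion order can be reproduced with a minimal Python sketch. It uses the standard library's half-precision `struct` format to emulate the FP16 cast; the helper names are hypothetical, and Python's float64 stands in for the FP32 accumulator, so this only illustrates the idea rather than the kernel's arithmetic.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to FP16, mapping out-of-range values to inf."""
    try:
        return struct.unpack("e", struct.pack("e", value))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; a hardware cast yields inf.
        return math.copysign(math.inf, value)


def l2_norm_cast_then_sqrt(x):
    """Sum of squares is cast to FP16 before the square root (pre-fix ordering)."""
    sum_sq = sum(v * v for v in x)  # wide accumulator
    return to_fp16(math.sqrt(to_fp16(sum_sq)))


def l2_norm_sqrt_then_cast(x):
    """Square root is taken in the wide type and only the result is cast."""
    sum_sq = sum(v * v for v in x)
    return to_fp16(math.sqrt(sum_sq))


x = [200.0] * 4  # sum of squares is 160000, above the FP16 maximum of 65504
print(l2_norm_cast_then_sqrt(x))
print(l2_norm_sqrt_then_cast(x))
```

The norm itself (400) fits comfortably in FP16; only the intermediate sum of squares does not, which is why deferring the cast until after the square root is enough here.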

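3. Integer overflow risk across multiple norms

Problem:
In reduce_grad_functions.h, some variables used for index computation or counting are declared as int (32-bit integers). When very large tensors are processed, these variables can overflow, which leads to error 700 or incorrect results.
Fix:
The int variables in reduce_grad_functions.h that are at risk of overflow were uniformly changed to int64_t so that large inputs are computed correctly.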
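A rough sketch of why a 32-bit index breaks on large tensors is given below. Python integers do not overflow, so the helper `wrap_int32` emulates two's-complement wraparound by hand. All function names are hypothetical, and real C++ signed overflow is undefined behavior, so wraparound is only the commonly observed outcome.

```python
def wrap_int32(value):
    """Reinterpret an integer as a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def flat_index_int32(row, num_cols, col):
    """Row-major offset computed with emulated 32-bit arithmetic."""
    return wrap_int32(wrap_int32(row * num_cols) + col)


def flat_index_int64(row, num_cols, col):
    """Row-major offset computed without a 32-bit limit."""
    return row * num_cols + col


# A tensor with 2**30 columns: row 3 already lies beyond the int32 range.
print(flat_index_int32(3, 2**30, 5))  # negative, so the access would be out of bounds
print(flat_index_int64(3, 2**30, 5))  # the intended offset
```

A negative or truncated offset used to address GPU memory is the kind of fault that can surface as an illegal memory access, or as silently wrong values when the wrapped index happens to stay in range.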

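4. Backward logic for the negative norm (p=-1)

Problem:
When p = -1, the backward computation of p_norm is not fully aligned with PyTorch, which produces precision differences in certain scenarios.
Fix:
In PNormGradFunctor in p_norm_grad_kernel.cu, the backward formula in the p < 0 branch was corrected so that it matches PyTorch's implementation logic.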
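For reference, the textbook derivative of the p-norm for nonzero inputs is sign(x_i) * |x_i|^(p-1) * ||x||_p^(1-p), which also holds for negative p. The sketch below checks that formula against a central finite difference for p = -1. It is an assumption-laden illustration with hypothetical helper names, and it makes no claim about the exact expression used in PNormGradFunctor or in PyTorch.

```python
def p_norm(x, p):
    """(sum |x_i|^p)^(1/p) for a flat list of nonzero floats."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def p_norm_grad(x, p, grad_out=1.0):
    """Analytic gradient: sign(x_i) * |x_i|^(p-1) * norm^(1-p)."""
    norm = p_norm(x, p)

    def sign(v):
        return (v > 0) - (v < 0)

    return [grad_out * sign(v) * abs(v) ** (p - 1) * norm ** (1 - p) for v in x]


def numeric_grad(x, p, eps=1e-6):
    """Central finite-difference gradient, used only as a cross-check."""
    grads = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((p_norm(hi, p) - p_norm(lo, p)) / (2 * eps))
    return grads


x = [1.0, -2.0, 4.0]
print(p_norm_grad(x, -1))
print(numeric_grad(x, -1))
```

Comparing an analytic gradient with a finite-difference estimate in this way is a common approach to validating a backward formula before comparing against another framework.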

paddle-bot · 2025-07-23T10:18:04Z

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

wanghuancoder

LGTM

LCStayingdullCircuit · 2025-07-25T02:53:18Z

/re-run all-failed

LCStayingdullCircuit · 2025-07-25T03:19:23Z

/re-run all-failed

paddle-ci-bot · 2025-08-02T02:44:34Z

Sorry to inform you that the CI runs for 8ebf85f passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

LCStayingdullCircuit · 2025-08-04T06:01:27Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-05T07:50:22Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-06T07:42:42Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-08T05:12:17Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-18T02:08:20Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-21T11:15:52Z

/re-run all-failed

LCStayingdullCircuit · 2025-08-25T07:12:09Z

/re-run all-failed

wanghuancoder

LGTM

zyfncg · 2025-08-25T13:21:22Z

paddle/phi/kernels/gpu/p_norm_kernel.cu

 #include "paddle/phi/kernels/funcs/reduce_function.h"
 #include "paddle/phi/kernels/gpu/reduce.h"

+#include "paddle/fluid/framework/tensor_util.h"


Files under the phi directory must not include files from the fluid directory. Is this include redundant?

lshpku previously approved these changes Jul 23, 2025


wanghuancoder previously approved these changes Jul 23, 2025


zyfncg previously approved these changes Jul 25, 2025


leon062112 mentioned this pull request Jul 31, 2025

[BIG tensor] Modify the invalid fp16 cases for paddle.linalg.norm PFCCLab/PaddleAPITest#479

Merged

LCStayingdullCircuit force-pushed the bugfix/vector_norm branch from 8ebf85f to d1f6fee Compare August 4, 2025 11:03

cszdrg mentioned this pull request Aug 11, 2025

Add even gradient splitting to the p_norm kernel backward pass #74524

Closed

cszdrg approved these changes Aug 12, 2025


LCStayingdullCircuit dismissed stale reviews from zyfncg, wanghuancoder, and lshpku via 310aead August 21, 2025 04:11

LCStayingdullCircuit force-pushed the bugfix/vector_norm branch from 310aead to 7d48cb8 Compare August 21, 2025 05:00

lshpku previously approved these changes Aug 21, 2025


LCStayingdullCircuit added 5 commits August 25, 2025 15:28

fix bug:vector_norm test=develop

8be7936

bugfix:p_norm test=develop

2231fa5

bugfix:p_norm test=develop

55c2649

bugfix:p_norm test=develop

7828c52

bugfix:p_norm test=develop

6faabd5

zrr1999 mentioned this pull request Aug 25, 2025

Disable grad assert for test_dygraph(static)_negative(positive)_inf_norm PaddlePaddle/PaddleTest#3132

Merged

wanghuancoder previously approved these changes Aug 25, 2025


zyfncg reviewed Aug 25, 2025


improve

be1019d

zrr1999 dismissed stale reviews from wanghuancoder and lshpku via be1019d August 26, 2025 02:19

zrr1999 force-pushed the bugfix/vector_norm branch from 38cc403 to be1019d Compare August 26, 2025 02:19

wanghuancoder approved these changes Aug 26, 2025


zyfncg approved these changes Aug 26, 2025


swgu98 added skip-ci: static-check skip-ci: all labels Aug 26, 2025

swgu98 merged commit a184716 into PaddlePaddle:develop Aug 26, 2025
136 of 147 checks passed

7 participants