[AutoParallel] fix the grad_clip logic of auto_hybrid_pp #74409

zty-king · 2025-08-04T17:20:11Z

PR Category

Auto Parallel

PR Types

Performance

Description

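The GradientClipByGlobalNorm logic in the new dynamic-graph semi-automatic pipeline parallelism (pp) path is currently incorrect and produces wrong results, so it needs to be fixed.
The first version of the fix caused a fairly severe drop in training performance, and it also exposed some important optimization opportunities, so this PR adds the following optimizations:
1. Replace the all_gather + sum operation with all_reduce.
2. Replace the logic that created a new group on every call with get_sub_mesh, and obtain the sub_group from the sub_mesh, so that creating a large number of groups no longer degrades performance.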
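The two optimizations above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions: the collectives are simulated with lists rather than real Paddle calls, and `local_sq_sums`, `get_pp_group`, and the creation counter are hypothetical names that do not appear in the PR.

```python
from functools import lru_cache

# Hypothetical per-stage partial results: each pipeline stage holds the
# sum of squared gradients for its own parameters.
local_sq_sums = [4.0, 9.0, 12.0]


def all_gather(values):
    # Simulated all_gather: every rank receives the full list of values.
    return [list(values) for _ in values]


def all_reduce_sum(values):
    # Simulated all_reduce(SUM): every rank receives only the reduced scalar.
    total = sum(values)
    return [total for _ in values]


# Old approach: gather everything, then sum locally on each rank.
gathered_then_summed = [sum(buf) for buf in all_gather(local_sq_sums)]
# New approach: a single reduction yields the same value on each rank.
reduced = all_reduce_sum(local_sq_sums)

# Counter for how many times the (simulated) group setup actually runs.
group_creations = []


@lru_cache(maxsize=None)
def get_pp_group(ranks):
    # Stand-in for expensive communicator setup; because of the cache it
    # runs once per distinct tuple of ranks, not once per call.
    group_creations.append(ranks)
    return ("pp_group", ranks)


# Simulate 100 optimizer steps that each need the same group.
for _step in range(100):
    group = get_pp_group((0, 1, 2))
```

Both paths give every rank the same total, but the reduction moves one scalar per rank instead of a full buffer. The cached lookup stands in for reusing a group that already belongs to the mesh, rather than building a new one on every clipping call.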

paddle-bot · 2025-08-04T17:20:17Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

codecov-commenter · 2025-08-04T22:01:20Z

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@68835a8). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
python/paddle/nn/clip.py	55.55%	4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (55.55%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #74409   +/-   ##
==========================================
  Coverage           ?   55.55%           
==========================================
  Files              ?        1           
  Lines              ?        9           
  Branches           ?        0           
==========================================
  Hits               ?        5           
  Misses             ?        4           
  Partials           ?        0


zty-king · 2025-08-05T07:09:04Z

/re-run all-failed

zty-king · 2025-08-05T10:39:40Z

Local test results are as follows:

xuxinyi389 · 2025-08-06T13:07:14Z

python/paddle/nn/clip.py

+                    global_norm_var.process_mesh,
+                    global_norm_var.placements,
+                )
+


The is_pp_enable logic does not feel concise enough; it could be changed to the code below:

    # Check for auto hybrid pipeline parallelism and source mesh existence
    if flag_auto_hybrid_pp and src_mesh is not None:
        g_mesh = dist.get_mesh()
        # Check if mesh exists and pipeline parallelism is enabled ("pp" dim size > 1)
        if g_mesh and "pp" in g_mesh.dim_names and g_mesh.get_dim_size("pp") > 1:
            # Get the pipeline parallelism subgroup for communication
            pp_group = g_mesh.get_submesh_with_dim("pp").get_group("pp")
            # Perform all-reduce on the local tensor value across the PP group
            global_norm_var_local = global_norm_var._local_value()
            dist.all_reduce(
                global_norm_var_local,
                op=dist.ReduceOp.SUM,
                group=pp_group,
            )
            # Re-shard the tensor with the reduced value
            global_norm_var = dist.shard_tensor(
                global_norm_var_local,
                global_norm_var.process_mesh,
                global_norm_var.placements,
            )
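To see why the sum across the pp group matters for clipping, here is a small numeric sketch in plain Python. It assumes each pipeline stage holds only the sum of squares for its own gradients, and that the clip scale is `clip_norm / max(global_norm, clip_norm)`. The values and names (`stage_grads`, `clip_coeff`) are hypothetical, and this shows one way the result can go wrong, not necessarily the exact bug fixed in this PR.

```python
import math

# Hypothetical gradients held by two pipeline stages.
stage_grads = [[3.0, 4.0], [12.0]]
clip_norm = 1.0


def clip_coeff(global_norm, clip_norm):
    # Scale factor applied to every gradient; equals 1.0 when no clipping is needed.
    return clip_norm / max(global_norm, clip_norm)


# Each stage computes the sum of squares of its own gradients.
local_sq = [sum(g * g for g in grads) for grads in stage_grads]

# Without a cross-stage reduction, each stage derives a different scale
# from its local norm, so the gradients are scaled inconsistently.
wrong_coeffs = [clip_coeff(math.sqrt(s), clip_norm) for s in local_sq]

# With a SUM reduction across stages, every stage sees the same global norm.
global_norm = math.sqrt(sum(local_sq))
right_coeff = clip_coeff(global_norm, clip_norm)

# Apply the shared scale and check the norm of the clipped gradients.
clipped = [[g * right_coeff for g in grads] for grads in stage_grads]
clipped_norm = math.sqrt(sum(g * g for grads in clipped for g in grads))
```

With the shared scale, the clipped gradients across both stages have a combined norm equal to `clip_norm`, which is the invariant that global-norm clipping is meant to provide.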

zty-king · 2025-08-07T00:15:25Z

/re-run all-failed

xuxinyi389

LGTM

From00

LGTM

fix the grad clip performance

2b8e14e

paddle-bot bot added the contributor External developers label Aug 4, 2025

add test

442fa68

zty-king added 2 commits August 5, 2025 02:06

empty commit to rerun CI

77403f7

modify the note

21fb0c1

xuxinyi389 reviewed Aug 6, 2025

Simplify code logic

bf4a270

xuxinyi389 approved these changes Aug 7, 2025

XieYunshen added the skip-ci: coverage label Aug 8, 2025

From00 approved these changes Aug 8, 2025

xuxinyi389 merged commit 8f77fa2 into PaddlePaddle:develop Aug 11, 2025
83 of 90 checks passed

waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 26, 2025

Revert '[AutoParallel] fix the grad_clip logic of auto_hybrid_pp (Pad…

21e113c

…dlePaddle#74409)'\n This reverts commit 8f77fa2.

waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 26, 2025

Revert '[AutoParallel] fix the grad_clip logic of auto_hybrid_pp (Pad…

3729614

…dlePaddle#74409)'\n This reverts commit 8f77fa2.

waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 28, 2025

Revert "[AutoParallel] fix the grad_clip logic of auto_hybrid_pp (Pad…

720cecd

…dlePaddle#74409)" This reverts commit 8f77fa2.

5 participants