Skip to content

Conversation

@zty-king
Copy link
Contributor

@zty-king zty-king commented Aug 4, 2025

PR Category

Auto Parallel

PR Types

Performance

Description

  • 当前新的动半pp的GradientClipByGlobalNorm逻辑存在问题,计算结果错误,需要修复。
  • 第一版修复的逻辑导致以下训练性能下降比较严重,同时发现一些重要的可优化点,因此做一些优化
    D99837D14964F74C9B422DE141F996D7
  • 优化如下:
    1. 将all_gather+sum操作 修改为 all_reduce
    1. 将每次调用时创建新的group的逻辑,修改成get_sub_mesh,并根据sub_mesh获取sub_group,从而防止大量group的创建导致性能下降。

@paddle-bot
Copy link

paddle-bot bot commented Aug 4, 2025

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Aug 4, 2025
@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2025

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@68835a8). Learn more about missing BASE report.

Files with missing lines Patch % Lines
python/paddle/nn/clip.py 55.55% 4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (55.55%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #74409   +/-   ##
==========================================
  Coverage           ?   55.55%           
==========================================
  Files              ?        1           
  Lines              ?        9           
  Branches           ?        0           
==========================================
  Hits               ?        5           
  Misses             ?        4           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zty-king
Copy link
Contributor Author

zty-king commented Aug 5, 2025

/re-run all-failed

@zty-king
Copy link
Contributor Author

zty-king commented Aug 5, 2025

本地测试如下:
image
image

global_norm_var.process_mesh,
global_norm_var.placements,
)

Copy link
Contributor

@xuxinyi389 xuxinyi389 Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感觉 is_pp_enable 逻辑不够简洁,可以改成下方代码:

# Check for auto hybrid pipeline parallelism and source mesh existence
if flag_auto_hybrid_pp and src_mesh is not None:
    g_mesh = dist.get_mesh()
    
    # Check if mesh exists and pipeline parallelism is enabled ("pp" dim size > 1)
    if g_mesh and "pp" in g_mesh.dim_names and g_mesh.get_dim_size("pp") > 1:
        
        # Get the pipeline parallelism subgroup for communication
        pp_group = g_mesh.get_submesh_with_dim("pp").get_group("pp")
        
        # Perform all-reduce on the local tensor value across the PP group
        global_norm_var_local = global_norm_var._local_value()
        dist.all_reduce(
            global_norm_var_local,
            op=dist.ReduceOp.SUM,
            group=pp_group,
        )
        
        # Re-shard the tensor with the reduced value
        global_norm_var = dist.shard_tensor(
            global_norm_var_local,
            global_norm_var.process_mesh,
            global_norm_var.placements,
        )

@zty-king
Copy link
Contributor Author

zty-king commented Aug 7, 2025

/re-run all-failed

Copy link
Contributor

@xuxinyi389 xuxinyi389 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@From00 From00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xuxinyi389 xuxinyi389 merged commit 8f77fa2 into PaddlePaddle:develop Aug 11, 2025
83 of 90 checks passed
Enigmatisms pushed a commit to Enigmatisms/Paddle that referenced this pull request Aug 11, 2025
…e#74409)

* fix the grad clip performance

* add test

* empty commit to rerun CI

* modify the note

* Simplify code logic
maxiaolong001 pushed a commit to maxiaolong001/Paddle that referenced this pull request Aug 12, 2025
…e#74409)

* fix the grad clip performance

* add test

* empty commit to rerun CI

* modify the note

* Simplify code logic
waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 26, 2025
waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 26, 2025
waliwali777 added a commit to waliwali777/Paddle2 that referenced this pull request Aug 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants