[CPU] Enable barrier op upon gloo #34671
562e071 to 0c1fdde
0c1fdde to defd61f
gongweibao
left a comment
Coverage did not pass
python/paddle/fluid/tests/unittests/test_collective_cpu_barrier_with_gloo.py
gongweibao
left a comment
Coverage Failed
…unction from Python.
bd8cd78 to 268163c
gongweibao
left a comment
LGTM
…add context release.
…rt its init status.
… add_gloo_c_barrier_op_cpu
a736ebb
gongweibao
left a comment
LGTM
XieYunshen
left a comment
LGTM
gongweibao
left a comment
LGTM
zhiqiu
left a comment
LGTM for __init__.py
gongweibao
left a comment
LGTM
PR types
New features
PR changes
APIs
Describe
Background:
The barrier operation provided by some communication libraries (e.g. HCCL on Ascend clusters) has not been stable, so a CPU barrier is required as a fallback;
A CPU barrier op was developed previously but was never verified;
When thousands of machines need to synchronize, calling the barrier function directly performs better than executing it as an op inside a program;
Implementation:
Reuse the existing barrier op implemented for CPU and add a CPU-only initialization flow for the parallel environment;
Both a function-level call and an op-level call are supported;
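For readers unfamiliar with barrier semantics, the behavior that a function-level barrier call is expected to provide can be sketched as follows. This is only an illustration using Python's standard `threading.Barrier` within a single process; it is not the gloo-based implementation in this PR, and the names `run_workers` and `worker` are hypothetical.

```python
import threading


def run_workers(num_workers):
    """Run num_workers threads that all synchronize at one CPU barrier.

    Hypothetical stand-in for ranks in a distributed job: each "rank"
    records an event, waits at the barrier, then records a second event.
    """
    barrier = threading.Barrier(num_workers)
    events = []
    lock = threading.Lock()

    def worker(rank):
        with lock:
            events.append(("before", rank))
        # Blocks until every worker has reached this point
        # (raises BrokenBarrierError if the timeout expires).
        barrier.wait(timeout=10)
        with lock:
            events.append(("after", rank))

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for kind, rank in run_workers(4):
        print(kind, rank)
```

No worker records its "after" event until every worker has recorded "before", which is the guarantee a barrier gives across ranks regardless of whether it is invoked as a direct function call or as an op.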
Unittest:
test_collective_cpu_barrier_with_gloo.py