[Core][Distributed] use cpu/gloo to initialize pynccl #4248
Conversation
zhuohan123 left a comment:
LGTM! Left some small comments.
```python
current_device = torch.cuda.current_device()
try:
    torch.cuda.set_device(device)
    NCCL_CHECK(
        _c_ncclCommInitRank(ctypes.byref(self.comm), self.world_size,
                            self.unique_id, self.rank))
    self.stream = torch.cuda.Stream()
finally:
    torch.cuda.set_device(current_device)
```
Why do we need a try...finally block here? Can the program continue to run when there is an exception in the try block?
It can, but I think it would be better for the function to be pure, i.e. not implicitly modify global state such as the current CUDA device.
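For illustration only, the same save-and-restore idea could be packaged as a context manager so callers cannot forget the restore step (a hypothetical sketch, `_switch_cuda_device` is not part of this PR):

```python
import contextlib
import torch

@contextlib.contextmanager
def _switch_cuda_device(device: int):
    # Remember the caller's current device, switch to the target device for
    # the duration of the block, and always restore it, even on exceptions.
    prev = torch.cuda.current_device()
    torch.cuda.set_device(device)
    try:
        yield
    finally:
        torch.cuda.set_device(prev)
```

Usage would be `with _switch_cuda_device(device): ...`, keeping the device change local to the block.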
NCCL initialization requires broadcasting a unique id, which lives in CPU memory. Previously, we only had one NCCL backend process group, so we had to move the unique id to the GPU, broadcast it, and then move it back to the CPU. After #3904, we always have a CPU/gloo backend, so we no longer need to move the unique id around; we can broadcast it directly in CPU memory.
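For context, a minimal sketch of what broadcasting the unique id over a CPU/gloo group can look like (the function name `broadcast_unique_id` and the pre-initialized `cpu_group` are assumptions for illustration, not the PR's actual code):

```python
import torch
import torch.distributed as dist

NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId's internal byte buffer

def broadcast_unique_id(unique_id: bytes, rank: int,
                        cpu_group: "dist.ProcessGroup") -> bytes:
    # Rank 0 holds the id produced by ncclGetUniqueId; other ranks receive it.
    if rank == 0:
        buf = torch.tensor(list(unique_id), dtype=torch.uint8)
    else:
        buf = torch.empty(NCCL_UNIQUE_ID_BYTES, dtype=torch.uint8)
    # The buffer stays on the CPU: gloo broadcasts host tensors directly,
    # so the id never makes a round trip through GPU memory.
    dist.broadcast(buf, src=0, group=cpu_group)
    return bytes(buf.tolist())
```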