Skip to content

Conversation

@andoorve
Copy link
Collaborator

@andoorve andoorve commented May 24, 2024

Helpers for #4412

cc: @youkaichao @simon-mo

Copy link
Collaborator Author

@andoorve andoorve May 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this function isn't even necessary and we can change the defaults for broadcast_tensor_dict to be TP group broadcast but adding this in to not break the assumption that broadcast_tensor_dict uses World group by default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also return a dict to use as kwargs for the broadcast_tensor_dict if this is less ugly:

broadcast_tensor_dict(**get_tensor_model_parallel_src_rank_and_group())

@youkaichao youkaichao self-requested a review May 29, 2024 00:28
@youkaichao
Copy link
Member

Sorry for the late review. I feel the modification is somewhat intrusive. I plan to refactor the distributed related code in a more OOP-style, and every operation only needs to specify a relative rank, following the RFC #3587 .

e.g. the broadcast tensor dict will be get_tp().broadcast_tensor_dict(src=0), and get_tp() will return a tensor parallel coordinator, which knows what does src=0 mean.

@andoorve
Copy link
Collaborator Author

andoorve commented Jun 3, 2024

What would you recommend in this case @youkaichao? As this is blocking PP, is it possible to have this merged as it does not affect performance or functionality and then have the refactor you suggest above done at a later time?

@youkaichao
Copy link
Member

I would recommend adding these helpers after the refactor. There are too many "helper functions" in distributed part, which could been organized in a better way. Let's keep code quality before things get out of control.

My ETA is 1~2 weeks, I originally planned to do the refactor, but was interrupted by the nccl stuff which has a higher priority.

@andoorve
Copy link
Collaborator Author

andoorve commented Jun 3, 2024

Got it. In that case, to confirm, can we can say we are shelving PP (#4412) until the refactor is done?

@youkaichao
Copy link
Member

Got it. In that case, to confirm, can we can say we are shelving PP (#4412) until the refactor is done?

Yes.

    Signed-off-by: Muralidhar Andoorveedu <[email protected]>

Signed-off-by: Muralidhar Andoorveedu <[email protected]>
@andoorve andoorve marked this pull request as draft June 4, 2024 22:59
@andoorve andoorve mentioned this pull request Jun 7, 2024
16 tasks
@andoorve andoorve closed this Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants