
NCCL reduce scatter, all gather #2727

Merged
awni merged 16 commits into ml-explore:main from nastya236:nccl-reduce-scatter-all-gather
Nov 5, 2025


@nastya236
Collaborator

Proposed changes

  1. Added NCCL all_gather and reduce_scatter.
  2. Made the majority of tests shared across all backends:
  • NCCL doesn’t support some dtypes, so there’s an extra all_reduce test for ring/MPI. The following tests are now shared across all backends: test_all_reduce, test_average_gradients, test_donation, test_shard_linear, test_all_gather.
    In test_shard_linear, the quantized variant runs only when CUDA is not available, since we don’t have quantized matmuls on CUDA yet.

Test: mlx.launch -n 8 mlx/python/tests/nccl_test_distributed.py
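
For readers unfamiliar with the two collectives added here, the following is a minimal pure-Python sketch of their semantics. It simulates ranks as a list of lists and does not call MLX or NCCL; the function names `all_gather` and `reduce_scatter_sum` and the list-based layout are assumptions made for illustration only, not the signatures introduced by this PR.

```python
def all_gather(per_rank):
    """Simulate all_gather: every rank receives the concatenation of all ranks' inputs."""
    gathered = [x for chunk in per_rank for x in chunk]
    return [list(gathered) for _ in per_rank]


def reduce_scatter_sum(per_rank):
    """Simulate a sum reduce-scatter.

    Inputs are summed elementwise across ranks, then rank i keeps
    only the i-th equal-sized slice of the summed result.
    """
    world = len(per_rank)
    n = len(per_rank[0])
    if n % world != 0:
        raise ValueError("input length must be divisible by the number of ranks")
    summed = [sum(vals) for vals in zip(*per_rank)]
    step = n // world
    return [summed[r * step:(r + 1) * step] for r in range(world)]


# Two simulated ranks, four elements each (hypothetical data).
inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
scattered = reduce_scatter_sum(inputs)   # each rank holds one slice of the sum
regathered = all_gather(scattered)       # each rank holds the full sum again
```

In this sketch, a reduce-scatter followed by an all-gather leaves every rank with the full elementwise sum, which is the result an all-reduce sum would produce.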

)pbdoc");

m.def(
"reduce_scatter",
Member


I'm wondering about the name. We use all_sum (instead of all_reduce) to indicate it's a sum. Maybe we should use sum_scatter here to be more consistent? Wdyt?

Collaborator Author


Yes, I agree. I think it will be more consistent.

Collaborator Author


I did the same as for AllReduce by adding a reduction op. Let me know if you think it is not needed and a single sum_scatter is enough.
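
The design discussed in this thread (one primitive parameterized by a reduction op, exposed under op-specific names such as sum_scatter) could be sketched roughly as follows. This is a pure-Python simulation under stated assumptions: ranks are modeled as a list of lists, `reduce_scatter` and `max_scatter` are hypothetical names for illustration, and none of this is the MLX implementation.

```python
import operator
from functools import reduce


def reduce_scatter(per_rank, op):
    """Generic simulated reduce-scatter: combine across ranks with `op`, then slice per rank."""
    world = len(per_rank)
    n = len(per_rank[0])
    if n % world != 0:
        raise ValueError("input length must be divisible by the number of ranks")
    reduced = [reduce(op, vals) for vals in zip(*per_rank)]
    step = n // world
    return [reduced[r * step:(r + 1) * step] for r in range(world)]


def sum_scatter(per_rank):
    """Op-specific wrapper, named in the same style as all_sum."""
    return reduce_scatter(per_rank, operator.add)


def max_scatter(per_rank):
    """Hypothetical second wrapper showing why a reduction-op parameter can be useful."""
    return reduce_scatter(per_rank, max)


data = [[1, 8, 3, 4], [5, 2, 7, 0]]  # two simulated ranks (hypothetical data)
sum_result = sum_scatter(data)
max_result = max_scatter(data)
```

Keeping the reduction op as a parameter of the underlying primitive means additional wrappers can be added later without duplicating the scatter logic, while the public name still states which reduction is applied.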

Member

@awni awni left a comment


Looks great! Will merge when tests clear!

@nastya236
Collaborator Author

I fixed a typo, sorry about that. It should pass now. Thanks for reviewing!

@awni
Member

awni commented Nov 5, 2025

The tests failed since your last push. It looks like it's trying to initialize NCCL on Mac. Can you see the failures?

@nastya236
Collaborator Author

Finally everything is fixed :)

@awni awni merged commit 2777815 into ml-explore:main Nov 5, 2025
7 checks passed

2 participants