[MoE] Add node limited routing support #2111
base: main
Conversation
torchtitan/models/moe/moe.py
Outdated
```diff
- # token-choice
+ # token-choice with node limited routing support
+ num_groups: int | None = None  # must be a divisor of num_experts
+ top_k_group: int | None = None
```
Suggested change:
```diff
- top_k_group: int | None = None
+ top_k_groups: int | None = None
```
Set the default for the 671B model according to https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/configs/config_671B.json
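For reference, a minimal sketch of what those defaults could look like; the 671B values below are my reading of the linked config, and the `MoEArgs` field names follow this PR's hunk, so treat both as assumptions:

```python
from dataclasses import dataclass

@dataclass
class MoEArgs:
    # Defaults per the linked config_671B.json (values assumed from that file):
    num_experts: int = 256        # "n_routed_experts": 256
    top_k: int = 8                # "n_activated_experts": 8
    num_groups: int | None = 8    # "n_expert_groups": 8
    top_k_group: int | None = 4   # "n_limited_groups": 4
```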
torchtitan/models/moe/moe.py
Outdated
```python
selected_experts_indices = torch.topk(
    scores_for_choice, k=self.top_k, dim=-1, sorted=False
)[1]
# Get actual scores (without bias) for the selected experts
top_scores = scores.gather(1, selected_experts_indices)
```
Let's unify this part with the other two paths (even with no expert bias we can do topk + gather).
Great suggestion! Merged into a unified _node_limited_routing method.
Also changed to a smaller mask to save memory.
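For illustration, a minimal standalone sketch of what such a unified topk + gather path could look like; the function and argument names here are illustrative, not the PR's exact code:

```python
import torch

def get_routing_scores(
    scores: torch.Tensor,              # (num_tokens, num_experts), bias-free router scores
    expert_bias: torch.Tensor | None,  # optional (num_experts,) load-balancing bias
    top_k: int,
    node_limited_routing=None,         # optional fn that masks scores to the allowed groups
) -> tuple[torch.Tensor, torch.Tensor]:
    # The bias (if present) only affects which experts are chosen.
    scores_for_choice = scores if expert_bias is None else scores + expert_bias
    if node_limited_routing is not None:
        scores_for_choice = node_limited_routing(scores_for_choice)
    _, selected_experts_indices = torch.topk(
        scores_for_choice, k=top_k, dim=-1, sorted=False
    )
    # Gating values always come from the original (bias-free) scores.
    top_scores = scores.gather(dim=1, index=selected_experts_indices)
    return selected_experts_indices, top_scores

scores = torch.randn(4, 8).softmax(dim=-1)
indices, gates = get_routing_scores(scores, expert_bias=None, top_k=2)
print(indices.shape, gates.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```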
torchtitan/models/moe/moe.py
Outdated
```python
num_groups: int | None = None  # must be a divisor of num_experts
top_k_group: int | None = None
```
Are the names coming from some repo? I saw that the DeepSeek repo calls them `n_expert_groups` and `n_limited_groups`.
I think we can combine the conventions and call them `num_expert_groups` and `num_limited_groups`. WDYT?
nit: let's put these two fields below `top_k`, which is a more "important" arg.
Sounds great! I changed them to `num_expert_groups` and `num_limited_groups`, whose meaning is clearer from the naming. The previous names are from Hugging Face's implementation.
torchtitan/models/moe/moe.py
Outdated
```python
assert (
    self.top_k_group is not None
), "top_k_group must be set when num_groups is set"
assert (
    self.num_experts % self.num_groups == 0
), f"num_experts ({self.num_experts}) must be divisible by num_groups ({self.num_groups})"
```
Instead of doing assert, let's raise ValueError, since these are more like user errors.
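A sketch of the checks rewritten with ValueError as suggested; the minimal `MoEArgs` subset and the `__post_init__` placement are my assumptions, made for a runnable illustration:

```python
from dataclasses import dataclass

@dataclass
class MoEArgs:  # illustrative subset of the real args class
    num_experts: int = 8
    num_groups: int | None = None
    top_k_group: int | None = None

    def __post_init__(self):
        if self.num_groups is not None:
            if self.top_k_group is None:
                raise ValueError("top_k_group must be set when num_groups is set")
            if self.num_experts % self.num_groups != 0:
                raise ValueError(
                    f"num_experts ({self.num_experts}) must be divisible by "
                    f"num_groups ({self.num_groups})"
                )
```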
```python
selected_experts_indices = torch.topk(
    scores_for_choice, k=self.top_k, dim=-1, sorted=False
)[1]

# NOTE: The expert_bias is only used for routing. The gating value
# top_scores is still derived from the original scores.
top_scores = scores.gather(dim=1, index=selected_experts_indices)

return selected_experts_indices, top_scores
```
This should stay outside the node_limited_routing method. If you worry about naming, you could call it something like _get_node_limited_routing_scores.
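If I read the suggestion right, the helper would only mask the scores and the generic selection would stay in the caller; a rough structural sketch (the helper body is elided, and all of this is my paraphrase of the suggestion, not the PR's code):

```python
def _get_node_limited_routing_scores(self, scores_for_choice: torch.Tensor) -> torch.Tensor:
    """Mask scores_for_choice so only experts in the selected groups survive."""
    ...  # group scoring + masking, as in the hunks below

# Caller keeps the generic top-k + gather path:
scores_for_choice = self._get_node_limited_routing_scores(scores_for_choice)
_, selected_experts_indices = torch.topk(
    scores_for_choice, k=self.top_k, dim=-1, sorted=False
)
top_scores = scores.gather(dim=1, index=selected_experts_indices)
```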
```python
Optionally supports node-limited (group-limited) routing where experts are divided into groups
(e.g., by node), and only num_limited_groups groups are considered before selecting top_k experts.
This reduces cross-node communication in distributed settings.
```
This is not true for NCCL native a2a, and only true with DeepEP?
```python
group_idx = torch.topk(
    group_scores, k=self.num_limited_groups, dim=-1, sorted=False
)[1]
```
Suggested change:
```diff
- group_idx = torch.topk(
-     group_scores, k=self.num_limited_groups, dim=-1, sorted=False
- )[1]
+ _, group_idx = torch.topk(
+     group_scores, k=self.num_limited_groups, dim=-1, sorted=False
+ )
```
For readability: it's easy to forget why we need to slice topk()'s result.
```python
selected_experts_indices = torch.topk(
    scores_for_choice, k=self.top_k, dim=-1, sorted=False
)[1]
```
Suggested change:
```diff
- selected_experts_indices = torch.topk(
-     scores_for_choice, k=self.top_k, dim=-1, sorted=False
- )[1]
+ _, selected_experts_indices = torch.topk(
+     scores_for_choice, k=self.top_k, dim=-1, sorted=False
+ )
```
```python
scores_grouped = scores_for_choice.view(
    -1, self.num_expert_groups, experts_per_group
)
group_scores = scores_grouped.topk(2, dim=-1)[0].sum(dim=-1)
```
Suggested change:
```diff
- group_scores = scores_grouped.topk(2, dim=-1)[0].sum(dim=-1)
+ top2_scores_in_group, _ = scores_grouped.topk(2, dim=-1)
+ group_scores = top2_scores_in_group.sum(dim=-1)
```
wdyt?
As titled, added node-limited routing support via two-layer routing. First, group experts into `num_groups` groups; experts in the same group should reside on the same node to utilize fast intra-node communication. Second, pick the top `top_k_group` groups by the sum of the top-2 expert scores in each group. Third, pick `top_k` experts within the selected `top_k_group` groups.

Reference: https://github.com/huggingface/transformers/blob/4c9fde2a2a3aece0bcf1be93f696e88297da9397/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L212
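To make the three steps concrete, here is a small self-contained sketch of node-limited routing; the function name, the mask-with-`-inf` approach, and the shapes are illustrative assumptions, not the PR's exact code:

```python
import torch

def node_limited_routing(
    scores: torch.Tensor,  # (num_tokens, num_experts) router scores
    num_groups: int,       # experts split into this many node-aligned groups
    top_k_group: int,      # number of groups each token may route to
    top_k: int,            # number of experts selected per token
) -> tuple[torch.Tensor, torch.Tensor]:
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_groups  # assumes >= 2 experts per group
    # Step 2: score each group by the sum of its top-2 expert scores.
    grouped = scores.view(num_tokens, num_groups, experts_per_group)
    top2_scores_in_group, _ = grouped.topk(2, dim=-1)
    group_scores = top2_scores_in_group.sum(dim=-1)
    _, group_idx = torch.topk(group_scores, k=top_k_group, dim=-1, sorted=False)
    # Mask out experts that fall outside the selected groups.
    group_mask = torch.zeros_like(group_scores, dtype=torch.bool)
    group_mask.scatter_(1, group_idx, True)
    keep = group_mask.unsqueeze(-1).expand(-1, -1, experts_per_group)
    masked = scores.masked_fill(
        ~keep.reshape(num_tokens, num_experts), float("-inf")
    )
    # Step 3: pick top_k experts within the selected groups.
    top_scores, selected_experts_indices = torch.topk(
        masked, k=top_k, dim=-1, sorted=False
    )
    return selected_experts_indices, top_scores

# Demo matching the test setup below: 8 experts, 4 groups, keep 2 groups, top-3.
scores = torch.randn(5, 8).softmax(dim=-1)
idx, gates = node_limited_routing(scores, num_groups=4, top_k_group=2, top_k=3)
print(idx.shape, gates.shape)  # torch.Size([5, 3]) torch.Size([5, 3])
```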
Tested on one node using the DeepSeek V3 debug model with the MoE arguments `num_experts=8, num_shared_experts=2, num_groups=4, top_k_group=2, top_k=3`:
