I want to run 12 experts per GPU and select the top 4.
When I set parallel_type == 1, I see all-to-all (a2a) in the timeline.
When I set parallel_type == 0, I see allgather in the timeline.
I only want data-parallel MoE, with each GPU keeping its own experts.
import torch
from tutel import moe as tutel_moe

self.ff_out = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 4},  # top-4 gating
    model_dim=512,
    experts={
        'type': 'ffn',
        'num_experts_per_device': 12,   # 12 experts on each GPU
        'hidden_size_per_expert': 2048,
        'activation_fn': lambda x: torch.nn.functional.relu(x),
    },
    parallel_type='data',
    # mark expert parameters so they are excluded from gradient all-reduce
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)
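
For reference, here is a minimal sketch of how I consume the skip_allreduce flag on the training side, assuming a manual gradient all-reduce loop instead of DDP (the helper name allreduce_shared_grads is just illustrative):

import torch
import torch.distributed as dist

def allreduce_shared_grads(model):
    # Average gradients of the shared (non-expert) parameters only.
    # Parameters tagged by scan_expert_func carry `skip_allreduce`
    # and stay local to each GPU, keeping the MoE purely data-parallel.
    world_size = dist.get_world_size()
    for name, param in model.named_parameters():
        if getattr(param, 'skip_allreduce', False):
            continue  # local expert weights: no cross-GPU synchronization
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)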