The focus for Megatron Core MoE in Q3-Q4 2025 is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
🎉 This roadmap is based on the dev branch; see its README for details.
Model Support
- ✅ DeepSeek
  - ✅ DeepSeek-V2
  - ✅ DeepSeek-V3, including MTP
  - 🚧 DeepSeek-V3.2 (work in progress)
- ✅ Qwen
  - ✅ Qwen2-57B-A14B
  - ✅ Qwen3-235B-A22B
  - ✅ (🚀New!) Qwen3-Next
- ✅ Mixtral
  - ✅ Mixtral-8x7B
  - ✅ Mixtral-8x22B
Core MoE Functionality
- ✅ Token dropless MoE - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Auxiliary load-balancing losses to keep expert utilization even (see the routing sketch after this list)
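For illustration, below is a minimal PyTorch sketch of top-K routing with a Switch/GShard-style auxiliary load-balancing loss. It is not the Megatron Core router implementation; the class name `TopKRouter`, the default `aux_loss_coeff`, and the tensor shapes are illustrative assumptions.

```python
# Minimal sketch of top-K routing with an auxiliary load-balancing loss.
# Illustrative only -- NOT the Megatron Core implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    def __init__(self, hidden_size, num_experts, top_k, aux_loss_coeff=1e-2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_coeff = aux_loss_coeff

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, hidden_size]
        logits = self.gate(hidden_states)                    # [T, E]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_ids = probs.topk(self.top_k, dim=-1)

        # Auxiliary loss: penalize the dot product of the per-expert token
        # fraction and the per-expert mean routing probability, which is
        # minimized when the expert load is uniform.
        dispatch_mask = F.one_hot(topk_ids, self.num_experts).sum(dim=1).float()  # [T, E]
        tokens_per_expert = dispatch_mask.mean(dim=0)        # fraction of tokens per expert
        mean_probs = probs.mean(dim=0)                       # mean router prob per expert
        aux_loss = self.aux_loss_coeff * self.num_experts * (tokens_per_expert * mean_probs).sum()

        return topk_probs, topk_ids, aux_loss


# Usage with made-up sizes.
router = TopKRouter(hidden_size=16, num_experts=8, top_k=2)
weights, expert_ids, aux_loss = router(torch.randn(32, 16))
print(weights.shape, expert_ids.shape, aux_loss.item())
```

The auxiliary loss is added to the language-model loss with a configurable coefficient; dropless training then dispatches every selected token instead of capping per-expert capacity.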
Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long sequence MoE training
- ✅ Parallel Folding: heterogeneous parallelism mappings for efficient large-scale MoE training (see the layout sketch after this list)
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (🚀New!) Megatron FSDP with full expert parallel support
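As a rough illustration of one common way EP composes with the other parallelism dimensions, the sketch below factors a GPU world size into TP x PP x DP for dense layers and folds the resulting data-parallel ranks into EP x expert-DP for the expert layers. It is bookkeeping only and is not Megatron Core's rank-mapping code (that lives in `megatron.core.parallel_state`, and with parallel folding and a separate expert TP size the real mapping is more flexible); the helper name and example sizes are made up.

```python
# Bookkeeping sketch of an MoE parallelism layout (hypothetical helper).
# Dense/attention layers are sharded over TP x PP x DP; expert layers fold
# the same data-parallel ranks into EP x expert-DP.
def moe_parallel_layout(world_size, tp, pp, ep):
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    dp = world_size // (tp * pp)        # data-parallel size for dense layers
    assert dp % ep == 0, "EP must divide the data-parallel size"
    expert_dp = dp // ep                # data-parallel replicas of each expert shard
    return {"tp": tp, "pp": pp, "dp": dp, "ep": ep, "expert_dp": expert_dp}


# Example: 512 GPUs with TP=2, PP=8, EP=32 (hypothetical numbers).
print(moe_parallel_layout(512, tp=2, pp=8, ep=32))
# {'tp': 2, 'pp': 8, 'dp': 32, 'ep': 32, 'expert_dp': 1}
```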
Optimizations
- ✅ Memory-efficient token permutation (see the permutation sketch after this list)
- ✅ Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and Gradient Accumulation Fusion
- ✅ DP/PP/TP/EP Communication Overlapping
- ✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused attention and FlashAttention integration
- ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
- ✅ (🚀New!) Muon and Layer-wise distributed optimizer
- ✅ (🚀New!) Pipeline-aware fine-grained activation offloading
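The sketch below shows the basic idea behind token permutation for grouped expert computation: sort tokens by their routed expert so each expert consumes a contiguous slice (as a grouped GEMM expects), then restore the original order afterwards. It is a plain PyTorch illustration, not the fused, memory-efficient kernels referenced above; the helper names are hypothetical and top-1 routing is assumed for brevity.

```python
# Sketch of token permutation/unpermutation around a grouped expert GEMM.
# Plain PyTorch for illustration; the fused kernels in Megatron Core differ.
import torch


def permute(tokens, expert_ids, num_experts):
    # tokens: [T, H]; expert_ids: [T] (top-1 routing for simplicity)
    order = torch.argsort(expert_ids, stable=True)             # group tokens by expert
    permuted = tokens.index_select(0, order)
    tokens_per_expert = torch.bincount(expert_ids, minlength=num_experts)
    return permuted, order, tokens_per_expert


def unpermute(permuted_out, order):
    restored = torch.empty_like(permuted_out)
    restored[order] = permuted_out                             # invert the sort
    return restored


tokens = torch.randn(8, 4)
expert_ids = torch.tensor([2, 0, 1, 2, 0, 3, 1, 0])
permuted, order, counts = permute(tokens, expert_ids, num_experts=4)
# ...each expert would consume its contiguous slice of `permuted`,
# split according to `counts`, via a grouped GEMM...
assert torch.equal(unpermute(permuted, order), tokens)
print(counts.tolist())  # tokens per expert: [3, 2, 2, 1]
```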
Precision Support
- ✅ GroupedGEMM including FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states
- ✅ Full FP8 training support
Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200 (see the dispatch sketch after this list)
- ✅ (🚀New!) HybridEP for GB200
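For context, the sketch below expresses the token all-to-all exchange that an expert-parallel dispatcher performs, using stock `torch.distributed` collectives. It is not the DeepEP or HybridEP API (those are custom NVLink/RDMA kernels); the helper name `dispatch_tokens` is hypothetical, `send_counts` must have one entry per expert-parallel rank, and an initialized NCCL process group (e.g., launched via `torchrun`) is assumed.

```python
# Sketch of the expert-parallel token exchange using torch.distributed.
# Illustration only; optimized dispatchers (DeepEP, HybridEP) use custom kernels.
import torch
import torch.distributed as dist


def dispatch_tokens(local_tokens, send_counts):
    """local_tokens: [sum(send_counts), H], already sorted by destination EP rank;
    send_counts: number of tokens to send to each EP rank (len == EP group size)."""
    device = local_tokens.device

    # Exchange per-rank token counts so each rank can size its receive buffer.
    send_counts_t = torch.tensor(send_counts, dtype=torch.long, device=device)
    recv_counts_t = torch.empty_like(send_counts_t)
    dist.all_to_all_single(recv_counts_t, send_counts_t)

    # Variable-sized all-to-all of the token hidden states.
    recv_buf = local_tokens.new_empty(int(recv_counts_t.sum()), local_tokens.size(1))
    dist.all_to_all_single(
        recv_buf,
        local_tokens,
        output_split_sizes=recv_counts_t.tolist(),
        input_split_sizes=list(send_counts),
    )
    return recv_buf, recv_counts_t
```

After the expert MLPs process `recv_buf`, a mirrored all-to-all (the combine step) returns the results to each token's source rank.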
Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore2HF Converter for ecosystem compatibility in megatron-bridge
- ✅ Distributed Checkpointing Support
- ✅ Runtime Upcycling Support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
Next Release Roadmap (MCore v0.16)
Performance & Memory Enhancements
- 🚀 Support placing MTP layers into standalone pipeline stages
- 🚀 Fused linear and cross-entropy operations
- CUDA graph support for FP8 primary weights
Advanced Functionality
- 🚀 Enhanced cuda_graph_scope for MoE and Mamba
  1. More fine-grained graph scope, such as MoE router and dispatch preprocessing
  2. A minimally intrusive implementation
- MuonClip support (non-split version)
- Add context parallel support to the eager attention implementation
- CUDA Graph support with 1F1B EP A2A overlapping
- Exclude padding tokens from the MoE routing loss calculation
- Revive FP16 Training
- Router replay support for RL training
- Support NVFP4 MoE with proper padding
Communication Optimization
- HybridEP Kernel Optimizations
- HybridEP for NVL8+IB
Bug Fix
- Tokenizer compatibility fixes for the DeepSeek and Qwen HF tokenizers
Ongoing Long-term Features
- E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
- Sync-free and full-iteration CUDA graph MoE training
  - Targeting dropless MoE
- Device initiated HybridEP and GroupedGEMM
- MoE ECHO Dispatcher
- CPU Overhead Optimizations for Blackwell Performance
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- Dynamic Context Parallel for Imbalanced Long-Sequence Training
- Megatron FSDP Performance Optimization for MoE Training
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
This roadmap reflects the collective efforts of NVIDIA and our collaborators
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution