[ROADMAP][Updated on November 4] Megatron Core MoE Q3-Q4 2025 Roadmap #1729

@yanring

Description

The focus for Megatron Core MoE in Q3-Q4 2025 is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.

🎉 This roadmap is based on the dev branch; please see its README for details.

Model Support

  • DeepSeek
    • ✅ DeepSeek-V2
    • ✅ DeepSeek-V3, including MTP
    • 🚧 DeepSeek-V3.2 (WIP)
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • (🚀New!) Qwen3-Next
  • Mixtral
    • ✅ Mixtral-8x7B
    • ✅ Mixtral-8x22B

Core MoE Functionality

  • Token dropless MoE - Advanced routing without token dropping
  • Top-K Router with flexible K selection
  • Load balancing losses to keep expert utilization balanced (see the routing sketch after this list)
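
For context on what the top-k router and load-balancing loss do, here is a minimal standalone PyTorch sketch; it is not the Megatron Core API, and the function name `topk_route` and the GShard/Switch-style auxiliary loss it computes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topk_route(router_logits: torch.Tensor, k: int):
    """Minimal sketch: pick k experts per token and compute an auxiliary
    load-balancing loss (hypothetical helper, not the MCore router).
    router_logits: [num_tokens, num_experts]."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                  # [T, E]
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)       # [T, k] each

    # Fraction of token-to-expert assignments landing on each expert (hard count).
    dispatch = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # [T, E]
    tokens_per_expert = dispatch.mean(dim=0)                  # [E]

    # Mean router probability given to each expert (soft count).
    mean_probs = probs.mean(dim=0)                            # [E]

    # GShard/Switch-style auxiliary loss: minimized when expert load is uniform.
    aux_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
    return topk_probs, topk_idx, aux_loss
```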

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support (a sizing sketch follows this list)
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • (🚀New!) Megatron FSDP with full expert parallel support
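
To make the "3D parallelism integration" concrete, the sketch below does the back-of-the-envelope arithmetic for composing TP, PP, CP, DP, and EP. `parallel_layout` is a hypothetical helper, not a Megatron Core API, and the convention it assumes (EP carved out of the DP x CP dimension for expert layers) is a simplification of the parallel-folding mappings listed above.

```python
def parallel_layout(world_size: int, tp: int, pp: int, ep: int, cp: int = 1):
    """Back-of-the-envelope check of how parallel sizes compose (illustrative only).
    Assumes dense layers use TP x CP x DP x PP and expert parallelism is folded
    into the DP x CP dimension; the real Megatron Core rules are richer."""
    assert world_size % (tp * pp * cp) == 0, "TP * PP * CP must divide the world size"
    dp = world_size // (tp * pp * cp)        # data-parallel size for the dense layers
    assert (dp * cp) % ep == 0, "EP must divide DP * CP"
    expert_dp = (dp * cp) // ep              # data-parallel replicas of each expert shard
    return {"dp": dp, "cp": cp, "ep": ep, "expert_dp": expert_dp}

# Example: 512 GPUs with TP=2, PP=8, CP=1 leave DP=32, so EP=32 gives exactly
# one data-parallel replica per expert shard.
print(parallel_layout(512, tp=2, pp=8, ep=32))
# {'dp': 32, 'cp': 1, 'ep': 32, 'expert_dp': 1}
```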

Optimizations

  • Memory-efficient token permutation (see the permutation sketch after this list)
  • Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
  • GroupedGEMM and Gradient Accumulation Fusion
  • DP/PP/TP/EP Communication Overlapping
  • Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
  • cuDNN fused attention and FlashAttention integration
  • ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
  • (🚀New!) Muon and Layer-wise distributed optimizer
  • (🚀New!) Pipeline-aware fine-grained activation offloading
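
As background for the token-permutation and GroupedGEMM items above, the sketch below shows the basic permute/unpermute step in plain PyTorch. The helper names and the top-1 simplification are assumptions for exposition; the actual MCore path uses fused kernels and also handles top-k duplication.

```python
import torch

def permute_tokens(hidden: torch.Tensor, expert_idx: torch.Tensor, num_experts: int):
    """Sort tokens by assigned expert so each expert's tokens are contiguous,
    which is the layout a grouped GEMM consumes (illustrative, top-1 routing).
    hidden: [T, H]; expert_idx: [T]."""
    sort_order = torch.argsort(expert_idx, stable=True)                    # [T]
    permuted = hidden.index_select(0, sort_order)                          # [T, H], grouped by expert
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)  # per-group row counts
    return permuted, sort_order, tokens_per_expert

def unpermute_tokens(expert_output: torch.Tensor, sort_order: torch.Tensor):
    """Inverse permutation: scatter expert outputs back to the original token order."""
    out = torch.empty_like(expert_output)
    out.index_copy_(0, sort_order, expert_output)
    return out
```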

Precision Support

  • GroupedGEMM including FP8/MXFP8 support
  • FP8 weights with BF16 optimizer states
  • Full FP8 training support

Optimized Expert Parallel Communication Support

  • DeepEP support for H100 and B200
  • (🚀New!) HybridEP for GB200

Developer Experience

  • MoE Model Zoo with pre-training best practices
  • MCore2HF Converter for ecosystem compatibility in megatron-bridge
  • Distributed Checkpointing Support
  • Runtime Upcycling Support for efficient model scaling
  • Layer-wise logging for detailed monitoring

Next Release Roadmap (MCore v0.16)

Performance & Memory Enhancements

  • 🚀 Support placing MTP layers into standalone pipeline stages
  • 🚀 Fused Linear and Cross Entropy operations
  • CUDA graph support for FP8 primary weights

Advanced Functionality

  • 🚀 Enhanced cuda_graph_scope for MoE and Mamba
    1. Finer-grained graph scopes, such as the MoE router and dispatch preprocessing
    2. A minimally intrusive implementation
  • MuonClip support (non-split version)
  • Context parallel support for the eager attention implementation
  • CUDA Graph support with 1F1B EP A2A overlapping
  • Exclude padding tokens from the MoE routing loss calculation
  • Revive FP16 Training
  • Router replay support for RL training
  • Support NVFP4 MoE with proper padding

Communication Optimization

  • HybridEP Kernel Optimizations
  • HybridEP for NVL8+IB

Bug Fix

  • Tokenizer compatibility fixes for the DeepSeek and Qwen HF tokenizers

Ongoing Long-term Features

  • E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
  • Sync-Free and Full-Iter cudaGraph MoE Training
    • Targeting dropless MoE
    • Device initiated HybridEP and GroupedGEMM
    • MoE ECHO Dispatcher
  • CPU Overhead Optimizations for Blackwell Performance
  • MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
  • Dynamic Context Parallel for Imbalanced Long-Sequence Training
  • Megatron FSDP Performance Optimization for MoE Training

Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
