[ROADMAP][Updated on November 4] Megatron Core MoE Q3-Q4 2025 Roadmap #1729

@yanring

Description

The focus for Megatron Core MoE in Q3-Q4 2025 is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.

🎉 This roadmap is based on the dev branch; please see its README for details.

Model Support

  • DeepSeek
    • ✅ DeepSeek-V2
    • ✅ DeepSeek-V3, including MTP
    • 🚧 DeepSeek-V3.2 (WIP)
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • (🚀New!) Qwen3-Next
  • Mixtral
    • ✅ Mixtral-8x7B
    • ✅ Mixtral-8x22B

Core MoE Functionality

  • Token dropless MoE - Advanced routing without token dropping
  • Top-K Router with flexible K selection
  • Load balancing losses to keep expert utilization balanced (see the routing sketch after this list)
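
For context on what the top-k router and load-balancing loss do, here is a minimal standalone PyTorch sketch; it is not the Megatron Core API, and the function name `topk_route` and the GShard/Switch-style auxiliary loss it computes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topk_route(router_logits: torch.Tensor, k: int):
    """Minimal sketch: pick k experts per token and compute an auxiliary
    load-balancing loss (hypothetical helper, not the MCore router).
    router_logits: [num_tokens, num_experts]."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                  # [T, E]
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)       # [T, k] each

    # Fraction of token-to-expert assignments landing on each expert (hard count).
    dispatch = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # [T, E]
    tokens_per_expert = dispatch.mean(dim=0)                  # [E]

    # Mean router probability given to each expert (soft count).
    mean_probs = probs.mean(dim=0)                            # [E]

    # GShard/Switch-style auxiliary loss: minimized when expert load is uniform.
    aux_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
    return topk_probs, topk_idx, aux_loss
```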

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support (a sizing sketch follows this list)
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • (🚀New!) Megatron FSDP with full expert parallel support
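
To make the "3D parallelism integration" concrete, the sketch below does the back-of-the-envelope arithmetic for composing TP, PP, CP, DP, and EP. `parallel_layout` is a hypothetical helper, not a Megatron Core API, and the convention it assumes (EP carved out of the DP x CP dimension for expert layers) is a simplification of the parallel-folding mappings listed above.

```python
def parallel_layout(world_size: int, tp: int, pp: int, ep: int, cp: int = 1):
    """Back-of-the-envelope check of how parallel sizes compose (illustrative only).
    Assumes dense layers use TP x CP x DP x PP and expert parallelism is folded
    into the DP x CP dimension; the real Megatron Core rules are richer."""
    assert world_size % (tp * pp * cp) == 0, "TP * PP * CP must divide the world size"
    dp = world_size // (tp * pp * cp)        # data-parallel size for the dense layers
    assert (dp * cp) % ep == 0, "EP must divide DP * CP"
    expert_dp = (dp * cp) // ep              # data-parallel replicas of each expert shard
    return {"dp": dp, "cp": cp, "ep": ep, "expert_dp": expert_dp}

# Example: 512 GPUs with TP=2, PP=8, CP=1 leave DP=32, so EP=32 gives exactly
# one data-parallel replica per expert shard.
print(parallel_layout(512, tp=2, pp=8, ep=32))
# {'dp': 32, 'cp': 1, 'ep': 32, 'expert_dp': 1}
```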

Optimizations

  • Memory-efficient token permutation (see the permutation sketch after this list)
  • Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
  • GroupedGEMM and Gradient Accumulation Fusion
  • DP/PP/TP/EP Communication Overlapping
  • Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
  • cuDNN fused attention and FlashAttention integration
  • ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
  • (🚀New!) Muon and Layer-wise distributed optimizer
  • (🚀New!) Pipeline-aware fine-grained activation offloading
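
As background for the token-permutation and GroupedGEMM items above, the sketch below shows the basic permute/unpermute step in plain PyTorch. The helper names and the top-1 simplification are assumptions for exposition; the actual MCore path uses fused kernels and also handles top-k duplication.

```python
import torch

def permute_tokens(hidden: torch.Tensor, expert_idx: torch.Tensor, num_experts: int):
    """Sort tokens by assigned expert so each expert's tokens are contiguous,
    which is the layout a grouped GEMM consumes (illustrative, top-1 routing).
    hidden: [T, H]; expert_idx: [T]."""
    sort_order = torch.argsort(expert_idx, stable=True)                    # [T]
    permuted = hidden.index_select(0, sort_order)                          # [T, H], grouped by expert
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts)  # per-group row counts
    return permuted, sort_order, tokens_per_expert

def unpermute_tokens(expert_output: torch.Tensor, sort_order: torch.Tensor):
    """Inverse permutation: scatter expert outputs back to the original token order."""
    out = torch.empty_like(expert_output)
    out.index_copy_(0, sort_order, expert_output)
    return out
```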

Precision Support

  • GroupedGEMM including FP8/MXFP8 support
  • FP8 weights with BF16 optimizer states
  • Full FP8 training support

Optimized Expert Parallel Communication Support

  • DeepEP support for H100 and B200
  • (🚀New!) HybridEP for GB200

Developer Experience

  • MoE Model Zoo with pre-training best practices
  • MCore2HF Converter for ecosystem compatibility in megatron-bridge
  • Distributed Checkpointing Support
  • Runtime Upcycling Support for efficient model scaling
  • Layer-wise logging for detailed monitoring

Next Release Roadmap (MCore v0.16)

Performance & Memory Enhancements

  • 🚀 Support placing MTP layers into standalone pipeline stages
  • 🚀 Fused Linear and Cross Entropy operations
  • CUDA graph support for FP8 primary weights

Advanced Functionality

  • 🚀 Enhanced cuda_graph_scope for MoE and Mamba
    1. Finer-grained graph scopes, such as the MoE router and dispatch preprocessing
    2. A minimally intrusive implementation
  • MuonClip support (non-split version)
  • Context parallel support for the eager attention implementation
  • CUDA Graph support with 1F1B EP A2A overlapping
  • Exclude padding tokens from the MoE routing loss calculation
  • Revive FP16 Training
  • Router replay support for RL training
  • Support NVFP4 MoE with proper padding

Communication Optimization

  • HybridEP Kernel Optimizations
  • HybridEP for NVL8+IB

Bug Fix

  • Tokenizer compatibility fixes for the DeepSeek and Qwen HF tokenizers

Ongoing Long-term Features

  • E2E performance optimization for DeepSeek-V3, Qwen3, and other fine-grained MoEs
  • Sync-Free and Full-Iter cudaGraph MoE Training
    • Targeting dropless MoE
    • Device initiated HybridEP and GroupedGEMM
    • MoE ECHO Dispatcher
  • CPU Overhead Optimizations for Blackwell Performance
  • MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
  • Dynamic Context Parallel for Imbalanced Long-Sequence Training
  • Megatron FSDP Performance Optimization for MoE Training

Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
