This is the official repository for the paper **AdaMuon: Adaptive Muon Optimizer**.

AdaMuon is an effective optimizer built on Muon. In our experiments it achieves more than 40% higher training efficiency than AdamW.
This repository contains two projects: the GPT-2 experiments and the open-sourced Megatron-LM code, which we include to facilitate large-scale experiments.
To use AdaMuon in your own training pipeline on other architectures and datasets, use the following pseudocode as an example:

```python
from opt_config import configure_optimizers

# Model
model = Model()

# Optimizer
optimizer = configure_optimizers(model.parameters(), weight_decay=0.1, learning_rate=6e-4)

# Training
for epoch in range(epochs):
    for X, Y in data_loader:
        # standard training step
        optimizer.zero_grad()
        logits, loss = model(X, Y)
        loss.backward()
        optimizer.step()
        # ...
```

This repository is licensed under the Apache 2.0 license. See the LICENSE file for more details.
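As a rough illustration of what a `configure_optimizers`-style helper typically does for Muon-family optimizers, the sketch below routes 2-D weight matrices to the Muon-style update and everything else (biases, norms) to AdamW. This is an assumption about the helper's internals based on common Muon practice, not the actual implementation; the function name `split_param_groups` and the group keys are hypothetical.

```python
def split_param_groups(named_params, weight_decay, learning_rate):
    """Hypothetical sketch: partition parameters into Muon-style and AdamW groups.

    `named_params` is an iterable of (name, param) pairs, where each param
    exposes an `ndim` attribute (as PyTorch tensors do).
    """
    muon_params, adamw_params = [], []
    for name, p in named_params:
        # Muon-family optimizers operate on 2-D weight matrices;
        # 1-D tensors (biases, layer-norm weights) fall back to AdamW.
        (muon_params if p.ndim == 2 else adamw_params).append(p)
    return [
        {"params": muon_params, "use_muon": True,
         "lr": learning_rate, "weight_decay": weight_decay},
        {"params": adamw_params, "use_muon": False,
         "lr": learning_rate, "weight_decay": weight_decay},
    ]
```

See `opt_config.py` for the actual grouping logic used in this repository.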
If you find this work useful, please cite:

```bibtex
@article{si2025adamuon,
  title={AdaMuon: Adaptive Muon Optimizer},
  author={Si, Chongjie and Zhang, Debing and Shen, Wei},
  journal={arXiv preprint arXiv:2507.11005},
  year={2025}
}
```

If you have any questions, please raise an issue or contact us at [email protected].




