This is the official repository for our ICML'25 paper "Scaling Laws for Upcycling Mixture-of-Experts Language Models". It contains the code and data needed to reproduce the analyses in the paper.
- data: contains the data obtained from our scaling law experiments.
  - data/result_8x.txt: results for training a Mixtral-like MoE from scratch.
  - data/result.txt: results for training a dense LLM from scratch.
  - data/result_upcycle_8x_topk_2.txt: results for upcycling a Mixtral-like MoE.
  - data/sparsity.csv: experimental data for fitting the sparsity-active parameter scaling law.
  - data/ablate*: results for various ablation studies.
- analysis.ipynb: contains an example of fitting the joint scaling law for Mixtral-like MoE.
- analyze_sparsity.ipynb: contains an example of fitting the sparsity-active parameter scaling law.
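
For readers new to scaling-law fitting, the snippet below is a minimal sketch of the general idea, not the procedure used in the notebooks. It assumes a simplified single-variable power law `loss = a * tokens**(-alpha)`, which is a stand-in for, and not the same as, the joint law studied in the paper. The data is synthetic and the helper `fit_power_law` is hypothetical; it does not read any file from `data/`.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by ordinary least squares in log-log space.

    Hypothetical helper for illustration only; returns (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Closed-form slope and intercept of the line log y = intercept + slope * log x.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from assumed parameters (not paper results).
true_a, true_alpha = 20.0, 0.1
tokens = [1e9, 2e9, 4e9, 8e9, 1.6e10]
losses = [true_a * d ** (-true_alpha) for d in tokens]

a_hat, alpha_hat = fit_power_law(tokens, losses)
print(f"a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")
```

A log-log linear fit only works for a pure power law. Forms with an irreducible-loss offset or with interaction terms between variables generally need a nonlinear optimizer instead.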
This implementation is licensed under the Apache License 2.0.
If you find this work helpful, please consider citing our paper:
@inproceedings{liew2025scaling,
title = {Scaling Laws for Upcycling Mixture-of-Experts Language Models},
booktitle = {Forty-Second International Conference on Machine Learning},
author = {Liew, Seng Pei and Kato, Takuya and Takase, Sho},
year = {2025}
}