Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
```
File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>
[rank2]: megatron_sft_main()
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 87, in megatron_sft_main
[rank2]: return MegatronSft(args).main()
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank2]: result = self.run()
[rank2]: ^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 77, in run
[rank2]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 1126, in train
[rank2]: pretrain(
[rank2]: File "/data/home/user/moe-llm/Megatron-LM-core_r0.15.0/megatron/training/training.py", line 666, in pretrain
[rank2]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 504, in setup_model_and_optimizer
[rank2]: model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/moe-llm/Megatron-LM-core_r0.15.0/megatron/training/training.py", line 1094, in setup_model_and_optimizer
[rank2]: model = get_model(model_provider_func, model_type)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/moe-llm/Megatron-LM-core_r0.15.0/megatron/training/training.py", line 885, in get_model
[rank2]: model = build_model()
[rank2]: ^^^^^^^^^^^^^
[rank2]: File "/data/home/user/moe-llm/Megatron-LM-core_r0.15.0/megatron/training/training.py", line 877, in build_model
[rank2]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 479, in new_model_provider_func
[rank2]: self.bridge.load_weights(model, args.model_dir)
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 1425, in load_weights
[rank2]: list(self._convert([mg_model], state_dict, hf_prefix, True, 'Loading: '))
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 1343, in _convert
[rank2]: res = self._set_layer_state(mg_layer, hf_state_dict, f'{self.hf_layers_prefix}.', layer_idx, to_mcore)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 1240, in _set_layer_state
[rank2]: hf_state_dict.update(self._set_layer_mlp(mg_layer, hf_state_dict, layer_idx, to_mcore))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 1224, in _set_layer_mlp
[rank2]: hf_state_dict.update(self._set_moe_state(mg_mlp, hf_state_dict, f'{hf_mlp_prefix}.', layer_idx, to_mcore))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 690, in _set_moe_state
[rank2]: self._set_mlp_state(mg_experts, hf_state_dict, 'experts.', layer_idx, to_mcore, ep_rank=ep_rank))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/swift/megatron/model/gpt_bridge.py", line 728, in _set_mlp_state
[rank2]: if isinstance(mg_mlp.linear_fc1, LoraParallelLinear):
[rank2]: ^^^^^^^^^^^^^^^^^
[rank2]: File "/data/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
[rank2]: raise AttributeError(
[rank2]: AttributeError: 'SequentialMLP' object has no attribute 'linear_fc1'
```
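For context on the failure: with the non-grouped expert path, Megatron-Core builds a `SequentialMLP`, which stores each expert as a separate submodule (`local_experts`) rather than exposing a fused `linear_fc1`, so the bridge's `mg_mlp.linear_fc1` access raises `AttributeError`. A minimal sketch of the two layouts and a guarded lookup (the stub classes and the `fc1_modules` helper are illustrative, not the actual swift/Megatron code):

```python
# Minimal stand-ins for the two Megatron-Core MoE expert layouts
# (illustrative classes, not the real implementations).
class GroupedMLPStub:
    """Grouped-GEMM path (--moe_grouped_gemm true): one fused fc1 for all experts."""
    def __init__(self):
        self.linear_fc1 = "fused fc1 weight"

class ExpertStub:
    """A single expert MLP with its own fc1."""
    def __init__(self):
        self.linear_fc1 = "per-expert fc1 weight"

class SequentialMLPStub:
    """Sequential path (--moe_grouped_gemm false): experts kept in a list;
    there is no module-level linear_fc1 attribute."""
    def __init__(self, num_experts=2):
        self.local_experts = [ExpertStub() for _ in range(num_experts)]

def fc1_modules(mg_mlp):
    """Collect fc1 modules regardless of expert layout; a guarded lookup
    like this would avoid the AttributeError seen in _set_mlp_state."""
    if hasattr(mg_mlp, "linear_fc1"):
        return [mg_mlp.linear_fc1]
    return [expert.linear_fc1 for expert in mg_mlp.local_experts]

assert len(fc1_modules(GroupedMLPStub())) == 1
assert len(fc1_modules(SequentialMLPStub())) == 2
```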
How to Reproduce
Environment:
```
ms_swift 3.12.4
peft 0.18.1
flash_attn 2.8.3+cu12torch28cxx11abitrue
transformers 4.57.6
transformer_engine_torch 2.10.0
megatron-core 0.15.3
```
Run script:
```shell
export MEGATRON_LM_PATH='/data/home/user/Megatron-LM-core_r0.15.0'
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
srun -n 1 megatron sft \
    --model '/data/home/user/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B-Instruct-2507' \
    --check_model false \
    --load_safetensors true \
    --save_safetensors true \
    --merge_lora false \
    --moe_grouped_gemm true \
    --dataset '/data/home/user/data.json' \
    --load_from_cache_file true \
    --no_gradient_accumulation_fusion true \
    --use_precision_aware_optimizer true \
    --optimizer_cpu_offload true \
    --optimizer_offload_fraction 0.7 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --sequence_parallel true \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --packing true \
    --split_dataset_ratio 0.01 \
    --expert_model_parallel_size 4 \
    --moe_permute_fusion true \
    --moe_grouped_gemm false \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 1 \
    --global_batch_size 4 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --max_epochs 1 \
    --save megatron_output/Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --eval_interval 200 \
    --save_interval 200 \
    --vit_gradient_checkpointing true \
    --max_length 3000 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash
```
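Note that the script passes `--moe_grouped_gemm` twice (`true` near the top, `false` further down) and `--sequence_parallel` twice. With argparse-style parsing, the last occurrence of a repeated flag wins, so the effective setting would be `moe_grouped_gemm=false`, which selects the `SequentialMLP` expert path seen in the traceback. Whether swift's CLI resolves duplicates the same way is an assumption; a minimal sketch of the last-wins behavior:

```python
import argparse

# Parse a boolean flag the way many CLIs do: "true"/"false" strings.
parser = argparse.ArgumentParser()
parser.add_argument("--moe_grouped_gemm", type=lambda s: s.lower() == "true")

# The flag appears twice on the command line, as in the script above;
# argparse keeps only the last occurrence.
args = parser.parse_args(
    ["--moe_grouped_gemm", "true", "--moe_grouped_gemm", "false"]
)
assert args.moe_grouped_gemm is False
```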
Additional Information
No response