Description
System Info
- `Accelerate` version: 1.11.0
- Platform: Linux-5.15.0-1071-nvidia-x86_64-with-glibc2.39
- `accelerate` bash location: /app/.venv/bin/accelerate
- Python version: 3.12.12
- Numpy version: 2.3.4
- PyTorch version: 2.8.0+cu128
- PyTorch accelerator: CUDA
- System RAM: 2015.56 GB
- GPU type: NVIDIA H200
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
```
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "/app/src/train.py", line 558, in main
    _train_with_accelerate(train_cfg)
  File "/app/src/train.py", line 359, in _train_with_accelerate
    _save_checkpoint(
  File "/app/src/train.py", line 523, in _save_checkpoint
    checkpoint_dir = accelerator.save_state()
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 3562, in save_state
    save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
```
I am training a Llama-style model using FSDP2 and TP, with tp_size=2, dp_shard_size=8, and dp_replicate_size=1, following the ND parallelism guide.
I am using a learning rate scheduler and a single optimizer, all prepared in a single accelerator.prepare() call.
I have noticed that regardless of how I order the arguments to the prepare() call, two optimizers end up registered under accelerator._optimizers, even though there is only a single model.
```
[2025-11-01 15:33:08,112][__main__][INFO] - accelerator._optimizers: [AcceleratedOptimizer (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: True
    initial_lr: 0.0002
    lr: 0.0
    maximize: False
    weight_decay: 0.0
), AcceleratedOptimizer (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: True
    initial_lr: 0.0002
    lr: 0.0
    maximize: False
    weight_decay: 0.0
)]
```
Based on the logic in save_state, this duplicate registration is what causes the error, but I can't see why two optimizers are being registered. I am certain I am only calling prepare() a single time, like so:
```python
model, optimizer, train_loader, eval_loader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_loader, eval_loader, lr_scheduler
)
```

The optimizer and scheduler are constructed beforehand as:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.optimizer.learning_rate,
    weight_decay=cfg.optimizer.weight_decay,
    fused=cfg.optimizer.fused,
)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_training_steps,
)
```
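For what it's worth, here is a stand-in sketch of one way a registry could end up with two entries for one optimizer: if it records both the optimizer passed directly and the one reached via the scheduler's `.optimizer` attribute. This is plain Python with fake classes, a hypothesis about the symptom, not a claim about what accelerator.prepare() actually does internally:

```python
# Stand-in illustration (plain Python, NOT Accelerate internals). A naive
# registry that collects every optimizer reachable from prepare()'s
# arguments would register the same optimizer twice: once directly and
# once through the scheduler that holds a reference to it.

class FakeOptimizer:
    """Minimal stand-in for a torch optimizer."""

class FakeScheduler:
    """Minimal stand-in for an LR scheduler holding its optimizer."""
    def __init__(self, optimizer):
        self.optimizer = optimizer

def naive_register(*objs):
    """Collect every optimizer reachable from the arguments, duplicates included."""
    registry = []
    for obj in objs:
        if isinstance(obj, FakeOptimizer):
            registry.append(obj)
        elif isinstance(obj, FakeScheduler):
            registry.append(obj.optimizer)
    return registry

opt = FakeOptimizer()
sched = FakeScheduler(opt)
registry = naive_register(opt, sched)
print(len(registry))                   # 2 entries registered...
print(len({id(o) for o in registry})) # ...but only 1 distinct optimizer
```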
I'm at a bit of a loss here unfortunately. Thanks in advance!
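As a temporary diagnostic I have been considering deduplicating the registered optimizers by object identity before saving. This is a hedged workaround sketch, not a supported API: `_optimizers` is a private attribute, and the sketch assumes the two entries really are the same underlying object rather than two distinct wrappers:

```python
# Hedged workaround sketch: if the two entries in accelerator._optimizers
# are the same optimizer registered twice, deduplicating by object identity
# before save_state() might sidestep the index error. _optimizers is a
# private attribute, so this is a diagnostic hack, not a supported API.

def dedupe_by_identity(items):
    """Keep only the first occurrence of each object, compared by identity."""
    seen, unique = set(), []
    for obj in items:
        if id(obj) not in seen:
            seen.add(id(obj))
            unique.append(obj)
    return unique

# Hypothetical usage before saving:
# accelerator._optimizers = dedupe_by_identity(accelerator._optimizers)
# accelerator.save_state()
```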
Expected behavior
I expect accelerator.save_state() to succeed.