Accelerate save_state() error using FSDP2/TP #3826

@gmintoco

Description

System Info

- `Accelerate` version: 1.11.0
- Platform: Linux-5.15.0-1071-nvidia-x86_64-with-glibc2.39
- `accelerate` bash location: /app/.venv/bin/accelerate
- Python version: 3.12.12
- Numpy version: 2.3.4
- PyTorch version: 2.8.0+cu128
- PyTorch accelerator: CUDA
- System RAM: 2015.56 GB
- GPU type: NVIDIA H200
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "/app/src/train.py", line 558, in main
    _train_with_accelerate(train_cfg)
  File "/app/src/train.py", line 359, in _train_with_accelerate
    _save_checkpoint(
  File "/app/src/train.py", line 523, in _save_checkpoint
    checkpoint_dir = accelerator.save_state()
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 3562, in save_state
    save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)

I am training a Llama-style model using FSDP2 and TP, with tp_size=2, dp_shard_size=8, and dp_replicate_size=1, following the ND parallelism guide.

I am using a learning rate scheduler and a single optimizer, and everything is prepared in a single accelerator.prepare() call.
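
For context, my setup looks roughly like this (paraphrased from memory; treat the exact arguments as an approximation of what the ND parallelism guide suggests, not a verbatim copy of my script):

from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from accelerate.utils import FullyShardedDataParallelPlugin

# Parallelism layout described above: TP=2, FSDP shard=8, no replication
parallelism_config = ParallelismConfig(
    dp_replicate_size=1,
    dp_shard_size=8,
    tp_size=2,
)
fsdp_plugin = FullyShardedDataParallelPlugin(fsdp_version=2)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
    fsdp_plugin=fsdp_plugin,
)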

I have noticed that, regardless of the order in which I pass things to prepare, two optimizers end up registered under accelerator._optimizers, even though only a single model is registered.
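
The output below comes from logging accelerator._optimizers right after the prepare call, roughly like this (and similarly for accelerator._models):

logger.info(f"accelerator._optimizers: {accelerator._optimizers}")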

[2025-11-01 15:33:08,112][__main__][INFO] - accelerator._optimizers: [AcceleratedOptimizer (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: True
    initial_lr: 0.0002
    lr: 0.0
    maximize: False
    weight_decay: 0.0
), AcceleratedOptimizer (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: True
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: True
    initial_lr: 0.0002
    lr: 0.0
    maximize: False
    weight_decay: 0.0
)]
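
For reference, this is my reading of the relevant part of save_state, paraphrased from the frame in the traceback above (accelerator.py line 3562 in 1.11.0) rather than copied verbatim:

# Optimizers are paired with models by index when saving FSDP optimizer state
for i, opt in enumerate(self._optimizers):
    save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
# With len(self._optimizers) == 2 but len(self._models) == 1, i == 1 indexes
# past self._models and the save fails.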

Based on that logic in save_state, the extra registered optimizer is what causes the error, but I can't see why two optimizers are being registered. I am certain I am only calling prepare a single time, like so:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.optimizer.learning_rate,
    weight_decay=cfg.optimizer.weight_decay,
    fused=cfg.optimizer.fused,
)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_training_steps,
)
model, optimizer, train_loader, eval_loader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_loader, eval_loader, lr_scheduler
)

I'm at a bit of a loss here unfortunately. Thanks in advance!

Expected behavior

I expect accelerator.save_state() to succeed and write the checkpoint.
