Epoch start freezing regression #21550

@niemiaszek

Bug description

I recently upgraded my environment from torch 2.7.1+cu128, lightning 2.5.2 and Python 3.13 to torch 2.10.0+cu130, lightning 2.6.1 and Python 3.14.

With the old environment and the "auto" strategy on a single GPU, my training started right away with essentially no startup time. Now I have to wait about a minute before an epoch starts. Once it starts, there is a major speedup in the first epoch, with reported it/s up 6% for fp32 and 15% for fp16. In the following epochs, however, the freeze time is counted in the epoch time, which results in a net slowdown and erases any gains from the environment upgrade.

I'm not doing anything custom in my training: I simply call Trainer.fit on a single GPU with an h5-based LightningDataModule.

This is also accompanied by the following warning:

.../.pixi/envs/default/lib/python3.14/site-packages/pytorch_lightning/utilities/_pytree.py:21: FutureWarning:

`isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.

From the other issue, I understand that this warning is not critical here.

What version are you seeing the problem on?

master

Reproduced in studio

No response

How to reproduce the bug

I can't easily share the related training code, but I can share the pixi environments for reproduction.
Old:

[workspace]
channels = ["https://prefix.dev/conda-forge"]
name = "x"
platforms = ["linux-64"]
version = "0.1.0"

[system-requirements]
cuda = "12.0"

[tasks]

[dependencies]
python = "~=3.13.0"
ipykernel = "*"
numpy = "*"
scipy = "*"
pandas = "*"
matplotlib = "*"
ruff = ">=0.12.4,<0.13"
ipympl = ">=0.9.7,<0.10"



[pypi-dependencies]
clearml = "*"
torch = { version = "*", index = "https://download.pytorch.org/whl/cu128" }
torchvision = { version = "*", index = "https://download.pytorch.org/whl/cu128" }
torchaudio = { version = "*", index = "https://download.pytorch.org/whl/cu128" }
neuraloperator = { git = "https://github.com/neuraloperator/neuraloperator.git" }
torch-harmonics = "==0.7.3"
pyroomacoustics = ">=0.8.4, <0.9"
soundfile = ">=0.13.1, <0.14"
lightning = ">=2.5.2, <3"
torchmetrics = ">=1.7.4, <2"
tensorboard = ">=2.20.0, <3"

New:

[workspace]
channels = ["https://prefix.dev/conda-forge"]
name = "x"
platforms = ["linux-64"]
version = "0.1.0"

[system-requirements]
cuda = "13.0"

[tasks]

[dependencies]
python = ">=3.14.3,<3.15"
ipykernel = ">=7.2.0,<8"
numpy = ">=2.4.2,<3"
scipy = ">=1.17.0,<2"
pandas = ">=3.0.1,<4"
matplotlib = ">=3.10.8,<4"
ruff = ">=0.15.2,<0.16"
ipympl = ">=0.10.0,<0.11"



[pypi-dependencies]
clearml = ">=2.1.3, <3"
torch = { version = "*", index = "https://download.pytorch.org/whl/cu130" }
torchvision = { version = "*", index = "https://download.pytorch.org/whl/cu130" }
torchaudio = { version = "*", index = "https://download.pytorch.org/whl/cu130" }
pyroomacoustics = ">=0.9.0, <0.10"
soundfile = ">=0.13.1, <0.14"
lightning = ">=2.6.1, <3"
torchmetrics = ">=1.8.2, <2"
tensorboard = ">=2.20.0, <3"
h5py = ">=3.15.1, <4"

Error messages and logs

# New first epoch start
Epoch 0:   3%|██▌                                                                                                | 44/1688 [00:10<06:16,  4.36it/s, v_num=bf10, train/loss_step=1.020]
# New second epoch start
Epoch 1:   3%|█▋                                                         | 49/1688 [01:07<37:29,  0.73it/s, v_num=bf10, train/loss_step=0.395, val/loss=0.388, train/loss_epoch=0.448]
# Old second epoch start
Epoch 1:   2%|▉                                    | 40/1688 [00:11<08:10,  3.36it/s, v_num=fd5c, train/loss_step=0.414, val/loss=0.384, train/loss_epoch=0.447]
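The size of the stall can be roughly estimated from the progress bars above. The sketch below assumes that the 4.36 it/s shown early in the new first epoch is also the steady-state rate at the start of the second epoch, and that the tqdm elapsed time (01:07 = 67 s) includes the freeze. The function name `estimate_stall_seconds` is made up for illustration, and the displayed rate is smoothed, so the result is only an approximation.

```python
def estimate_stall_seconds(iterations: int, elapsed_s: float, steady_rate: float) -> float:
    """Return the elapsed time that steady-state throughput does not account for."""
    expected_s = iterations / steady_rate  # time the iterations should have taken
    return elapsed_s - expected_s


# Values read from the "New second epoch start" progress bar:
# 49 iterations after 01:07 (67 s); steady rate assumed to be 4.36 it/s.
new_epoch1_stall = estimate_stall_seconds(iterations=49, elapsed_s=67.0, steady_rate=4.36)
print(f"estimated stall at the start of epoch 1: {new_epoch1_stall:.1f} s")
```

Under these assumptions the estimate comes out a little under a minute, which matches the observed wait.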

Environment

Old environment
  • CUDA:
    - GPU:
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - available: True
    - version: 12.8
  • Lightning:
    - lightning: 2.5.2
    - lightning-utilities: 0.14.3
    - pytorch-lightning: 2.5.2
    - tensorly-torch: 0.5.0
    - torch: 2.7.1+cu128
    - torch_harmonics: 0.7.3
    - torchaudio: 2.7.1+cu128
    - torchmetrics: 1.7.4
    - torchvision: 0.22.1+cu128
  • Packages:
    - Cython: 3.1.2
    - GitPython: 3.1.44
    - Jinja2: 3.1.6
    - Markdown: 3.8.2
    - MarkupSafe: 3.0.2
    - PyJWT: 2.10.1
    - PySide6: 6.9.1
    - PyYAML: 6.0.2
    - Pygments: 2.19.2
    - Werkzeug: 3.1.3
    - absl-py: 2.3.1
    - aiohappyeyeballs: 2.6.1
    - aiohttp: 3.12.14
    - aiosignal: 1.4.0
    - annotated-types: 0.7.0
    - asttokens: 3.0.0
    - attrs: 25.3.0
    - certifi: 2025.7.14
    - cffi: 1.17.1
    - charset-normalizer: 3.4.2
    - clearml: 2.0.2
    - click: 8.2.1
    - comm: 0.2.2
    - configmypy: 0.2.0
    - contourpy: 1.3.2
    - cycler: 0.12.1
    - debugpy: 1.8.15
    - decorator: 5.2.1
    - exceptiongroup: 1.3.0
    - executing: 2.2.0
    - filelock: 3.18.0
    - fonttools: 4.59.0
    - frozenlist: 1.7.0
    - fsspec: 2025.7.0
    - furl: 2.1.4
    - gitdb: 4.0.12
    - grpcio: 1.73.1
    - h5py: 3.14.0
    - idna: 3.10
    - importlib_metadata: 8.7.0
    - iniconfig: 2.1.0
    - ipykernel: 6.29.5
    - ipympl: 0.9.7
    - ipython: 9.4.0
    - ipython_pygments_lexers: 1.1.1
    - ipywidgets: 8.1.7
    - jedi: 0.19.2
    - jsonschema: 4.24.1
    - jsonschema-specifications: 2025.4.1
    - jupyter_client: 8.6.3
    - jupyter_core: 5.8.1
    - jupyterlab_widgets: 3.0.15
    - kiwisolver: 1.4.8
    - lightning: 2.5.2
    - lightning-utilities: 0.14.3
    - matplotlib: 3.10.3
    - matplotlib-inline: 0.1.7
    - mpmath: 1.3.0
    - multidict: 6.6.3
    - munkres: 1.1.4
    - nest_asyncio: 1.6.0
    - networkx: 3.5
    - neuraloperator: 1.0.2
    - numpy: 2.3.1
    - nvidia-cublas-cu12: 12.8.3.14
    - nvidia-cuda-cupti-cu12: 12.8.57
    - nvidia-cuda-nvrtc-cu12: 12.8.61
    - nvidia-cuda-runtime-cu12: 12.8.57
    - nvidia-cudnn-cu12: 9.7.1.26
    - nvidia-cufft-cu12: 11.3.3.41
    - nvidia-cufile-cu12: 1.13.0.11
    - nvidia-curand-cu12: 10.3.9.55
    - nvidia-cusolver-cu12: 11.7.2.55
    - nvidia-cusparse-cu12: 12.5.7.53
    - nvidia-cusparselt-cu12: 0.6.3
    - nvidia-nccl-cu12: 2.26.2
    - nvidia-nvjitlink-cu12: 12.8.61
    - nvidia-nvtx-cu12: 12.8.55
    - opt_einsum: 3.4.0
    - orderedmultidict: 1.0.1
    - packaging: 25.0
    - pandas: 2.3.1
    - parso: 0.8.4
    - pathlib2: 2.3.7.post1
    - pexpect: 4.9.0
    - pickleshare: 0.7.5
    - pillow: 11.3.0
    - platformdirs: 4.3.8
    - pluggy: 1.6.0
    - prompt_toolkit: 3.0.51
    - propcache: 0.3.2
    - protobuf: 6.31.1
    - psutil: 7.0.0
    - ptyprocess: 0.7.0
    - pure_eval: 0.2.3
    - pybind11: 3.0.0
    - pycparser: 2.22
    - pydantic: 2.11.7
    - pydantic_core: 2.33.2
    - pyparsing: 3.2.3
    - pyroomacoustics: 0.8.4
    - pytest: 8.4.1
    - pytest-mock: 3.14.1
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.5.2
    - pytz: 2025.2
    - pyzmq: 27.0.0
    - referencing: 0.36.2
    - requests: 2.32.4
    - rpds-py: 0.26.0
    - ruamel.yaml: 0.18.14
    - ruamel.yaml.clib: 0.2.12
    - ruff: 0.12.4
    - scipy: 1.16.0
    - sentry-sdk: 2.33.0
    - setuptools: 80.9.0
    - shiboken6: 6.9.1
    - six: 1.17.0
    - smmap: 5.0.2
    - soundfile: 0.13.1
    - stack_data: 0.6.3
    - sympy: 1.14.0
    - tensorboard: 2.20.0
    - tensorboard-data-server: 0.7.2
    - tensorly: 0.9.0
    - tensorly-torch: 0.5.0
    - torch: 2.7.1+cu128
    - torch_harmonics: 0.7.3
    - torchaudio: 2.7.1+cu128
    - torchmetrics: 1.7.4
    - torchvision: 0.22.1+cu128
    - tornado: 6.5.1
    - tqdm: 4.67.1
    - traitlets: 5.14.3
    - triton: 3.3.1
    - typing-inspection: 0.4.1
    - typing_extensions: 4.14.1
    - tzdata: 2025.2
    - urllib3: 2.5.0
    - wandb: 0.21.0
    - wcwidth: 0.2.13
    - widgetsnbextension: 4.0.14
    - yarl: 1.20.1
    - zencfg: 0.3.0
    - zipp: 3.23.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.13.5
    - release: 6.14.0-37-generic
    - version: #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2
Current environment
  • CUDA:
    - GPU:
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
    - available: True
    - version: 13.0
  • Lightning:
    - lightning: 2.6.1
    - lightning-utilities: 0.15.2
    - pytorch-lightning: 2.6.1
    - torch: 2.10.0+cu130
    - torchaudio: 2.10.0+cu130
    - torchmetrics: 1.8.2
    - torchvision: 0.25.0+cu130
  • Packages:
    - Cython: 3.2.4
    - Jinja2: 3.1.6
    - Markdown: 3.10.2
    - MarkupSafe: 3.0.3
    - PyJWT: 2.10.1
    - PySide6: 6.10.2
    - PyYAML: 6.0.3
    - Pygments: 2.19.2
    - Werkzeug: 3.1.6
    - absl-py: 2.4.0
    - aiohappyeyeballs: 2.6.1
    - aiohttp: 3.13.3
    - aiosignal: 1.4.0
    - asttokens: 3.0.1
    - attrs: 25.4.0
    - certifi: 2026.1.4
    - cffi: 2.0.0
    - charset-normalizer: 3.4.4
    - clearml: 2.1.3
    - comm: 0.2.3
    - contourpy: 1.3.3
    - cuda-bindings: 13.0.3
    - cuda-pathfinder: 1.3.4
    - cycler: 0.12.1
    - debugpy: 1.8.20
    - decorator: 5.2.1
    - executing: 2.2.1
    - filelock: 3.24.3
    - fonttools: 4.61.1
    - frozenlist: 1.8.0
    - fsspec: 2026.2.0
    - furl: 2.1.4
    - grpcio: 1.78.1
    - h5py: 3.15.1
    - idna: 3.11
    - ipykernel: 7.2.0
    - ipympl: 0.10.0
    - ipython: 9.10.0
    - ipython_pygments_lexers: 1.1.1
    - ipywidgets: 8.1.8
    - jedi: 0.19.2
    - jsonschema: 4.26.0
    - jsonschema-specifications: 2025.9.1
    - jupyter_client: 8.8.0
    - jupyter_core: 5.9.1
    - jupyterlab_widgets: 3.0.16
    - kiwisolver: 1.4.9
    - lightning: 2.6.1
    - lightning-utilities: 0.15.2
    - matplotlib: 3.10.8
    - matplotlib-inline: 0.2.1
    - mpmath: 1.3.0
    - multidict: 6.7.1
    - munkres: 1.1.4
    - nest_asyncio: 1.6.0
    - networkx: 3.6.1
    - numpy: 2.4.2
    - nvidia-cublas: 13.1.0.3
    - nvidia-cuda-cupti: 13.0.85
    - nvidia-cuda-nvrtc: 13.0.88
    - nvidia-cuda-runtime: 13.0.96
    - nvidia-cudnn-cu13: 9.15.1.9
    - nvidia-cufft: 12.0.0.61
    - nvidia-cufile: 1.15.1.6
    - nvidia-curand: 10.4.0.35
    - nvidia-cusolver: 12.0.4.66
    - nvidia-cusparse: 12.6.3.3
    - nvidia-cusparselt-cu13: 0.8.0
    - nvidia-nccl-cu13: 2.28.9
    - nvidia-nvjitlink: 13.0.88
    - nvidia-nvshmem-cu13: 3.4.5
    - nvidia-nvtx: 13.0.85
    - orderedmultidict: 1.0.2
    - packaging: 26.0
    - pandas: 3.0.1
    - parso: 0.8.6
    - pathlib2: 2.3.7.post1
    - pexpect: 4.9.0
    - pillow: 12.1.1
    - platformdirs: 4.9.2
    - prompt_toolkit: 3.0.52
    - propcache: 0.4.1
    - protobuf: 6.33.5
    - psutil: 7.2.2
    - ptyprocess: 0.7.0
    - pure_eval: 0.2.3
    - pybind11: 3.0.2
    - pycparser: 3.0
    - pyparsing: 3.3.2
    - pyroomacoustics: 0.9.0
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.6.1
    - pyzmq: 27.1.0
    - referencing: 0.37.0
    - requests: 2.32.5
    - rpds-py: 0.30.0
    - ruff: 0.15.2
    - scipy: 1.17.0
    - setuptools: 82.0.0
    - shiboken6: 6.10.2
    - six: 1.17.0
    - soundfile: 0.13.1
    - stack_data: 0.6.3
    - sympy: 1.14.0
    - tensorboard: 2.20.0
    - tensorboard-data-server: 0.7.2
    - torch: 2.10.0+cu130
    - torchaudio: 2.10.0+cu130
    - torchmetrics: 1.8.2
    - torchvision: 0.25.0+cu130
    - tornado: 6.5.3
    - tqdm: 4.67.3
    - traitlets: 5.14.3
    - triton: 3.6.0
    - typing_extensions: 4.15.0
    - unicodedata2: 17.0.1
    - urllib3: 2.6.3
    - wcwidth: 0.6.0
    - widgetsnbextension: 4.0.15
    - yarl: 1.22.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.14.3
    - release: 6.14.0-37-generic
    - version: #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2

More info

Sorry that I can't include any code for reproduction, but I hope this is enough to track down the regression.
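In the absence of a full reproduction, one way to narrow this down might be to time how long the first batch of each epoch takes to arrive when the dataloader is iterated in a plain loop, outside the Trainer. The sketch below is only a diagnostic idea, not a confirmed cause: it assumes the freeze happens while waiting for the first batch, which this report does not establish. `FirstBatchTimer` and `SlowStart` are hypothetical names, and `SlowStart` merely stands in for a real dataloader.

```python
import time
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class FirstBatchTimer:
    """Wrap any iterable of batches and record the time to the first item of each pass."""

    def __init__(self, batches: Iterable[T]) -> None:
        self.batches = batches
        self.first_batch_seconds: List[float] = []

    def __iter__(self) -> Iterator[T]:
        start = time.perf_counter()
        first = True
        # Creating the underlying iterator happens inside the timed region,
        # so any per-epoch startup cost is included in the measurement.
        for batch in self.batches:
            if first:
                self.first_batch_seconds.append(time.perf_counter() - start)
                first = False
            yield batch


class SlowStart:
    """Stand-in for a dataloader that stalls before yielding its first batch."""

    def __iter__(self) -> Iterator[int]:
        time.sleep(0.05)  # simulated startup delay
        yield from range(3)


timed = FirstBatchTimer(SlowStart())
for epoch in range(2):
    for batch in timed:
        pass  # a training step would go here

print([f"{s:.3f}" for s in timed.first_batch_seconds])
```

If the per-epoch first-batch time measured this way is also close to a minute in the new environment and near zero in the old one, that would point at data loading rather than the training loop itself.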

cc @ethanwharris
