
PyTorch Lightning upgrade, logs display change. #772

@marctessier

Description

Bug description

I noticed the other day that PyTorch Lightning gets upgraded when building a new environment.

I trained a small LJ model up to 1000 epochs with both the latest version and the previous one. I did not notice much difference in the training curves or output.

./make-everyvoice-env --conda -n EveryVoice_2026-02-17_ptl_latest
...

(EveryVoice_2026-02-17_ptl_latest) $ pip list | grep pytorch-lightning
pytorch-lightning         2.6.1
(EveryVoice_2026-02-17_ptl_latest) $ 

vs. pytorch-lightning 2.4.0 in the previous environment.

One thing I did notice is how training progress is displayed in the output logs.

When I ran the training as a "job" on the GPSC, the output log does not display live progress under "tail -f".

We are also being presented with new information:

cat  PTL_latest.o971987
Done Loading... (0:00:15.19)
┏━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃   ┃ Name               ┃ Type                ┃ Params ┃ Mode  ┃ FLOPs ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ 0 │ loss               │ FastSpeech2Loss     │      0 │ train │     0 │
│ 1 │ text_input_layer   │ Embedding           │ 20.0 K │ train │     0 │
│ 2 │ position_embedding │ PositionalEmbedding │      0 │ train │     0 │
│ 3 │ encoder            │ Conformer           │  6.1 M │ train │     0 │
│ 4 │ variance_adaptor   │ VarianceAdaptor     │  1.6 M │ train │     0 │
│ 5 │ decoder            │ Conformer           │  6.1 M │ train │     0 │
│ 6 │ mel_linear         │ Linear              │ 20.6 K │ train │     0 │
│ 7 │ postnet            │ PostNet             │  4.3 M │ train │     0 │
└───┴────────────────────┴─────────────────────┴────────┴───────┴───────┘
Trainable params: 18.2 M                                                        
Non-trainable params: 510                                                       
Total params: 18.2 M                                                            
Total estimated model params size (MB): 72                                      
Modules in train mode: 471                                                      
Modules in eval mode: 0                                                         
Total FLOPs: 0          
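The boxed summary table above looks like Lightning's Rich-style model summary. If the older plain-text summary and progress output are preferred, Lightning lets you pass the non-Rich callbacks explicitly. A minimal sketch, assuming a plain `pytorch_lightning` setup (EveryVoice's actual Trainer wiring may differ):

```python
# Sketch: request the classic (non-Rich) model summary and TQDM progress bar.
# Assumes default pytorch_lightning behavior; EveryVoice may configure
# its Trainer differently.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelSummary, TQDMProgressBar

trainer = pl.Trainer(
    max_epochs=500,
    callbacks=[
        ModelSummary(max_depth=1),       # plain-text layer table
        TQDMProgressBar(refresh_rate=1), # classic tqdm-style progress output
    ],
)
```

This is a configuration sketch only; whether it restores the 2.4.0 rendering depends on which callbacks the new release installs by default.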

BUT, I noticed that if I started the training directly on the command line in a "sleeper" job, I was able to see "LIVE" progress. The format did change, though.

ex:

Epoch 3/499 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━ 46/56 0:00:05 • 0:00:02 8.30it/s v_num: base training/pitch_loss: 0.080 training/energy_loss: 0.071 training/duration_loss: 0.063 training/spec_loss: 2.155       
Epoch 3/499 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━ 46/56 0:00:05 • 0:00:02 8.30it/s v_num: base training/pitch_loss: 0.068 training/energy_loss: 0.062 training/duration_loss: 0.063 training/spec_loss: 2.311   
Epoch 3/499 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━ 47/56 0:00:06 • 0:00:02 8.30it/s v_num: base training/pitch_loss: 0.068 training/energy_loss: 0.062                    
                                                                                      training/duration_loss: 0.063 training/spec_loss: 2.311 training/postnet_lossEpoch 3/499 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━ 47/56 0:00:06 • 0:00:02 8.30it/s v_num: base training/pitch_loss: 0.063 training/energy_loss: 0.061                 
                                                                                      training/duration_loss: 0.059 training/spec_loss: 2.101 training/postnet_loss:     
INFO - `Trainer.fit` stopped: `max_epochs=500` reached.
Epoch 499/499 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 0:00:07 • 0:00:00 8.27it/s v_num: base training/pitch_loss: 0.006 training/energy_loss: 0.005     
                                                                                        training/duration_loss: 0.005 training/spec_loss: 0.130                
                                                                                        training/postnet_loss: 0.129 training/attn_ctc_loss: 0.275             
                                                                                        training/attn_bin_loss: 0.033 training/total_loss: 0.582               
Loading EveryVoice modules: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [1:13:01<00:00, 1095.46s/it]
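The difference between the two runs is most likely TTY detection: progress-bar renderers emit in-place carriage-return updates only when stdout is a terminal, and a GPSC job writes stdout to a file, so live redraws are suppressed. A minimal self-contained sketch of that check (`choose_refresh` is a hypothetical helper, not EveryVoice code):

```python
import sys

def choose_refresh(live_rate: int = 1, batch_rate: int = 0) -> int:
    """Pick a progress refresh rate: live updates when attached to a
    terminal, none when stdout is redirected to a log file.
    Hypothetical helper mirroring the TTY check progress-bar
    libraries perform internally."""
    return live_rate if sys.stdout.isatty() else batch_rate

# Under a scheduler, stdout is a file: isatty() returns False,
# so no live in-place progress bar is drawn.
print("interactive" if sys.stdout.isatty() else "redirected")
```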

Question, are we OK with this new behaviour?

How to reproduce the bug

Train from the command line and from a job; the output logs are not the same.

Error messages and logs

# Error messages and logs here please

Environment

Current environment
# Please paste the output of `everyvoice --diagnostic` here
# EveryVoice Diagnostic information

More info

No response
