Bug when batch size is larger than 1 during training

Hi there,

Thanks for this amazing work! I noticed that this project takes only 30 minutes to train on 8 GPUs, which is quite impressive. We tried running experiments on one or two A100 80GB GPUs, but found that training takes more than 100 hours. Even assuming linear scaling with GPU count, 30 minutes on 8 GPUs would correspond to only about 4 hours on one GPU or 2 hours on two, so this gap is very surprising. Have you encountered a similar issue? Could it be that we have misconfigured something?

Best