Hi all,
I am on an M3 Mac and am trying to launch a skypilot_executor pretraining job on a remote k8s cluster. In doing so, I am running into the following error: ModuleNotFoundError: No module named 'nvidia_resiliency_ext', which I cannot install locally since it is only compiled for linux machines (and depends on CUDA libraries). I believe the offending code is inside the NemoLogger and can be bypassed all together in my use-case since I'm not using the FaultTolerance launcher. Would appreciate any help in resolving this issue!