Fix TPU with new XLA device type
#2467
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Sorry for the code that breaks the task on TPU. In one of my earliest versions, I checked the device_type using the
No worries. This is a subtle bug on GPU, and unfortunately we don't have any TPU CI set up in this repository. I'll go ahead and replace this check with
LGTM, thanks!
Big fan of not having a try/catch + complicated logic there. Very nice!
And glad to see these bugs are fixed. To be clear: now Accelerate won't crash on TPU-XLA? :) (I think we had an issue in Transformers about it)
Also please run make style; make quality to fix the quality check :)
d9ab16d to 2e0d667
Oops, fixed.
There's still an outstanding issue on TPU v2 and v3 that @vanbasten23 is working on.
Great, thanks @will-cromar! Also found the transformers issue for posterity: huggingface/transformers#28204
What does this PR do?
#2176 replaces the `TPU` device type with `XLA`, letting us use GPUs with `accelerate` now 🎊

This PR fixes some issues that pop up on TPU after that PR:
- `xm.xla_device` in `is_torch_xla_available`. Calling `xm.xla_device` before `xmp.spawn` causes issues: it makes `torch_xla` initialize the runtime in the parent process, reserving some space on GPU that can't be used by the child processes and causing TPU workloads to outright crash (message below).
- `accelerate config` to offer `XLA` as an option. Selecting `TPU` causes an error because that device type no longer exists.

Currently, running `accelerate` on TPU causes this crash due to the first issue:

Tested `accelerate test` on TPU v4-8.
Who can review?
cc @muellerzr @anw90 @vanbasten23