
Conversation

@will-cromar (Contributor) commented Feb 20, 2024

What does this PR do?

#2176 replaces the TPU device type with XLA, letting us use GPUs with accelerate now 🎊

This PR fixes some issues that pop up on TPU after that PR:

  • Don't check xm.xla_device in is_torch_xla_available. Calling xm.xla_device before xmp.spawn causes torch_xla to initialize the runtime in the parent process, reserving memory on the GPU that the child processes can't use and causing TPU workloads to crash outright (error message below; see the sketch after this list).
  • Fix menu of options in accelerate config to offer XLA as an option. Selecting TPU causes an error because that device type no longer exists.
  • Allow bf16 mixed precision on TPU, matching the old behavior before Make torch xla available on GPU #2176.
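
For the first item, here is a minimal sketch of the intended check, assuming Accelerate-style check_is_tpu/check_is_gpu flags (an illustration, not necessarily the exact code in this PR): verify only that torch_xla is importable, and when the device type matters, read it through torch_xla.runtime.device_type() rather than calling xm.xla_device().

```python
# Minimal sketch, not the exact Accelerate implementation; the check_is_* flags are illustrative.
import importlib.util


def is_torch_xla_available(check_is_tpu: bool = False, check_is_gpu: bool = False) -> bool:
    # Only verify that torch_xla is importable; never call xm.xla_device() here,
    # since that would initialize the PJRT runtime in the parent process.
    if importlib.util.find_spec("torch_xla") is None:
        return False
    if check_is_tpu or check_is_gpu:
        import torch_xla.runtime as xr

        # device_type() reports the configured PJRT device without creating one.
        device = xr.device_type()
        return device == "TPU" if check_is_tpu else device in ("GPU", "CUDA", "ROCM")
    return True
```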

Currently, running accelerate on TPU causes this crash due to the first issue:

...
F0000 00:00:1708382221.197251   23274 pjrt_registry.cc:117] Non-OK-status: pjrt::LoadPjrtPlugin("tpu", tpu_library_path).status() status: ALREADY_EXISTS: PJRT_Api already exists for device type tpu
...
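
For illustration, a hedged reconstruction of the failure mode described above (not code from this PR): the XLA device is touched in the parent process before xmp.spawn, so the runtime is already initialized when the child processes start.

```python
# Hedged illustration of the anti-pattern; not code from this PR.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _worker(index):
    device = xm.xla_device()  # fine: each child initializes its own runtime
    ...


xm.xla_device()     # problem: initializes the PJRT runtime in the parent process
xmp.spawn(_worker)  # children may then crash, e.g. "PJRT_Api already exists for device type tpu"
```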

Tested with accelerate test on a TPU v4-8.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

cc @muellerzr @anw90 @vanbasten23


@anw90 (Contributor) commented Feb 21, 2024

  • Don't check the xm.xla_device if we don't need to know the device type in is_torch_xla_available. Calling xm.xla_device before xmp.spawn causes issues. This causes torch_xla to initialize the runtime in the parent process, reserving some space on GPU that can't be used by the child processes and causing TPU workloads to crash outright (message below). (Can we just check torch_xla.runtime.device_type() instead? @anw90)

Sorry that the code breaks the task on TPU. In one of my earliest versions, I checked the device type using the PJRT_DEVICE value in is_torch_xla_available. Later, I changed it to the current implementation to decouple it from external environment variables. I think it's okay to use torch_xla.runtime.device_type() to check the device type if the current implementation crashes on TPU.

@will-cromar (Contributor, Author) commented

Sorry that the code breaks the task on TPU. In one of my earliest versions, I checked the device type using the PJRT_DEVICE value in is_torch_xla_available. Later, I changed it to the current implementation to decouple it from external environment variables. I think it's okay to use torch_xla.runtime.device_type() to check the device type if the current implementation crashes on TPU.

No worries. This is a subtle bug on GPU, and unfortunately we don't have any TPU CI set up in this repository.

I'll go ahead and replace this check with torch_xla.runtime.device_type(), since it's more straightforward and less risky than digging into the device hardware type.
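
As a hedged illustration of the two checks discussed above (neither snippet is verbatim from the PR):

```python
import os

# Earlier approach: read the device type straight from the PJRT_DEVICE
# environment variable, which couples the check to how that variable is set.
device_from_env = os.environ.get("PJRT_DEVICE")  # e.g. "TPU" or "CUDA"

# Approach settled on here: ask torch_xla itself, which stays decoupled from
# the environment and, per the discussion above, avoids initializing the runtime.
import torch_xla.runtime as xr

device_from_runtime = xr.device_type()
```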

@anw90 (Contributor) commented Feb 22, 2024

LGTM, thanks!

@muellerzr (Contributor) left a comment

Big fan of not having a try/catch + complicated logic there. Very nice!

And glad to see these bugs are fixed. To be clear: now Accelerate won't crash on TPU-XLA? :) (I think we had an issue in Transformers about it)

Also please run make style; make quality to fix the quality check :)

@will-cromar (Contributor, Author) commented

Also please run make style; make quality to fix the quality check :)

Oops, fixed.

To be clear: now Accelerate won't crash on TPU-XLA?

There's still an outstanding issue on TPU v2 and v3 that @vanbasten23 is working on. accelerate test won't crash on v4 and v5 after this change.

@muellerzr (Contributor) commented

Great, thanks @will-cromar!

Also found the transformers issue for posterity: huggingface/transformers#28204

@muellerzr merged commit c0b441f into huggingface:main on Feb 26, 2024