Conversation

@Aman-Dwivedi

Added AMD GPU support.
Updated the requirements for ROCm and added functions in setup.py to detect AMD GPUs. An example script, originally by yiakwy-xpu-ml-framework-team, has also been added.

@avjves
Contributor

avjves commented Sep 22, 2025

Hi! This adds a custom requirements file specifically for ROCm - is there a reason for that? Also, the yunchang branch/version it installs is a year old, with only some changes on top. Yunchang already supports AMD GPUs in the upstream repo via flash_attn or AITER (the latest way to call FA on AMD GPUs), so this looks like a regression.

Also, this breaks the changes made by PR #559 due to the duplicate imports in xfuser/core/long_ctx_attention/ring/ring_flash_attn.py. Currently it's gated like this:
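(The referenced snippet did not survive this transcript. Purely as an illustrative sketch, not the actual xfuser code, gating an optional attention backend typically looks like this:)

```python
# Illustrative sketch only -- not the actual xfuser code.
# Optional attention backends are usually imported behind a guard so that
# a second, unconditional import elsewhere causes exactly the kind of
# duplicate-import conflict described above.
try:
    from flash_attn import flash_attn_func  # real package; may be absent
    HAS_FLASH_ATTN = True
except ImportError:
    flash_attn_func = None
    HAS_FLASH_ATTN = False
```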

Collaborator

@feifeibear feifeibear left a comment


LGTM

@feifeibear
Collaborator

@Aman-Dwivedi could you please check the duplicate imports problem mentioned before?

@avjves
Contributor

avjves commented Sep 23, 2025

This line in the requirements:

yunchang @ git+https://github.com/yiakwy-xpu-ml-framework-team/xDiT-long-context-attention-fork.git@add_amd_gpu_suppport

is also a big problem in general for AMD GPUs. Could it also be removed? 😄

@eppaneamd
Contributor

@feifeibear kindly note that this PR should be revisited and its merits re-evaluated.

@Aman-Dwivedi could you elaborate on why this PR is needed and how xDiT and yunchang are currently not working for AMD GPUs? Why is gfx942 the only allowed GPU arch? Have you tried newer images than rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0, such as rocm7.0_ubuntu22.04_py3.10_pytorch_release_2.8.0?
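(For illustration only: rather than hard-coding gfx942, a setup-time check could enumerate the gfx architectures the node actually reports. `detect_amd_gfx_archs` below is a hypothetical helper parsing `rocminfo`-style output, not code from this PR.)

```python
import re

def detect_amd_gfx_archs(rocminfo_output: str) -> set[str]:
    """Hypothetical helper: collect every gfx architecture name that
    appears in `rocminfo`-style output (e.g. gfx942 on MI300X)."""
    return set(re.findall(r"\bgfx[0-9]+[a-z]*\b", rocminfo_output))
```

On an MI300X node this should return {"gfx942"}, and the same check keeps working as newer architectures appear instead of allow-listing one.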

@Aman-Dwivedi
Author

Aman-Dwivedi commented Oct 12, 2025

@avjves @eppaneamd This PR is in response to #437. I cherry-picked the commit mentioned in the issue and made some changes to run it smoothly on the AMD cluster. I was able to run an example. I have since tried running the main branch. After facing a lot of hurdles I can run it now, but I don't think it is running properly: it produced a gibberish image, attached below. I tried following this readme (https://github.com/feifeibear/long-context-attention/blob/main/docs/install_amd.md) for installing yunchang, but it resulted in errors. Is there another, updated readme on making xDiT run on an AMD cluster? Is the garbled image due to some error on my part, or is there an issue with xDiT?
(attached image: garbled xDiT output)

@avjves
Contributor

avjves commented Oct 15, 2025

@Aman-Dwivedi
xDiT and yunchang should both work OOB with AMD GPUs, though I haven't really used pipeline parallelism myself. There are some recent AMD-related commits that are not yet in releases, so I'd recommend building both from source. It should be enough to run pip3 install -e . inside the cloned repositories. You should also install AITER or flash_attn to speed up attention. Instructions for AITER are here (https://github.com/ROCm/aiter?tab=readme-ov-file#installation).

Was the above image generated with the run_amdgpu_1x8.sh script included in this PR? If not, can you post the command you used? :)

@Aman-Dwivedi
Author

@avjves
I tried checking out and building both xDiT and yunchang from source, along with AITER. I am still facing some errors (pasted below). I also found a doc for installing yunchang on AMD GPUs (https://github.com/feifeibear/long-context-attention/blob/main/docs/install_amd.md). Initially, I tried installing yunchang directly using pip install ., but it resulted in the error below. Following the doc resulted in some other errors and I was not able to install using that method either. Is the doc still up to date with the newer version of yunchang, or would directly installing yunchang on AMD GPUs work?

For generating the above image I used the below command:
python -m torch.distributed.run --nproc_per_node=1 examples/pixartalpha_example.py --model PixArt-alpha/PixArt-XL-2-1024-MS --height 512 --width 512 --prompt "a cute dog" --num_inference_steps 10 --guidance_scale 4.5

Error:
(xdit) yangzhou@chi-mi300x-041:~/aman/xDiT$ torchrun --nproc_per_node=8 \
    examples/pixartalpha_example.py \
    --model models/PixArt-XL-2-1024-MS \
    --pipefusion_parallel_degree 2 \
    --ulysses_degree 2 \
    --num_inference_steps 20 \
    --warmup_steps 0 \
    --prompt "A cute dog" \
    --use_cfg_parallel
W1016 20:03:19.944000 1048711 site-packages/torch/distributed/run.py:803]
W1016 20:03:19.944000 1048711 site-packages/torch/distributed/run.py:803] *****************************************
W1016 20:03:19.944000 1048711 site-packages/torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1016 20:03:19.944000 1048711 site-packages/torch/distributed/run.py:803] *****************************************
[aiter] import [module_aiter_enum] under /home/yangzhou/aman/aiter/aiter/jit/module_aiter_enum.so
    (line repeated once per rank, 8 times in total)
INFO 10-16 20:03:54 [envs.py:196] Using AITER as the attention library
    (line repeated once per rank)
WARNING 10-16 20:03:55 [args.py:377] Distributed environment is not initialized. Initializing...
Traceback (most recent call last):
  File "/home/yangzhou/aman/xDiT/examples/pixartalpha_example.py", line 83, in <module>
    main()
  File "/home/yangzhou/aman/xDiT/examples/pixartalpha_example.py", line 21, in main
    engine_config, input_config = engine_args.create_config()
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yangzhou/aman/xDiT/xfuser/config/args.py", line 380, in create_config
    init_distributed_environment()
  File "/home/yangzhou/aman/xDiT/xfuser/core/distributed/parallel_state.py", line 221, in init_distributed_environment
    backend = envs.get_torch_distributed_backend()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yangzhou/aman/xDiT/xfuser/envs.py", line 133, in get_torch_distributed_backend
    raise NotImplementedError(
NotImplementedError: No Accelerators(AMD/NV/MTT GPU, AMD MI instinct accelerators) available
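(For context on the traceback: the failing check roughly amounts to the following, shown as an illustrative sketch rather than the actual xfuser code. On ROCm builds of PyTorch, AMD GPUs are still reported through torch.cuda, so a healthy install takes the nccl path.)

```python
# Illustrative sketch of the failing backend selection -- not actual xfuser code.
def get_torch_distributed_backend(cuda_available: bool) -> str:
    # PyTorch's ROCm build exposes AMD GPUs via torch.cuda.is_available(),
    # and "nccl" maps to RCCL there, so NV and AMD share the same branch.
    if cuda_available:
        return "nccl"
    raise NotImplementedError(
        "No Accelerators(AMD/NV/MTT GPU, AMD MI instinct accelerators) available"
    )
```

The error therefore means PyTorch saw no devices at all, pointing at a broken PyTorch/ROCm install rather than at xDiT itself.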

@avjves
Contributor

avjves commented Oct 17, 2025

@Aman-Dwivedi Aha, it seems PR #566, which was just merged, accidentally broke support for AMD devices. I have a PR open to fix that, #577, but you can cherry-pick the changes from there if you want to test it prior to it being merged. After that, as long as you have a working PyTorch environment, you should be good. EDIT: that PR is already merged :)

I haven't really tested pipeline parallelism myself, and running your above command with the fixes still runs into an error, though I don't believe that to be AMD-specific. Pure sequence parallelism with CFG at least works OOB:

torchrun --nproc_per_node=8 examples/pixartalpha_example.py --model PixArt-alpha/PixArt-XL-2-1024-MS  --ulysses_degree 4 --num_inference_steps 20 --warmup_steps 0 --prompt "A cute dog" --use_cfg_parallel
(attached image: pixart_alpha_result_dp1_cfg2_ulysses4_ringNone_pp1_patchNone_tc_False_0)
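(A quick sanity check on these flags, written as a hypothetical helper rather than xDiT's own validation: the product of all parallel degrees must equal the number of launched processes.)

```python
def implied_world_size(dp: int = 1, cfg: int = 1, ulysses: int = 1,
                       ring: int = 1, pipefusion: int = 1) -> int:
    """Hypothetical helper: total ranks implied by the parallel degrees."""
    return dp * cfg * ulysses * ring * pipefusion

# --use_cfg_parallel doubles the config (cfg=2), so with --ulysses_degree 4
# the command above needs 2 * 4 = 8 processes, matching --nproc_per_node=8.
```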

@jcaraban
Collaborator

Closing because this PR lost focus and doesn't seem to fix what it claims.
xDiT indeed runs OOB with ROCm devices, as long as PyTorch is installed correctly.

We can extend the README to make this clearer. We could also add a Docker image with a ready ROCm environment.
@Aman-Dwivedi please let me know if you still face specific issues.

@jcaraban jcaraban closed this Oct 29, 2025
@avjves
Contributor

avjves commented Oct 29, 2025

@Aman-Dwivedi

Here's a small Dockerfile to run PixArt:

FROM ubuntu
WORKDIR /app
RUN apt update && apt install python3-pip git -y && pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.0 --break-system-packages
RUN git clone https://github.com/feifeibear/long-context-attention.git && cd long-context-attention && pip install -e . --break-system-packages
RUN git clone https://github.com/xdit-project/xDiT.git && cd xDiT && pip install -e . --break-system-packages
CMD cd xDiT && torchrun --nproc_per_node=8 examples/pixartalpha_example.py --model PixArt-alpha/PixArt-XL-2-1024-MS  --ulysses_degree 4 --num_inference_steps 20 --warmup_steps 0 --prompt "A cute dog" --use_cfg_parallel

Build that and run it:

docker run --ipc host --device /dev/dri --device /dev/kfd --privileged --shm-size 128G -v $PWD/results:/app/xDiT/results <built_image_tag>

After it's done, the picture should be in the results folder :)

@Aman-Dwivedi
Author

Aman-Dwivedi commented Oct 29, 2025

@avjves Thanks for sharing this. I was able to run xDiT both within a node and across multiple nodes. Thank you so much for your help. I have tried it out with AITER and it works; I haven't tried flash attention. I agree with @jcaraban about extending the README. Since pipeline parallelism does not work, the README demo command could be updated to drop pipefusion_parallel_degree. Once again, thank you so much for all your help!
Also, could you close issue #437? That was my initial motivation for building AMD support, but clearly it is already there.
