
[ROCm] [CI] Add ROCm Dockerfile and AMD CI (vllm v0.12.0) #280

Merged
DarkLight1337 merged 39 commits into vllm-project:main from EmbeddedLLM:dockerfile-amd-2
Dec 21, 2025

Conversation

@tjtanaa
Contributor

@tjtanaa tjtanaa commented Dec 11, 2025


Purpose

Setting up CI and AMD Docker images

AMD CI

The vLLM Omni AMD CI is set up following the vLLM AMD CI. Since the whole setup (agents and agent environment) is the same, I have adapted the bootstrap files from the vLLM AMD CI to work with the vLLM Omni AMD CI.

I suggest that we keep the bootstrap code confined to this repo for now, for easier maintainability. Let me know if anyone thinks otherwise.

The difference between the vLLM Omni AMD CI and the vLLM AMD CI is that we try to keep everything upstream compatible. The AMD CI Docker image is stored in the same AWS repository as the images for the other hardware.

The tests that are currently enabled:

  • "Diffusion Model Test"
  • "Diffusion Cache Backend Test"
  • "Omni Model Test Qwen2-5-Omni"

The remaining test is under investigation:

  • "Omni Model Test Qwen3-Omni"

AMD Docker images

Build the Docker image (the same image is also used for CI):

docker build -f docker/Dockerfile.rocm -t vllm-omni-rocm .

How to launch the docker image

#!/bin/bash
docker run -it \
   --privileged \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   --name vllm-omni-rocm \
   vllm-omni-rocm \
   bash

Updated the installation step to include the Docker build and Docker launch instructions.

Test Plan

Built locally and was able to run the examples without failure.

Passed the vllm-omni AMD CI.
Latest run: https://buildkite.com/vllm/vllm-omni-amd-ci/builds/43/steps/canvas (also visible in the CI checks at the bottom of this page).

Test Result

Ran the Qwen-Image examples:

cd examples/offline_inference/qwen_image/

python text_to_image.py \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png

Extra details for future PR

The create_new_process_for_each_test decorator does not seem to be working. In spawn mode, the body of a test decorated with create_new_process_for_each_test does not run, and the test is reported as PASSED immediately.
The tests on CUDA and NPU run correctly because on those two platforms create_new_process_for_each_test evaluates to fork, while vLLM internally runs with spawn.
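A minimal sketch (assumed shape, not vLLM's actual implementation) of the fork-based per-test-process pattern that works here. With "fork", the child inherits the parent's memory, so the wrapped test body runs without being pickled; with "spawn", the child re-imports the module, and a decorator that does not correctly round-trip the real test function can exit 0 without ever running the body.

```python
import multiprocessing

def run_test_in_child(fn):
    # Sketch of a create_new_process_for_each_test-style decorator
    # (illustrative, not vLLM's actual code). The forked child inherits
    # the parent process state, so fn runs directly in the child.
    def wrapper():
        ctx = multiprocessing.get_context("fork")
        p = ctx.Process(target=fn)
        p.start()
        p.join()
        # Fail the test if the child process failed.
        assert p.exitcode == 0, f"child exited with {p.exitcode}"
    return wrapper

def sample_test():
    # Runs inside the forked child.
    print("test body executed")

if __name__ == "__main__":
    run_test_in_child(sample_test)()
```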


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Signed-off-by: tjtanaa <[email protected]>
Comment on lines 1 to 2
ARG ROCM_BASE_IMAGE=rocm/vllm-dev
ARG ROCM_BASE_TAG=nightly_main_20251005
Contributor

Please simplify this.

Suggested change
ARG ROCM_BASE_IMAGE=rocm/vllm-dev
ARG ROCM_BASE_TAG=nightly_main_20251005
ARG BASE_IMAGE=rocm/vllm-dev:nightly_main_20251005

@@ -0,0 +1,41 @@
ARG ROCM_BASE_IMAGE=rocm/vllm-dev
ARG ROCM_BASE_TAG=nightly_main_20251005
FROM ${ROCM_BASE_IMAGE}:${ROCM_BASE_TAG}
Contributor

Suggested change
FROM ${ROCM_BASE_IMAGE}:${ROCM_BASE_TAG}
FROM ${BASE_IMAGE}

ARG ROCM_BASE_TAG=nightly_main_20251005
FROM ${ROCM_BASE_IMAGE}:${ROCM_BASE_TAG}

ARG APP_DIR=/workspace/vllm-omni
Contributor

Please use the same /app directory as original vllm image

Suggested change
ARG APP_DIR=/workspace/vllm-omni
ARG COMMON_WORKDIR=/app

ARG VLLM_VERSION=v0.11.0
ARG PYTORCH_ROCM_ARCH="gfx942;gfx950"

WORKDIR ${APP_DIR}
Contributor

Suggested change
WORKDIR ${APP_DIR}
WORKDIR ${COMMON_WORKDIR}

rm -rf /var/lib/apt/lists/*

# Step 2: Reinstall vllm from source
RUN cd ../ && python3 -m pip uninstall -y vllm && \
Contributor

Suggested change
RUN cd ../ && python3 -m pip uninstall -y vllm && \
RUN python3 -m pip uninstall -y vllm && \

Comment on lines 36 to 39
ENV VLLM_ROCM_USE_AITER=1
ENV VLLM_ROCM_USE_AITER_MHA=1
ENV VLLM_ROCM_USE_AITER_LINEAR=0
ENV VLLM_ROCM_USE_AITER_RMSNORM=0
Contributor

we should not put this in the common Dockerfile.

Suggested change
ENV VLLM_ROCM_USE_AITER=1
ENV VLLM_ROCM_USE_AITER_MHA=1
ENV VLLM_ROCM_USE_AITER_LINEAR=0
ENV VLLM_ROCM_USE_AITER_RMSNORM=0

Contributor

These environment variables, if needed, can be set when invoking/running the unit tests.
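For example, the env-prefix form keeps the AITER toggles scoped to a single command instead of baking them into the shared Dockerfile. The command below just checks that the variable is visible; in CI it would be the actual pytest invocation:

```shell
# Scope the AITER toggles to one command's environment (the python3 -c
# check is illustrative; replace it with the real test command).
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=1 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
python3 -c 'import os; print(os.environ["VLLM_ROCM_USE_AITER"])'
```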

rm -rf vllm/.git

# Step 3: Copy vllm-omni code and install without uv
COPY . ${APP_DIR}
Contributor

adjust to use {WORKDIR}

tjtanaa and others added 15 commits December 12, 2025 11:54
Co-authored-by: Hongxia Yang <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
… is rocm to handle unit tests that are failing

Signed-off-by: tjtanaa <[email protected]>
@tjtanaa tjtanaa changed the title [ROCm] Add ROCm Dockerfile [ROCm] [CI] Add ROCm Dockerfile and AMD CI Dec 17, 2025
@tjtanaa
Contributor Author

tjtanaa commented Dec 18, 2025

@tjtanaa
Contributor Author

tjtanaa commented Dec 20, 2025

This PR also resolves #387

@tjtanaa tjtanaa changed the title [ROCm] [CI] Add ROCm Dockerfile and AMD CI [ROCm] [CI] Add ROCm Dockerfile and AMD CI (vllm v0.12.0 Dec 20, 2025
@tjtanaa
Contributor Author

tjtanaa commented Dec 20, 2025

@hongxiayang your requested review has been addressed.

@hsliuustc0106 Could you help to review this PR? Thank you. The unit tests for the Teacache and DiTCache are also passing.

@tjtanaa tjtanaa changed the title [ROCm] [CI] Add ROCm Dockerfile and AMD CI (vllm v0.12.0 [ROCm] [CI] Add ROCm Dockerfile and AMD CI (vllm v0.12.0) Dec 20, 2025
@ywang96 ywang96 added the ready label to trigger buildkite CI label Dec 20, 2025
@ywang96 ywang96 enabled auto-merge (squash) December 20, 2025 06:56
self.is_neox_style = is_neox_style
self.interleaved = not is_neox_style
self.triton_rotary_emb = None
if is_rocm():
Collaborator

Contributor Author

@tjtanaa tjtanaa Dec 20, 2025

You are right. We can simplify the conditions and avoid is_rocm(), following the implementation on the vLLM side. Initially I added is_rocm() so that other platforms could handle the case themselves; I do not know which libraries the other platforms have, and paths can be missed when authors are not aware of what libraries another platform provides.
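A hypothetical sketch of the capability-based dispatch being discussed here, where the backend choice keys off an advertised capability rather than a hard-coded is_rocm() check (all names are illustrative, not the actual vllm-omni API):

```python
def select_rope_backend(platform: str, has_triton_rope: bool) -> str:
    # Hypothetical helper: any platform that advertises a compatible
    # Triton rope kernel can opt in without this function needing to
    # enumerate platform-specific libraries.
    if has_triton_rope and platform in ("rocm", "cuda"):
        return "triton"
    # Fall back to the reference rotary embedding path otherwise.
    return "reference"

if __name__ == "__main__":
    print(select_rope_backend("rocm", True))   # Triton path when available
    print(select_rope_backend("npu", True))    # falls back to reference
```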

Contributor Author

@ZJY0516 Sorry, I'm not sure I understand your question correctly. This is a new abstraction https://github.com/vllm-project/vllm/blob/ff2168bca3a195b835c64a5c9012d7b6a9f34e61/vllm/model_executor/layers/rotary_embedding/common.py#L122 that was added after vLLM v0.12.0. If we want to adopt this code, I suggest we do so in another PR.

auto-merge was automatically disabled December 20, 2025 19:46

Head branch was pushed to by a user without write access

@DarkLight1337 DarkLight1337 merged commit 13584fd into vllm-project:main Dec 21, 2025
7 checks passed
wtomin pushed a commit to wtomin/vllm-omni that referenced this pull request Dec 22, 2025
yenuo26 pushed a commit to yenuo26/vllm-omni that referenced this pull request Dec 29, 2025
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants