[CI/Build][ROCm] Enabling LoRA tests on ROCm #7369
Merged: simon-mo merged 32 commits into vllm-project:main from akondrat-amd:lora_test_enablement on Sep 4, 2024
32 commits
31655b7 Modifying test_quant_model.py, AWQ is not supported on ROCm (akondrat-amd)
26d8429 xfailing Gemma test for further investigation (akondrat-amd)
670e3c6 Enabling LoRA tests for AMD in Buildkite (akondrat-amd)
a58f019 Adding reason for Gemma test xfail (akondrat-amd)
39f9bec Fixing MODELS re-definition (akondrat-amd)
d8207ed Removing - csrc/punica for LoRA dependancies (akondrat-amd)
df0efb1 Sorting imports (akondrat-amd)
a4f78e5 Update alignment (akondrat-amd)
91ce9c4 Make yapf happy (akondrat-amd)
32f3a10 Update test_quant_model.py (akondrat-amd)
a103778 Update test_quant_model.py (akondrat-amd)
1d35abb Make yapf(3.11) happy (akondrat-amd)
1d1a86c Removing csrc/punica dependency for LoRA long context test test-pipe… (akondrat-amd)
85f76e9 Exposing single GPU to the container (akondrat-amd)
23017cc Passing Bildkite env vars to container for pytest (akondrat-amd)
d7cde25 Explicitly setting number of parallel jobs(shards in pytest) to 1 (akondrat-amd)
f7794a6 Removing unused arguments in test shell script (akondrat-amd)
d371265 Placing the quotes around the test commands (akondrat-amd)
f853c58 Remove single quotes (akondrat-amd)
fa9f388 Automatically replacing CUDA_VISIBLE_DEVICES with HIP_VISIBLE_DEVICE… (akondrat-amd)
64f12e4 trying to run four instances in parallel (akondrat-amd)
dfb0e4c adding GPU number to container name (akondrat-amd)
27f6e00 Re-enabling LoRA tests (akondrat-amd)
bbe7696 Running four docker processes in parallel (akondrat-amd)
46682e3 Using pipefail option to propagate the error code (akondrat-amd)
b3fc101 Run 8 parallel jobs (akondrat-amd)
dd12ded Fixing pipe operator (akondrat-amd)
6c43c65 Removing comment (akondrat-amd)
5c6cdc3 Merge remote-tracking branch 'upstream/main' into lora_test_enablement (akondrat-amd)
c48c569 Resolving conflict in run-amd-test.sh and merging with main (akondrat-amd)
16ea3cf Update comment in .buildkite/run-amd-test.sh (akondrat-amd)
1de52bd Removed commented out string run-amd-test.sh (akondrat-amd)
This is an incorrect implementation of the sharding. Buildkite should already have started X jobs under the same name. Each run script should just receive the environment variable and pass it along to the command.
The current implementation is trying to run all shards in the same command.
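For reference, the conventional approach described in this comment might look like the sketch below. It assumes the step is declared with Buildkite's `parallelism` setting, so each job receives `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT`, and that the tests accept the `--shard-id` and `--num-shards` options. The `shard_args` helper is a hypothetical name used only for illustration; it is not taken from this PR.

```shell
# Hypothetical helper: turn the variables Buildkite sets for a parallel step
# into the sharding arguments for a single job.
shard_args() {
  # BUILDKITE_PARALLEL_JOB is the zero-based index of this job;
  # BUILDKITE_PARALLEL_JOB_COUNT is the total number of parallel jobs.
  # The defaults let the same command run outside Buildkite as one shard.
  printf -- '--shard-id=%s --num-shards=%s\n' \
    "${BUILDKITE_PARALLEL_JOB:-0}" "${BUILDKITE_PARALLEL_JOB_COUNT:-1}"
}
```

Under this scheme the run script never loops over shards itself: it appends the output of `shard_args` to the test command and runs exactly one shard per Buildkite job.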
In more detail, it looks like they are indeed running in parallel within the same job:
https://buildkite.com/vllm/ci-aws/builds/7784#01919a58-e1eb-48b3-9fd5-872f0328e913
This might break more often than we want?
This invocation is compatible with the general way we launch tests. IMHO, unless there is a problem with the execution of the "payload" tests, we shouldn't be restricted in how we implement the invocation logic.
Our engineering choices are dictated by the specific nature of our HW infrastructure and its initialization/decoupling.
We cannot use the Docker Buildkite plugin, so we have to parallelize the jobs ourselves. Our shell script receives the command with an empty "--shard-id=" argument, so we have to substitute it and run the shards as background jobs while exposing one GPU to each job.
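A minimal sketch of the substitution and background-job scheme described above, not the actual contents of `.buildkite/run-amd-test.sh`. The function names `fill_shard_id` and `run_shards` are hypothetical, and `sh -c` stands in for the container launch, which is assumed to receive the same `HIP_VISIBLE_DEVICES` value.

```shell
# Hypothetical helper: fill the empty "--shard-id=" placeholder in a test
# command with a concrete shard index.
fill_shard_id() {
  printf '%s\n' "$1" | sed "s/--shard-id=/--shard-id=$2/g"
}

# Hypothetical helper: launch one background job per shard, each restricted
# to a single GPU, then wait for all of them and fail if any shard failed.
run_shards() {
  cmd=$1
  num_shards=$2
  pids=""
  status=0
  shard=0
  while [ "$shard" -lt "$num_shards" ]; do
    shard_cmd=$(fill_shard_id "$cmd" "$shard")
    # Stand-in for the container launch: this job only sees GPU number $shard.
    HIP_VISIBLE_DEVICES=$shard sh -c "$shard_cmd" &
    pids="$pids $!"
    shard=$((shard + 1))
  done
  # Collect every exit status so that one failing shard fails the whole run.
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}
```

For example, `run_shards 'pytest -v -s lora --shard-id= --num-shards=8' 8` would start eight jobs, one per GPU; the pytest command line here is illustrative. Waiting on each PID individually is what lets a single failing shard surface as a non-zero exit code for the Buildkite job.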
Sorry, how many GPUs are there on each node? You can run multiple Buildkite agents on the host and pin each one to a GPU using an environment variable. This can drastically accelerate the tests.
My main concern with this approach is that sharding is now implemented differently here, which can cause issues when developers are debugging test failures on AMD devices.
Each node has 8 GPUs. The restart procedure is indiscriminate, though: we restart all GPUs on a given node at once. This strategy has the advantage of completely decoupling one test run from the next. The unfortunate downside is that we can't rely on multiple Buildkite agents running on the same host.
We achieved the current level of HW stability with this approach.
Got it. Let's refine this PR a bit and then we can merge it.