-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
[CI/Build][ROCm] Enabling LoRA tests on ROCm #7369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 30 commits
31655b7
26d8429
670e3c6
a58f019
39f9bec
d8207ed
df0efb1
a4f78e5
91ce9c4
32f3a10
a103778
1d35abb
1d1a86c
85f76e9
23017cc
d7cde25
f7794a6
d371265
f853c58
fa9f388
64f12e4
dfb0e4c
27f6e00
bbe7696
46682e3
b3fc101
dd12ded
6c43c65
5c6cdc3
c48c569
16ea3cf
1de52bd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,6 @@ | ||
| # This script runs test inside the corresponding ROCm docker container. | ||
| set -ex | ||
| #set -ex | ||
| set -o pipefail | ||
|
|
||
| # Print ROCm version | ||
| echo "--- Confirming Clean Initial State" | ||
|
|
@@ -70,16 +71,51 @@ HF_CACHE="$(realpath ~)/huggingface" | |
| mkdir -p ${HF_CACHE} | ||
| HF_MOUNT="/root/.cache/huggingface" | ||
|
|
||
| docker run \ | ||
| commands=$@ | ||
| PARALLEL_JOB_COUNT=8 | ||
| #check if the command contains shard flag | ||
akondrat-amd marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| if [[ $commands == *"--shard-id="* ]]; then | ||
| for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is an incorrect implementation of the sharding. Buildkite should already started X number of jobs under the same name. Each run script should just receive the environment variable, and pass it along to the command. The current implementation is trying to run all shards in the same command
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in more detail, it looks like the they are indeed running in the same job in parallel this might break more often than we wanted?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This invocation is compatible with the general way we launch tests. IMHO unless there is a problem with execution of the "payload" tests, we shouldn't be restricted in the way we implement the invocation logic.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Our engineering choices are dictated by the specific nature of our HW infrastructure and its initialization/decoupling.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We cannot use docker Buildkite plugin, so we have parallelize the jobs ourselves. Our shell script receives the command with empty "--shard-id=" argument, so we have to substitute it and run as background jobs while exposing one GPU to each job.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sorry how many GPUs is there on each node? You can run multiple buildkite agent on the host and pin each to a GPU using environment variable. This can drastically help accelerate the test
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. my main concern with this approach is now the way sharding is handling is implemented differently and can cause issues when developers are debugging the test failures on amd devices.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Each node has 8 GPUs. The restarting procedure is indiscriminate though,- we're restarting all GPUs on a given node at once. This strategy has advantage of complete between-test decoupling. The unfortunate downside is that we can't rely on multiple Buildkite agents running on the same host. We achieved the current level of HW stability with this approach.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. Let's refine this PR a bit and we can merge it in |
||
| #replace shard arguments | ||
| commands=${@//"--shard-id= "/"--shard-id=${GPU} "} | ||
| commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "} | ||
| docker run \ | ||
| --device /dev/kfd --device /dev/dri \ | ||
| --network host \ | ||
| --shm-size=16gb \ | ||
| --rm \ | ||
| -e HIP_VISIBLE_DEVICES=0 \ | ||
| -e HIP_VISIBLE_DEVICES=${GPU} \ | ||
| -e HF_TOKEN \ | ||
| -v ${HF_CACHE}:${HF_MOUNT} \ | ||
| -e HF_HOME=${HF_MOUNT} \ | ||
| --name ${container_name} \ | ||
| --name ${container_name}_${GPU} \ | ||
| ${image_name} \ | ||
| /bin/bash -c "${@}" | ||
|
|
||
| /bin/bash -c "${commands}" \ | ||
| |& while read -r line; do echo ">>Shard $GPU: $line"; done & | ||
| PIDS+=($!) | ||
| done | ||
| #wait for all processes to finish and collect exit codes | ||
| for pid in ${PIDS[@]}; do | ||
| wait ${pid} | ||
| STATUS+=($?) | ||
| done | ||
| for st in ${STATUS[@]}; do | ||
| if [[ ${st} -ne 0 ]]; then | ||
| echo "One of the processes failed with $st" | ||
| exit ${st} | ||
| fi | ||
| done | ||
| else | ||
| docker run \ | ||
| --device /dev/kfd --device /dev/dri \ | ||
| --network host \ | ||
| --shm-size=16gb \ | ||
| --rm \ | ||
| -e HIP_VISIBLE_DEVICES=0 \ | ||
| -e HF_TOKEN \ | ||
| -v ${HF_CACHE}:${HF_MOUNT} \ | ||
| -e HF_HOME=${HF_MOUNT} \ | ||
| --name ${container_name} \ | ||
| ${image_name} \ | ||
| /bin/bash -c "${commands}" | ||
| fi | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove