Implement ciflow/rocm on Torchtitan #2114
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| "ciflow/8gpu": | ||
| - .ci/docker/** | ||
| - .github/workflows/** | ||
| - scripts/** | ||
| - tests/** | ||
| - torchtitan/** | ||
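
The labeler rules above map changed paths to the `ciflow/8gpu` label. As a rough local approximation (not the labeler's own matcher, which uses its own glob implementation), the sketch below checks whether a single changed path falls under one of the listed directories. The function name `needs_8gpu_label` and the sample paths are hypothetical and only serve as illustration.

```shell
# Approximates the labeler globs with shell `case` patterns.
# Assumption: every `dir/**` rule means "any file at any depth under dir/".
needs_8gpu_label() {
  case "$1" in
    .ci/docker/*|.github/workflows/*|scripts/*|tests/*|torchtitan/*)
      return 0 ;;  # path is covered by one of the five rules
    *)
      return 1 ;;  # path is outside every listed directory
  esac
}

# Illustrative paths only; they are not taken from the PR.
for path in torchtitan/train.py .github/workflows/set-matrix.yaml README.md; do
  if needs_8gpu_label "$path"; then
    echo "$path: would get ciflow/8gpu"
  else
    echo "$path: no label"
  fi
done
```

In a shell `case` pattern, `*` also matches `/`, which is why a single `*` stands in for `**` here.
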
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| ciflow_push_tags: | ||
| - ciflow/8gpu | ||
|
||
| labeler_config: labeler.yml | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -3,9 +3,12 @@ name: 8 GPU Feature Tests | ||
| on: | ||
| push: | ||
| branches: [ main ] | ||
| tags: | ||
| - ciflow/8gpu/* | ||
|
Comment on lines +6 to +7
Contributor: What's this for -- why do we need both
Collaborator (Author): As I understand it, PR workflows and tag workflows are completely independent. Tags provide a CI flow: a tag can be pushed to trigger CI runs on a specific commit even after the PR is closed. Tags can also be used for versioning releases. |
||
| paths-ignore: | ||
| - 'torchtitan/experiments/**' | ||
| pull_request: | ||
| types: [opened, synchronize, reopened, labeled, unlabeled] | ||
|
||
| paths-ignore: | ||
| - 'torchtitan/experiments/**' | ||
| schedule: | ||
| @@ -27,33 +30,7 @@ permissions: | ||
| jobs: | ||
| # Step 1: Dynamically compute the matrix based on conditions | ||
| set-matrix: | ||
| runs-on: ubuntu-latest | ||
| outputs: | ||
| matrix: ${{ steps.set.outputs.matrix }} | ||
| steps: | ||
| - id: set | ||
| run: | | ||
| # Decide which matrix entries to include based on event type | ||
| if [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/main" ]] || [[ "${{ github.event_name }}" == "schedule" ]]; then | ||
| # Include both CUDA and ROCm | ||
| echo '{"include":[ | ||
| {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"}, | ||
| {"name":"rocm","runner":"linux.rocm.gpu.gfx942.8","gpu-arch-type":"rocm","gpu-arch-version":"7.0","docker-image":"torchtitan-rocm-ubuntu-22.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/rocm7.0"} | ||
| ]}' > matrix.json | ||
| else | ||
| # Include only CUDA | ||
| echo '{"include":[ | ||
| {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"} | ||
| ]}' > matrix.json | ||
| fi | ||
|
|
||
| # Export matrix to job outputs | ||
| { | ||
| echo 'matrix<<EOF' | ||
| cat matrix.json | ||
| echo 'EOF' | ||
| } >> $GITHUB_OUTPUT | ||
|
|
||
| uses: ./.github/workflows/set-matrix.yaml | ||
|
|
||
| # Step 2: Use the dynamic matrix in the build-test job | ||
| build-test: | ||
|
|
||
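
The `tags: - ciflow/8gpu/*` trigger added in this workflow can be checked locally with a small sketch. It assumes the documented GitHub Actions filter behavior that `*` does not match `/`, and it takes the full ref (as in `github.ref`) rather than the bare tag name. The function name `matches_8gpu_tag_filter` and the tag suffix `2114` are hypothetical, chosen only for illustration.

```shell
# Returns 0 when a full ref would pass the `ciflow/8gpu/*` tag filter.
matches_8gpu_tag_filter() {
  case "$1" in
    refs/tags/ciflow/8gpu/*/*) return 1 ;;  # extra segment: '*' does not cross '/'
    refs/tags/ciflow/8gpu/*)   return 0 ;;  # exactly one segment after the prefix
    *)                         return 1 ;;  # branches and unrelated tags
  esac
}

# Illustrative refs only.
for ref in refs/tags/ciflow/8gpu/2114 refs/heads/main refs/tags/v0.1.0; do
  if matches_8gpu_tag_filter "$ref"; then
    echo "$ref: triggers the workflow"
  else
    echo "$ref: ignored by the tag filter"
  fi
done

# Pushing such a tag by hand would look like this (assumed naming, not run here):
#   git tag ciflow/8gpu/2114 <commit>
#   git push origin ciflow/8gpu/2114
```

The more specific pattern is listed first because a shell `case` takes the first matching branch, and its `*` would otherwise also accept nested segments.
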
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| name: Set Matrix | ||
|
|
||
| on: | ||
| workflow_call: | ||
| outputs: | ||
| matrix: | ||
| description: dynamically set matrix | ||
| value: ${{ jobs.set.outputs.matrix }} | ||
|
|
||
| jobs: | ||
| set: | ||
| runs-on: ubuntu-latest | ||
| outputs: | ||
| matrix: ${{ steps.set.outputs.matrix }} | ||
| env: | ||
| # Event flags evaluated by github actions before the step runs: | ||
| IS_MAIN_PUSH: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }} | ||
| IS_SCHEDULE: ${{ github.event_name == 'schedule' }} | ||
| IS_PR: ${{ github.event_name == 'pull_request' }} | ||
| HAS_8GPU_LABEL: ${{ github.event_name == 'pull_request' && contains(github.event.pull_request.labels.*.name, 'ciflow/8gpu') }} | ||
| IS_8GPU_TAG: ${{ startsWith(github.ref, 'refs/tags/ciflow/8gpu/') }} | ||
|
Contributor: why do we need this, if we already have the one above?
Collaborator (Author): Tags and labels are independent, different events, so I created a separate variable for the tag. |
||
| TRIGGERED_8GPU_LABEL: ${{ github.event_name == 'pull_request' && github.event.action == 'labeled' }} | ||
|
|
||
| steps: | ||
| - id: set | ||
| run: | | ||
| # Define ROCm matrix | ||
| ROCM_MATRIX='{ | ||
| "name": "rocm", | ||
| "runner": "linux.rocm.gpu.gfx942.8", | ||
| "gpu-arch-type": "rocm", | ||
| "gpu-arch-version": "7.0", | ||
| "docker-image": "torchtitan-rocm-ubuntu-22.04-clang12", | ||
| "index-url": "https://download.pytorch.org/whl/nightly/rocm7.0" | ||
| }' | ||
| # Define CUDA matrix | ||
| CUDA_MATRIX='{ | ||
| "name": "cuda", | ||
| "runner": "linux.g5.48xlarge.nvidia.gpu", | ||
| "gpu-arch-type": "cuda", | ||
| "gpu-arch-version": "12.6", | ||
| "docker-image": "torchtitan-ubuntu-20.04-clang12", | ||
| "index-url": "https://download.pytorch.org/whl/nightly/cu126" | ||
| }' | ||
| # Use default value as 'false' for unset environment variables | ||
| IS_MAIN_PUSH="${IS_MAIN_PUSH:-false}" | ||
| IS_SCHEDULE="${IS_SCHEDULE:-false}" | ||
| IS_PR="${IS_PR:-false}" | ||
| HAS_8GPU_LABEL="${HAS_8GPU_LABEL:-false}" | ||
| IS_8GPU_TAG="${IS_8GPU_TAG:-false}" | ||
| TRIGGERED_8GPU_LABEL="${TRIGGERED_8GPU_LABEL:-false}" | ||
| # Decide which matrix entries to include based on event type | ||
| # Runs ROCm only for push tag OR when PR label gets triggered | ||
| if [[ "$IS_8GPU_TAG" == "true" || "$TRIGGERED_8GPU_LABEL" == "true" ]]; then | ||
| cat > matrix.json <<JSON | ||
| {"include": [$ROCM_MATRIX]} | ||
| JSON | ||
| # Runs CUDA and ROCm for normal PR (if PR label is present) OR for push to main, cron schedule | ||
| elif [[ ("$HAS_8GPU_LABEL" == "true" && "$IS_PR" == "true") || ("$IS_MAIN_PUSH" == "true" || "$IS_SCHEDULE" == "true") ]]; then | ||
| cat > matrix.json <<JSON | ||
| {"include": [$CUDA_MATRIX,$ROCM_MATRIX]} | ||
| JSON | ||
| # Runs CUDA as default (includes normal PR, if PR label is NOT present) | ||
| else | ||
| cat > matrix.json <<JSON | ||
| {"include": [$CUDA_MATRIX]} | ||
| JSON | ||
| fi | ||
| # Export matrix to job outputs | ||
| { | ||
| echo 'matrix<<EOF' | ||
| cat matrix.json | ||
| echo 'EOF' | ||
| } >> $GITHUB_OUTPUT | ||
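
The branch order in the step above decides which backends run for each event. The sketch below reproduces only that decision, reading the same environment flags and printing backend names instead of writing `matrix.json`. The function name `select_backends` is hypothetical; the flag names and the conditions are copied from the diff.

```shell
# Prints the backends the step would put in the matrix for the current flags.
# Unset flags default to "false", as in the workflow step.
select_backends() {
  if [ "${IS_8GPU_TAG:-false}" = "true" ] || [ "${TRIGGERED_8GPU_LABEL:-false}" = "true" ]; then
    echo "rocm"        # tag push, or a PR `labeled` event
  elif { [ "${HAS_8GPU_LABEL:-false}" = "true" ] && [ "${IS_PR:-false}" = "true" ]; } \
      || [ "${IS_MAIN_PUSH:-false}" = "true" ] || [ "${IS_SCHEDULE:-false}" = "true" ]; then
    echo "cuda rocm"   # labeled PR, push to main, or cron schedule
  else
    echo "cuda"        # default, including PRs without the label
  fi
}

# Each call runs in a command-substitution subshell, so flags do not leak.
echo "tag push:               $(IS_8GPU_TAG=true select_backends)"
echo "labeled event:          $(IS_PR=true HAS_8GPU_LABEL=true TRIGGERED_8GPU_LABEL=true select_backends)"
echo "labeled PR, new commit: $(IS_PR=true HAS_8GPU_LABEL=true select_backends)"
echo "unlabeled PR:           $(IS_PR=true select_backends)"
```

Because the first branch wins, a `labeled` event selects ROCm only, even when the PR also carries the label. As written in the diff, `TRIGGERED_8GPU_LABEL` is derived from `github.event.action == 'labeled'` alone, so it does not by itself check which label was added.
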
Comment: what are these for?
Reply: If a PR modifies any of these locations, GitHub automatically adds the label.