-
Notifications
You must be signed in to change notification settings - Fork 259
Spot vm integration #4830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Spot vm integration #4830
Conversation
3f11b43 to
28cd55d
Compare
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-a3m-preemption.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/tests/ml-a3-megagpu-onspot-slurm-ubuntu.yml
Outdated
Show resolved
Hide resolved
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| set +e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we not existing on error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean exit instead of exist. If we exit on error, it will not loop through regions if gcloud create command fails for one region on line 84.
28cd55d to
3a386a7
Compare
| locals { | ||
| spot_label = { spot = tostring(var.enable_spot_vm) } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for adding this label ?
| IMAGE_PROJECT=$PROJECT_ID | ||
| PROVISIONING_MODEL="SPOT" | ||
| TERMINATION_ACTION="DELETE" | ||
| OPTIONS_GCS_PATH="gs://hpc-ctk1357/options/options.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we storing the zone and region in a GCS file ? Can they be stored in the script ?
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for Spot VMs to A3Mega Slurm clusters, including a new daily test. My review focuses on improving the maintainability and robustness of the new Terraform and Ansible code. I've suggested refactoring repetitive code in Ansible playbooks, improving efficiency in a shell script, and simplifying some logic for better readability. I also found a potential copy-paste error in a build configuration file that could affect test concurrency.
tools/cloud-build/daily-tests/builds/ml-a3-megagpu-onspot-slurm-ubuntu.yaml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/slurm-integration-test.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/slurm-integration-test.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-a3m-preemption.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-a3m-preemption.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-a3m-preemption.yml
Outdated
Show resolved
Hide resolved
83260c7 to
7bafb7b
Compare
7bafb7b to
c0b7ace
Compare
This pull request introduces support for Spot VMs in A3Mega Slurm clusters and adds a new daily test to validate this functionality. It also adds a spot label to instances when Spot VMs are enabled.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.