Conversation

Contributor

@simrankaurb simrankaurb commented Nov 10, 2025

This pull request introduces support for Spot VMs in A3Mega Slurm clusters and adds a new daily test to validate this functionality. It also adds a spot label to instances when Spot VMs are enabled.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@simrankaurb simrankaurb force-pushed the spot_vm_integration branch 10 times, most recently from 3f11b43 to 28cd55d November 12, 2025 13:08
@simrankaurb simrankaurb marked this pull request as ready for review November 12, 2025 13:09
@simrankaurb simrankaurb requested review from a team and samskillman as code owners November 12, 2025 13:09
@simrankaurb simrankaurb added the release-chore label (To not include into release notes) Nov 12, 2025
@simrankaurb simrankaurb added the release-module-improvements label (Added to release notes under the "Module Improvements" heading.) and removed the release-chore label (To not include into release notes) Nov 12, 2025
# See the License for the specific language governing permissions and
# limitations under the License.

set +e
Contributor

why are we not existing on error?

Contributor Author

I think you mean "exit" rather than "exist". If we exit on error, the script stops at the first failure instead of moving on to the next region when the gcloud create command (line 84) fails for one region.
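
The retry-across-regions behaviour described above can be sketched roughly as follows. This is only an illustration, not the script from this PR: `create_instance` is a hypothetical stand-in for the real gcloud create call, the function name `create_in_first_available_region` is invented here, and the region list is an assumed example.

```shell
#!/bin/bash
# Illustrative sketch only; names and regions below are assumptions.
set +e  # a failed create attempt must not abort the whole script

# Hypothetical stand-in for the real `gcloud ... create` call.
# It "succeeds" only in us-east4 so the loop can be exercised offline.
create_instance() {
  [ "$1" = "us-east4" ]
}

# Try each region in order; print the first region where creation
# succeeds and return 0, or return 1 if every region fails.
create_in_first_available_region() {
  local region
  for region in "$@"; do
    if create_instance "$region"; then
      echo "$region"
      return 0
    fi
    echo "create failed in ${region}, trying next region" >&2
  done
  return 1
}

CREATED_REGION=$(create_in_first_available_region us-central1 us-west1 us-east4 europe-west4)
echo "created in: ${CREATED_REGION}"
```

One design note: a command used as the condition of an `if` is exempt from `set -e`, so a loop written this way would also keep iterating with `set -e` left on. Whether that suits the actual script depends on how the create call at line 84 is invoked.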

@LAVEEN LAVEEN requested a review from cboneti November 12, 2025 13:27
locals {
  spot_label = { spot = tostring(var.enable_spot_vm) }
}

Collaborator

What is the rationale for adding this label?

IMAGE_PROJECT=$PROJECT_ID
PROVISIONING_MODEL="SPOT"
TERMINATION_ACTION="DELETE"
OPTIONS_GCS_PATH="gs://hpc-ctk1357/options/options.txt"
Collaborator

Why are we storing the zone and region in a GCS file? Can they be stored in the script instead?

@bytetwin
Collaborator

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for Spot VMs to A3Mega Slurm clusters, including a new daily test. My review focuses on improving the maintainability and robustness of the new Terraform and Ansible code. I've suggested refactoring repetitive code in Ansible playbooks, improving efficiency in a shell script, and simplifying some logic for better readability. I also found a potential copy-paste error in a build configuration file that could affect test concurrency.

@LAVEEN LAVEEN requested a review from bytetwin November 12, 2025 13:48
@simrankaurb simrankaurb marked this pull request as draft November 13, 2025 12:04
@simrankaurb simrankaurb force-pushed the spot_vm_integration branch 2 times, most recently from 83260c7 to 7bafb7b November 18, 2025 07:45
Labels

release-module-improvements Added to release notes under the "Module Improvements" heading.

3 participants