feat(deploy): multi-node GPU support with dynamic OSMO pool configuration #410
agreaves-ms merged 10 commits into main on Feb 27, 2026
- add RTX PRO and H100 GPU platforms to configuration
- update default platform and GPU instance type variables
- remove redundant access key ID from workflow identity configuration
🔧 - Generated by Copilot

- introduce .env.local.example for local environment configuration
- update deployment scripts to include --use-local-osmo option
- enhance README with instructions for using local OSMO CLI
- add osmo-dev.sh script for running OSMO from local source
🔧 - Generated by Copilot
The NVIDIA Container Runtime defaults to utility,compute capabilities, which excludes the graphics (Vulkan/OpenGL) libraries that Isaac Sim requires for its rendering subsystem and clean shutdown. Without Vulkan, vkCreateDevice fails and simulation_app.close() hangs indefinitely in Kit's C++ render loop.
- Add NVIDIA_DRIVER_CAPABILITIES=all to all 6 OSMO workflow templates
- Document the requirement in docs/gpu-configuration.md with evidence from vulkaninfo on the RTX PRO 6000 DC-4-96Q vGPU profile
- Remove simulation_shutdown.py workaround module (no longer needed)
- Clean up all training/inference scripts to use direct close() calls
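The capability requirement in this commit message could also be guarded by a start-up check inside the container. The following is a minimal sketch, not code from this PR: the helper names `graphics_enabled` and `require_graphics` are hypothetical, and it assumes only that `NVIDIA_DRIVER_CAPABILITIES` holds either `all` or a comma-separated capability list, with `utility,compute` as the default when unset (as stated above).

```python
import os


def graphics_enabled(capabilities):
    """Return True if an NVIDIA_DRIVER_CAPABILITIES value exposes graphics libraries."""
    # An unset or empty value is treated as the runtime default: utility,compute.
    raw = capabilities or "utility,compute"
    caps = {c.strip().lower() for c in raw.split(",") if c.strip()}
    return "all" in caps or "graphics" in caps


def require_graphics(env=os.environ):
    """Hypothetical preflight check: fail fast instead of hanging later in close()."""
    value = env.get("NVIDIA_DRIVER_CAPABILITIES")
    if not graphics_enabled(value):
        raise RuntimeError(
            f"NVIDIA_DRIVER_CAPABILITIES={value!r} does not include graphics; "
            "Vulkan device creation is expected to fail"
        )
```

Calling `require_graphics()` before launching the simulation app would surface a misconfigured workflow template as an immediate error rather than as a hang at shutdown.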
- add AnsiStrippingStream to clean ANSI codes from output
- integrate install_ansi_stripping in training scripts
🔧 - Generated by Copilot
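For readers unfamiliar with the technique, a stream wrapper of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PR's actual implementation: only the names `AnsiStrippingStream` and `install_ansi_stripping` come from the PR, while the regular expression, the delegation through `__getattr__`, and the idempotence check are assumptions. The carriage-return handling follows the PR description, which says carriage returns are converted to newlines.

```python
import io
import re
import sys

# Matches CSI escape sequences such as "\x1b[31m" (colour) or "\x1b[2K" (erase line).
_ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


class AnsiStrippingStream:
    """Wrap a text stream, removing ANSI codes and turning "\\r" into "\\n"."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def write(self, text):
        cleaned = _ANSI_CSI.sub("", text)
        # Normalise CRLF first so it does not become two newlines.
        cleaned = cleaned.replace("\r\n", "\n").replace("\r", "\n")
        self._wrapped.write(cleaned)
        return len(text)

    def flush(self):
        self._wrapped.flush()

    def __getattr__(self, name):
        # Delegate everything else (isatty, fileno, encoding, ...) to the wrapped stream.
        return getattr(self._wrapped, name)


def install_ansi_stripping():
    """Replace sys.stdout and sys.stderr with stripping wrappers (safe to call twice)."""
    if not isinstance(sys.stdout, AnsiStrippingStream):
        sys.stdout = AnsiStrippingStream(sys.stdout)
    if not isinstance(sys.stderr, AnsiStrippingStream):
        sys.stderr = AnsiStrippingStream(sys.stderr)
```

One limitation of this sketch is that an escape sequence split across two `write()` calls would not be removed; a production version might buffer partial sequences.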
…lans
- detail organization of deploy, src, workflows, and scripts
- describe infrastructure, data pipeline, management, and evaluation strategies
- outline synthetic data generation and deployment processes
📄 - Generated by Copilot

…ation management
- refactor preflight checks for version validation
- simplify OSMO login process with new function
- add support for per-pool configuration overrides
- introduce new workflow configuration template
- remove deprecated access key workflow templates
🔧 - Generated by Copilot
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
… in training scripts
- include resource parameters in submit-osmo-dataset-training.sh
- update submit-osmo-training.sh with resource options
- enhance submit-skrl-test.sh to accept resource configurations
- modify YAML workflows to utilize dynamic resource values
🔧 - Generated by Copilot

…e app launcher variable
- delete unused SKRL training test script
- rename app launcher variable for clarity
- clean up imports in monitoring and play scripts
🔧 - Generated by Copilot
WilliamBerryiii (Member) approved these changes on Feb 27, 2026 and left a comment:
One question... but otherwise this looks fine. I'm gonna be changing a bunch of this with subsequent PRs around SPDX headers and footers, and frontmatter conformance, so I think we can get this in.
feat(deploy): multi-node GPU support with dynamic OSMO pool configuration
This PR enables mixed GPU node pools in the AKS cluster, refactors OSMO deployment from static single-pool configuration to dynamic multi-pool generation from Terraform state, and modernizes infrastructure connectivity patterns. Changes span Terraform infrastructure, deployment scripts, training/inference Python code, and OSMO workflow definitions across 67 files.
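To illustrate the dynamic multi-pool generation mentioned above, the sketch below derives Kubernetes tolerations from node taints for each pool. It is an assumption-laden illustration, not the PR's code: the PR does this in a shell deployment script, while this example uses Python; the function names, the `vm_size` and `node_taints` field names, and the shape of the output are all hypothetical. The only facts relied on are the `key=value:Effect` taint string convention and the standard toleration fields (`key`, `operator`, `value`, `effect`).

```python
def taint_to_toleration(taint):
    """Convert a "key=value:Effect" or "key:Effect" taint string into a toleration."""
    key_value, sep, effect = taint.rpartition(":")
    if not sep or not effect:
        raise ValueError(f"taint has no effect: {taint!r}")
    key, eq, value = key_value.partition("=")
    if eq:
        # A taint with a value is matched exactly.
        return {"key": key, "operator": "Equal", "value": value, "effect": effect}
    # A taint without a value is matched on key presence only.
    return {"key": key, "operator": "Exists", "effect": effect}


def pools_to_platforms(node_pools):
    """Build one hypothetical platform entry per node pool (field names are assumed)."""
    platforms = {}
    for name, pool in node_pools.items():
        platforms[name] = {
            "vm_size": pool["vm_size"],
            "tolerations": [
                taint_to_toleration(t) for t in pool.get("node_taints", [])
            ],
        }
    return platforms
```

Deriving tolerations from the taints already recorded in Terraform state avoids maintaining a second, hand-written list that could drift from the cluster's actual node pool configuration.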
Description
Infrastructure: Multi-GPU Node Pool Support
- `nvidia.com/gpu.deploy.driver=false` for vGPU nodes that use pre-installed Microsoft GRID drivers
- `node_pools` Terraform output exposing VM size, taints, priority, and labels for downstream OSMO pool generation
- `should_enable_microsoft_defender` boolean
- `eviction_policy` to apply only to Spot pools; Azure rejects this parameter on Regular priority pools
Infrastructure: PostgreSQL & Networking
- `postgresql_location`
- `azurerm_subnet.postgresql` and associated DNS zone link
- `azurerm_private_endpoint.postgresql` with managed DNS zone group
Deployment: Dynamic Multi-Pool Configuration
- `04-deploy-osmo-backend.sh` to dynamically generate pool configurations from Terraform state, replacing static single-pool setup
- `node_pools` Terraform output and generates per-pool OSMO platform entries with auto-computed Kubernetes tolerations from node taints
- `DEFAULT_POOL` for simplified workflow submission
Deployment: OSMO Control Plane & Prerelease Support
- `osmo_login_and_setup()` after service deployment
- `helm/osmo` to `helm/service` and `helm/web-ui`
- `.env.local` auto-loading for per-developer configuration overrides without repository modifications
Training & Inference: Shutdown & Logging
- `AnsiStrippingStream` for container-friendly log output: strips ANSI escape codes and converts carriage returns to newlines
- `prepare_for_shutdown()` and `install_ansi_stripping()` across all training and inference scripts, replacing `simulation_app.close()` with direct `os._exit(0)` calls
Workflows & Version Updates
- `NVIDIA_DRIVER_CAPABILITIES=all` to all 6 OSMO workflow templates for Vulkan rendering support required by Isaac Sim
Developer Experience
- `--use-local-osmo` flag to all 8 OSMO submission scripts and both cleanup scripts for local OSMO CLI development via Bazel-built source
Type of Change
Component(s) Affected
- `deploy/000-prerequisites` - Azure subscription setup
- `deploy/001-iac` - Terraform infrastructure
- `deploy/002-setup` - OSMO control plane / Helm
- `deploy/004-workflow` - Training workflows
- `src/training` - Python training scripts
- `scripts/` - Scripts for kicking off training
- `docs/` - Documentation
Testing Performed
- `plan` reviewed (no unexpected changes)
- `apply` tested in dev environment
- (`smoke_test_azure.py`)
Documentation Impact
Bug Fix Checklist
Complete this section for bug fix PRs. Skip for other contribution types.
Checklist
Notes
- `simulation_shutdown.py` appears as a net addition in the cumulative diff despite a commit message referencing its removal; the file is present in the final branch state
Follow-up Tasks