feat(deploy): multi-node GPU support with dynamic OSMO pool configuration#410

Merged
agreaves-ms merged 10 commits into main from feat/multi-node
Feb 27, 2026
Conversation

@agreaves-ms
Collaborator

feat(deploy): multi-node GPU support with dynamic OSMO pool configuration

This PR enables mixed GPU node pools in the AKS cluster, refactors OSMO deployment from static single-pool configuration to dynamic multi-pool generation from Terraform state, and modernizes infrastructure connectivity patterns. Changes span Terraform infrastructure, deployment scripts, training/inference Python code, and OSMO workflow definitions across 67 files.

Description

Infrastructure — Multi-GPU Node Pool Support

Mixed GPU clusters (e.g., RTX PRO 6000 vGPU + H100 datacenter) require distinct driver strategies, node labels, and tolerations per pool. These changes enable first-class support for heterogeneous GPU configurations.

  • Added node labels and upgrade settings to AKS node pool definitions, enabling nvidia.com/gpu.deploy.driver=false for vGPU nodes that use pre-installed Microsoft GRID drivers
  • Introduced a GRID driver installer DaemonSet (gpu-grid-driver-installer.yaml) targeting labeled vGPU nodes with Microsoft GRID driver v580.105.08-grid-azure
  • Propagated new node_pools Terraform output exposing VM size, taints, priority, and labels for downstream OSMO pool generation
  • Added optional Microsoft Defender for Containers integration via should_enable_microsoft_defender boolean
  • Fixed eviction_policy to apply only to Spot pools — Azure rejects this parameter on Regular priority pools

Infrastructure — PostgreSQL & Networking

  • Refactored PostgreSQL connectivity from VNet integration (delegated subnet) to private endpoint, enabling cross-region deployments with configurable postgresql_location
    • Removed azurerm_subnet.postgresql and associated DNS zone link
    • Added azurerm_private_endpoint.postgresql with managed DNS zone group
  • Added lifecycle ignore rule for NAT gateway IP tags to prevent Terraform drift
  • Added 90-minute timeouts for Redis CRUD operations to prevent context cancellation during long-running Azure operations

Deployment — Dynamic Multi-Pool Configuration

  • Refactored 04-deploy-osmo-backend.sh to dynamically generate pool configurations from Terraform state, replacing static single-pool setup
    • Reads node_pools Terraform output and generates per-pool OSMO platform entries with auto-computed Kubernetes tolerations from node taints
    • Creates a shared "default" pool pointing to DEFAULT_POOL for simplified workflow submission
    • Supports per-pool JSON overrides from config/overrides/
  • Unified two workflow configuration templates (workflow-config-access-keys and workflow-config-workload-identity) into a single workflow-config.template.json with conditional post-processing
  • Added new platform template (platform-template-config.template.json) for pod spec and pool platform generation
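The taint-to-toleration mapping described above can be illustrated with a short Python sketch. The actual implementation in `04-deploy-osmo-backend.sh` is bash/jq over `terraform output`; the function and field names below are hypothetical, chosen only to show the shape of the transformation:

```python
def taint_to_toleration(taint: str) -> dict:
    """Convert an AKS node taint string such as 'sku=gpu:NoSchedule'
    into the equivalent Kubernetes toleration object."""
    key_value, effect = taint.rsplit(":", 1)
    if "=" in key_value:
        key, value = key_value.split("=", 1)
        return {"key": key, "operator": "Equal", "value": value, "effect": effect}
    # Taint with no value: tolerate any value for this key.
    return {"key": key_value, "operator": "Exists", "effect": effect}


def pool_platform_entry(name: str, node_pool: dict) -> dict:
    """Build one OSMO platform entry from a single item of the
    node_pools Terraform output (hypothetical schema)."""
    return {
        "name": name,
        "vm_size": node_pool["vm_size"],
        "node_selector": node_pool.get("labels", {}),
        "tolerations": [taint_to_toleration(t) for t in node_pool.get("taints", [])],
    }
```

A pool with taint `sku=gpu:NoSchedule` thus yields a toleration with `operator: Equal`, while a value-less taint falls back to `operator: Exists`.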

Deployment — OSMO Control Plane & Prerelease Support

  • Added preflight version validation enforcing tested chart/image version combinations
  • Introduced NGC authentication for pre-release OSMO images from nvcr.io, with conditional pull secret creation
  • Added automated OSMO login and user setup via osmo_login_and_setup() after service deployment
  • Fixed ACR Helm OCI paths from helm/osmo to helm/service and helm/web-ui
  • Updated GPU Operator from v24.9.1 → v25.3.4 and OSMO chart from 1.0.0 → 1.0.1
  • Added .env.local auto-loading for per-developer configuration overrides without repository modifications
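The preflight version validation above enforces an allowlist of tested chart/image pairs. A minimal Python sketch of that idea follows; the real check lives in the bash deployment scripts, and the version pairs shown here are placeholders, not the actual tested matrix:

```python
# Hypothetical allowlist of (chart_version, image_version) pairs that
# have been validated together; real values live in the deploy scripts.
TESTED_COMBINATIONS = {
    ("1.0.1", "1.0.1"),
}


def validate_versions(chart_version: str, image_version: str) -> None:
    """Fail fast before deployment if the chart/image pair is untested."""
    if (chart_version, image_version) not in TESTED_COMBINATIONS:
        raise SystemExit(
            f"Untested combination: chart {chart_version} "
            f"with image {image_version}"
        )
```

Failing fast here avoids discovering an incompatible chart/image pairing only after Helm has already rolled out the release.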

Training & Inference — Shutdown & Logging

  • Added simulation_shutdown.py implementing the Isaac Sim 4.x vGPU shutdown hang workaround — disables stop-handle, unsubscribes timeline callback, forks a 30-second SIGKILL watchdog
  • Added stream.py with AnsiStrippingStream for container-friendly log output — strips ANSI escape codes and converts carriage returns to newlines
  • Integrated prepare_for_shutdown() and install_ansi_stripping() across all training and inference scripts, replacing simulation_app.close() with direct os._exit(0) calls
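The two mechanisms above can be sketched briefly in Python. This is an illustration of the described approach, not the repository's exact code: the regex covers common CSI escape sequences only, and the fork-based watchdog assumes a POSIX platform.

```python
import os
import re
import signal
import time

# Matches CSI escape sequences such as "\x1b[31m" (a common subset of ANSI).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


class AnsiStrippingStream:
    """Wrap a stream so writes are container-log friendly: ANSI escapes
    are removed and carriage returns become newlines."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def write(self, text: str) -> int:
        cleaned = ANSI_RE.sub("", text).replace("\r\n", "\n").replace("\r", "\n")
        return self._wrapped.write(cleaned)

    def flush(self) -> None:
        self._wrapped.flush()


def fork_kill_watchdog(timeout_s: int = 30) -> None:
    """Fork a child that SIGKILLs the parent if it has not exited in time,
    guarding against the Isaac Sim vGPU shutdown hang."""
    parent = os.getpid()
    if os.fork() == 0:  # child process
        time.sleep(timeout_s)
        try:
            os.kill(parent, signal.SIGKILL)
        finally:
            os._exit(0)
```

Typical usage would be `sys.stdout = AnsiStrippingStream(sys.stdout)` at process start, and `fork_kill_watchdog()` immediately before attempting shutdown so a hung render loop cannot keep the container alive.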

Workflows & Version Updates

  • Added NVIDIA_DRIVER_CAPABILITIES=all to all 6 OSMO workflow templates for Vulkan rendering support required by Isaac Sim
  • Updated default Isaac Lab container image from 2.2.0 → 2.3.2
  • Increased workflow resource allocations to match full GPU node capacity (cpu: 120, memory: 480Gi, storage: 200Gi)
  • Added new skrl-test.yaml diagnostic workflow and submit-skrl-test.sh submission script for minimal training validation

Developer Experience

  • Added --use-local-osmo flag to all 8 OSMO submission scripts and both cleanup scripts for local OSMO CLI development via Bazel-built source
  • Added osmo-dev.sh wrapper script for building OSMO CLI from local repository checkout
  • Enhanced GPU Operator Helm values for mixed GPU pool strategy with MIG single-instance mode (RTX PRO 6000 requirement)
  • Added disabled oauth2Proxy and authz sidecar configurations to OSMO Helm values, deferring enablement to OIDC setup

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • deploy/000-prerequisites - Azure subscription setup
  • deploy/001-iac - Terraform infrastructure
  • deploy/002-setup - OSMO control plane / Helm
  • deploy/004-workflow - Training workflows
  • src/training - Python training scripts
  • scripts/ - Scripts for kicking off training
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

Notes

  • PostgreSQL connectivity refactor is a breaking change for existing deployments — removes the delegated subnet entirely in favor of private endpoints
  • OSMO sidecar features (oauth2Proxy, authz) are disabled by default pending OIDC provider configuration
  • Backend-test-runner CronJob disabled — NVCR image not available in ACR without NGC authentication setup
  • simulation_shutdown.py appears as a net addition in the cumulative diff despite a commit message referencing its removal; the file is present in the final branch state

Follow-up Tasks

  • Configure per-pool overrides in config/overrides/ for production GPU node pools
  • Enable oauth2Proxy and authz sidecars after OIDC provider setup
  • Set up NVCR authentication pipeline for backend-test-runner CronJob
  • Validate PostgreSQL private endpoint migration path for existing deployments

- add RTX PRO and H100 GPU platforms to configuration
- update default platform and GPU instance type variables
- remove redundant access key ID from workflow identity configuration

🔧 - Generated by Copilot
- introduce .env.local.example for local environment configuration
- update deployment scripts to include --use-local-osmo option
- enhance README with instructions for using local OSMO CLI
- add osmo-dev.sh script for running OSMO from local source

🔧 - Generated by Copilot
The NVIDIA Container Runtime defaults to utility,compute capabilities,
which excludes the graphics (Vulkan/OpenGL) libraries that Isaac Sim
requires for its rendering subsystem and clean shutdown. Without Vulkan,
vkCreateDevice fails and simulation_app.close() hangs indefinitely in
Kit's C++ render loop.

- Add NVIDIA_DRIVER_CAPABILITIES=all to all 6 OSMO workflow templates
- Document the requirement in docs/gpu-configuration.md with evidence
  from vulkaninfo on the RTX PRO 6000 DC-4-96Q vGPU profile
- Remove simulation_shutdown.py workaround module (no longer needed)
- Clean up all training/inference scripts to use direct close() calls
- add AnsiStrippingStream to clean ANSI codes from output
- integrate install_ansi_stripping in training scripts
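The capability requirement explained in this commit message can be guarded at startup with a small check. This helper is hypothetical (not part of the PR); it encodes the fact that the NVIDIA Container Runtime defaults to `utility,compute`, which omits the graphics libraries:

```python
import os


def has_graphics_capability(env=os.environ) -> bool:
    """Return True if the NVIDIA Container Runtime will expose the
    graphics (Vulkan/OpenGL) driver libraries. The runtime default of
    'utility,compute' is insufficient for Isaac Sim rendering."""
    caps = env.get("NVIDIA_DRIVER_CAPABILITIES", "utility,compute")
    parts = {c.strip() for c in caps.split(",")}
    return "all" in parts or "graphics" in parts
```

A training entrypoint could call this before constructing the simulation app and exit with a clear message instead of hanging in `vkCreateDevice`.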

🔧 - Generated by Copilot
…lans

- detail organization of deploy, src, workflows, and scripts
- describe infrastructure, data pipeline, management, and evaluation strategies
- outline synthetic data generation and deployment processes

📄 - Generated by Copilot
…ation management

- refactor preflight checks for version validation
- simplify OSMO login process with new function
- add support for per-pool configuration overrides
- introduce new workflow configuration template
- remove deprecated access key workflow templates

🔧 - Generated by Copilot
@github-actions

github-actions bot commented Feb 26, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

… in training scripts

- include resource parameters in submit-osmo-dataset-training.sh
- update submit-osmo-training.sh with resource options
- enhance submit-skrl-test.sh to accept resource configurations
- modify YAML workflows to utilize dynamic resource values

🔧 - Generated by Copilot
…e app launcher variable

- delete unused SKRL training test script
- rename app launcher variable for clarity
- clean up imports in monitoring and play scripts

🔧 - Generated by Copilot
Member

@WilliamBerryiii WilliamBerryiii left a comment

One question ... but otherwise this looks fine. I'm going to be changing a bunch of this with subsequent PRs around SPDX headers and footers, and frontmatter conformance. So I think we can get this in.

@agreaves-ms agreaves-ms merged commit 6c98f05 into main Feb 27, 2026
15 checks passed
@agreaves-ms agreaves-ms deleted the feat/multi-node branch February 27, 2026 01:19
@WilliamBerryiii WilliamBerryiii added this to the v0.4.0 milestone Mar 1, 2026