README.md (61 additions & 29 deletions)
@@ -4,35 +4,58 @@ Production-ready framework for orchestrating robotics and AI workloads on [Azure

 ## 🚀 Features

-- **Infrastructure as Code** – [Terraform modules](deploy/001-iac/) for reproducible deployments
-- **Dual Orchestration** – [AzureML jobs](workflows/azureml/) and [NVIDIA OSMO workflows](workflows/osmo/) both available
-- **Workload Identity** – Key-less authentication via Azure AD federation ([setup](deploy/002-setup/README.md#scenario-2-workload-identity--ngc-production))
-- **Private Networking** – Azure services secured on a private VNet with [private endpoints](deploy/001-iac/variables.tf) and private links; optional [VPN gateway](deploy/001-iac/vpn/) for secure remote access; public access configurable when needed
-- **MLflow Integration** – Experiment tracking with Azure ML ([details](docs/mlflow-integration.md))
-- **GPU Scheduling** – [KAI Scheduler](deploy/002-setup/values/kai-scheduler.yaml) for efficient GPU utilization
-- **Auto-scaling** – Pay-per-use GPU compute on AKS Spot nodes
+| Capability | Description |
+|------------|-------------|
+| Infrastructure as Code | [Terraform modules](deploy/001-iac/) for reproducible Azure deployments |
+| Dual Orchestration | Submit jobs via [AzureML](workflows/azureml/) or [OSMO](workflows/osmo/) |
+| Workload Identity | Key-less auth via Azure AD ([setup guide](deploy/002-setup/README.md#scenario-2-workload-identity--ngc)) |
+| Private Networking | Services on private VNet with optional [VPN gateway](deploy/001-iac/vpn/) |
+| MLflow Integration | Experiment tracking with Azure ML ([details](docs/mlflow-integration.md)) |
+| GPU Scheduling | [KAI Scheduler](deploy/002-setup/values/kai-scheduler.yaml) for efficient utilization |
+| Auto-scaling | Pay-per-use GPU compute on AKS Spot nodes |

 ## 🏗️ Architecture

 The infrastructure deploys an AKS cluster with GPU node pools running the NVIDIA GPU Operator and KAI Scheduler. Training workloads can be submitted via OSMO workflows (control plane and backend operator) and AzureML jobs (ML extension). Both platforms share common infrastructure: Azure Storage for checkpoints and data, Key Vault for secrets, and Azure Container Registry for container images. OSMO additionally uses PostgreSQL for workflow state and Redis for caching.

-**Core Components**:
-
-- **AKS Cluster** – Hosts GPU workloads with Spot node pools for cost optimization
-- **NVIDIA GPU Operator** – Manages GPU drivers and device plugins
-- **KAI Scheduler** – GPU-aware scheduling for training jobs
-- **AzureML Extension** – Enables Azure ML job submission to Kubernetes
-- **OSMO Control Plane** – Workflow orchestration (service, router, web-ui)
-- **OSMO Backend Operator** – Executes workflows on the cluster
+**Azure Infrastructure** (deployed by [Terraform](deploy/001-iac/)):
+
+| Component | Purpose |
+|-----------|--------|
+| Virtual Network | Private networking with NAT Gateway and DNS Resolver |
+| Private Endpoints | Secure access to Azure services (7 endpoints, 11+ DNS zones) |
+| AKS Cluster | Kubernetes with GPU Spot node pools and Workload Identity |
+| Key Vault | Secrets management with RBAC authorization |
+| Azure ML Workspace | Experiment tracking, model registry, compute management |
+| Storage Account | Training data, checkpoints, and workflow artifacts |
+| Container Registry | Training and OSMO container images |
+| KAI Scheduler | GPU-aware scheduling with bin-packing |
+| AzureML Extension | ML training and inference job submission |
+| OSMO Control Plane | Workflow API, router, and web interface |
+| OSMO Backend Operator | Workflow execution on cluster |
+
+⚙️ = Optional component

 ## 🌍 Real World Examples

 OSMO orchestration on Azure enables production-scale robotics training across industries:

-- **Warehouse AMRs** – Train navigation policies with 1000+ parallel environments on auto-scaling GPU nodes, checkpoint to Azure Storage, track experiments in Azure ML
-- **Manufacturing Arms** – Develop manipulation strategies with physics-accurate simulation, leveraging pay-per-use GPU compute and global Azure regions
-- **Legged Robots** – Optimize locomotion policies with MLflow experiment tracking for sim-to-real transfer
-- **Collaborative Robots** – Create safe interaction policies with Azure Monitor logging for compliance auditing
+| Use Case | Training Scenario |
+|----------|-------------------|
+| Warehouse AMRs | Navigation policies with 1000+ parallel environments, checkpointing to Azure Storage |
+| Manufacturing Arms | Manipulation strategies with physics-accurate simulation on pay-per-use GPU |
+| Legged Robots | Locomotion optimization with MLflow tracking for sim-to-real transfer |
+| Collaborative Robots | Safe interaction policies with Azure Monitor logging for compliance |

 ## 📋 Prerequisites

@@ -63,19 +86,23 @@ OSMO orchestration on Azure enables production-scale robotics training across in
 ### 1. Deploy Infrastructure

 ```bash
-# Login to Azure CLI (required for Terraform and cluster configuration)
 Terraform configuration for the robotics reference architecture. Deploys Azure resources including AKS with GPU node pools, Azure ML workspace, storage, and OSMO backend services (PostgreSQL, Redis).
+
+## Prerequisites
+
+- Azure CLI authenticated (`az login`)
+- Terraform 1.5+ (`terraform version`)
+- GPU VM quota in target region (e.g., `Standard_NV36ads_A10_v5`)
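
The three prerequisites listed above could be checked before running Terraform with a small script along the following lines. This is a minimal sketch, not part of the change itself: the function names `version_ge` and `check_prereqs` are hypothetical, and filtering the quota table on the string `A10` is only an assumption about how the quota family is labelled in the `az vm list-usage` output.

```bash
# Hypothetical helper: succeeds when version $1 >= version $2 (major.minor.patch).
version_ge() {
  _have=${1%%-*}   # drop any pre-release suffix such as -beta1
  _want=${2%%-*}
  _i=0
  while [ "$_i" -lt 3 ]; do
    _h=${_have%%.*}
    _w=${_want%%.*}
    if [ "${_h:-0}" -gt "${_w:-0}" ]; then return 0; fi
    if [ "${_h:-0}" -lt "${_w:-0}" ]; then return 1; fi
    # Move on to the next component; treat missing components as 0.
    case $_have in *.*) _have=${_have#*.} ;; *) _have=0 ;; esac
    case $_want in *.*) _want=${_want#*.} ;; *) _want=0 ;; esac
    _i=$((_i + 1))
  done
  return 0
}

# Hypothetical helper: checks the prerequisites for one Azure region.
check_prereqs() {
  _location=$1
  if [ -z "$_location" ]; then
    echo "usage: check_prereqs <azure-region>" >&2
    return 2
  fi
  # 1. Azure CLI authenticated
  if ! az account show --output none 2>/dev/null; then
    echo "Azure CLI not authenticated; run: az login" >&2
    return 1
  fi
  # 2. Terraform 1.5+
  if ! command -v terraform >/dev/null 2>&1; then
    echo "terraform not found on PATH" >&2
    return 1
  fi
  _tf=$(terraform version | head -n 1)   # e.g. "Terraform v1.5.7"
  _tf=${_tf#Terraform v}
  if ! version_ge "$_tf" 1.5.0; then
    echo "Terraform 1.5+ required, found $_tf" >&2
    return 1
  fi
  # 3. GPU quota: print matching rows for manual review (the 'A10' filter is an assumption)
  az vm list-usage --location "$_location" --output table | grep -i 'A10' \
    || echo "No A10 quota rows listed for $_location" >&2
}
```

After sourcing the snippet, `check_prereqs eastus` (with whichever region is targeted) runs all three checks. The quota step only prints rows for review rather than enforcing a threshold, since the required vCPU count depends on the node pool size.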
 Azure Automation account for scheduled infrastructure operations.
+
+## Purpose
+
+Runs scheduled PowerShell runbooks to manage infrastructure resources, such as starting PostgreSQL and AKS at the beginning of business hours to reduce costs.
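
For illustration, the start-of-day operation described above can be approximated from a shell with the Azure CLI. This is a sketch under stated assumptions and not the PowerShell runbook the change adds: the function name `start_resources` and the resource names are hypothetical, and the PostgreSQL instance is assumed to be an Azure Database for PostgreSQL Flexible Server.

```bash
# Hypothetical helper: starts PostgreSQL, then AKS, in one resource group.
# ASSUMPTION: PostgreSQL is a Flexible Server; all names passed in are placeholders.
start_resources() {
  _rg=$1
  _aks=$2
  _pg=$3
  # When DRY_RUN is set to a non-empty value, print the commands instead of running them.
  _run=${DRY_RUN:+echo}
  $_run az postgres flexible-server start --resource-group "$_rg" --name "$_pg"
  $_run az aks start --resource-group "$_rg" --name "$_aks"
}
```

Setting `DRY_RUN=1` before calling `start_resources my-rg my-aks my-pg` prints the two `az` commands without touching any resources, which is a cheap way to review what a schedule would do.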