Skip to content

Commit 3fcc6b6

Browse files
authored
feat(docs): documenation updates (#27)
1 parent 4986caa commit 3fcc6b6

File tree

10 files changed

+513
-60
lines changed

10 files changed

+513
-60
lines changed

README.md

Lines changed: 61 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -4,35 +4,58 @@ Production-ready framework for orchestrating robotics and AI workloads on [Azure
44

55
## 🚀 Features
66

7-
- **Infrastructure as Code**[Terraform modules](deploy/001-iac/) for reproducible deployments
8-
- **Dual Orchestration**[AzureML jobs](workflows/azureml/) and [NVIDIA OSMO workflows](workflows/osmo/) both available
9-
- **Workload Identity** – Key-less authentication via Azure AD federation ([setup](deploy/002-setup/README.md#scenario-2-workload-identity--ngc-production))
10-
- **Private Networking** – Azure services secured on a private VNet with [private endpoints](deploy/001-iac/variables.tf) and private links; optional [VPN gateway](deploy/001-iac/vpn/) for secure remote access; public access configurable when needed
11-
- **MLflow Integration** – Experiment tracking with Azure ML ([details](docs/mlflow-integration.md))
12-
- **GPU Scheduling**[KAI Scheduler](deploy/002-setup/values/kai-scheduler.yaml) for efficient GPU utilization
13-
- **Auto-scaling** – Pay-per-use GPU compute on AKS Spot nodes
7+
| Capability | Description |
8+
|------------|-------------|
9+
| Infrastructure as Code | [Terraform modules](deploy/001-iac/) for reproducible Azure deployments |
10+
| Dual Orchestration | Submit jobs via [AzureML](workflows/azureml/) or [OSMO](workflows/osmo/) |
11+
| Workload Identity | Key-less auth via Azure AD ([setup guide](deploy/002-setup/README.md#scenario-2-workload-identity--ngc)) |
12+
| Private Networking | Services on private VNet with optional [VPN gateway](deploy/001-iac/vpn/) |
13+
| MLflow Integration | Experiment tracking with Azure ML ([details](docs/mlflow-integration.md)) |
14+
| GPU Scheduling | [KAI Scheduler](deploy/002-setup/values/kai-scheduler.yaml) for efficient utilization |
15+
| Auto-scaling | Pay-per-use GPU compute on AKS Spot nodes |
1416

1517
## 🏗️ Architecture
1618

1719
The infrastructure deploys an AKS cluster with GPU node pools running the NVIDIA GPU Operator and KAI Scheduler. Training workloads can be submitted via OSMO workflows (control plane and backend operator) and AzureML jobs (ML extension). Both platforms share common infrastructure: Azure Storage for checkpoints and data, Key Vault for secrets, and Azure Container Registry for container images. OSMO additionally uses PostgreSQL for workflow state and Redis for caching.
1820

19-
**Core Components**:
20-
21-
- **AKS Cluster** – Hosts GPU workloads with Spot node pools for cost optimization
22-
- **NVIDIA GPU Operator** – Manages GPU drivers and device plugins
23-
- **KAI Scheduler** – GPU-aware scheduling for training jobs
24-
- **AzureML Extension** – Enables Azure ML job submission to Kubernetes
25-
- **OSMO Control Plane** – Workflow orchestration (service, router, web-ui)
26-
- **OSMO Backend Operator** – Executes workflows on the cluster
21+
**Azure Infrastructure** (deployed by [Terraform](deploy/001-iac/)):
22+
23+
| Component | Purpose |
24+
|-----------|--------|
25+
| Virtual Network | Private networking with NAT Gateway and DNS Resolver |
26+
| Private Endpoints | Secure access to Azure services (7 endpoints, 11+ DNS zones) |
27+
| AKS Cluster | Kubernetes with GPU Spot node pools and Workload Identity |
28+
| Key Vault | Secrets management with RBAC authorization |
29+
| Azure ML Workspace | Experiment tracking, model registry, compute management |
30+
| Storage Account | Training data, checkpoints, and workflow artifacts |
31+
| Container Registry | Training and OSMO container images |
32+
| Azure Monitor | Log Analytics, Prometheus metrics, Managed Grafana |
33+
| PostgreSQL | OSMO workflow state persistence |
34+
| Redis | OSMO job queue and caching |
35+
| VPN Gateway ⚙️ | Point-to-Site and Site-to-Site connectivity |
36+
37+
**Kubernetes Components** (deployed by [setup scripts](deploy/002-setup/)):
38+
39+
| Component | Purpose |
40+
|-----------|--------|
41+
| NVIDIA GPU Operator | GPU drivers, device plugin, DCGM metrics exporter |
42+
| KAI Scheduler | GPU-aware scheduling with bin-packing |
43+
| AzureML Extension | ML training and inference job submission |
44+
| OSMO Control Plane | Workflow API, router, and web interface |
45+
| OSMO Backend Operator | Workflow execution on cluster |
46+
47+
⚙️ = Optional component
2748

2849
## 🌍 Real World Examples
2950

3051
OSMO orchestration on Azure enables production-scale robotics training across industries:
3152

32-
- **Warehouse AMRs** – Train navigation policies with 1000+ parallel environments on auto-scaling GPU nodes, checkpoint to Azure Storage, track experiments in Azure ML
33-
- **Manufacturing Arms** – Develop manipulation strategies with physics-accurate simulation, leveraging pay-per-use GPU compute and global Azure regions
34-
- **Legged Robots** – Optimize locomotion policies with MLflow experiment tracking for sim-to-real transfer
35-
- **Collaborative Robots** – Create safe interaction policies with Azure Monitor logging for compliance auditing
53+
| Use Case | Training Scenario |
54+
|----------|-------------------|
55+
| Warehouse AMRs | Navigation policies with 1000+ parallel environments, checkpointing to Azure Storage |
56+
| Manufacturing Arms | Manipulation strategies with physics-accurate simulation on pay-per-use GPU |
57+
| Legged Robots | Locomotion optimization with MLflow tracking for sim-to-real transfer |
58+
| Collaborative Robots | Safe interaction policies with Azure Monitor logging for compliance |
3659

3760
## 📋 Prerequisites
3861

@@ -63,19 +86,23 @@ OSMO orchestration on Azure enables production-scale robotics training across in
6386
### 1. Deploy Infrastructure
6487

6588
```bash
66-
# Login to Azure CLI (required for Terraform and cluster configuration)
67-
source deploy/000-prerequisites/az-sub-init.sh # --tenant <tenant-id> (Optionally) Specify tenant
89+
# Set subscription for Terraform
90+
source deploy/000-prerequisites/az-sub-init.sh
6891

6992
cd deploy/001-iac
70-
cp terraform.tfvars.example terraform.tfvars # Edit with your values
71-
terraform init && terraform apply
7293

73-
# Optional: Deploy VPN for secure access to private resources
74-
cd vpn
94+
# Create terraform.tfvars with your values
95+
cat > terraform.tfvars << 'EOF'
96+
environment = "dev"
97+
resource_prefix = "robotst" # Your prefix (3-8 chars)
98+
location = "eastus2" # Azure region with GPU quota
99+
EOF
100+
75101
terraform init && terraform apply
76-
# Download VPN client config from Azure Portal > Virtual Network Gateway > Point-to-site configuration
77102
```
78103

104+
For optional VPN deployment and additional configuration, see [deploy/001-iac/README.md](deploy/001-iac/README.md).
105+
79106
### 2. Configure Cluster
80107

81108
```bash
@@ -173,9 +200,14 @@ See [002-setup/README.md](deploy/002-setup/README.md) for detailed instructions.
173200

174201
## 📖 Documentation
175202

176-
- [002-setup README](deploy/002-setup/README.md) – Cluster setup and deployment scenarios
177-
- [Workflows README](workflows/README.md) – Job and workflow templates
178-
- [MLflow Integration](docs/mlflow-integration.md) – Experiment tracking setup
203+
| Guide | Description |
204+
|-------|-------------|
205+
| [Deploy Overview](deploy/README.md) | Deployment order and quick path |
206+
| [Infrastructure](deploy/001-iac/README.md) | Terraform configuration and modules |
207+
| [Cluster Setup](deploy/002-setup/README.md) | Scripts and deployment scenarios |
208+
| [Scripts](scripts/README.md) | Training and validation submission |
209+
| [Workflows](workflows/README.md) | Job and workflow templates |
210+
| [MLflow Integration](docs/mlflow-integration.md) | Experiment tracking setup |
179211

180212
## 🪪 License
181213

deploy/000-prerequisites/README.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
# Prerequisites
2+
3+
Azure CLI initialization to set the subscription ID for Terraform deployments.
4+
5+
## Scripts
6+
7+
| Script | Purpose |
8+
|--------|---------|
9+
| `az-sub-init.sh` | Azure login and `ARM_SUBSCRIPTION_ID` export |
10+
11+
## Usage
12+
13+
Source the script to set `ARM_SUBSCRIPTION_ID` for Terraform:
14+
15+
```bash
16+
source az-sub-init.sh
17+
```
18+
19+
For a specific tenant:
20+
21+
```bash
22+
source az-sub-init.sh --tenant your-tenant.onmicrosoft.com
23+
```
24+
25+
## What It Does
26+
27+
1. Checks for existing Azure CLI session
28+
2. Prompts for login if needed (optionally with tenant)
29+
3. Exports `ARM_SUBSCRIPTION_ID` to current shell
30+
31+
The subscription ID is required by Terraform's Azure provider when not running in a managed identity context.
32+
33+
## Next Step
34+
35+
After initialization, proceed to [001-iac](../001-iac/) to deploy infrastructure.

deploy/000-prerequisites/mek-generation.sh

Lines changed: 0 additions & 27 deletions
This file was deleted.

deploy/001-iac/README.md

Lines changed: 132 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,132 @@
1+
# Infrastructure as Code
2+
3+
Terraform configuration for the robotics reference architecture. Deploys Azure resources including AKS with GPU node pools, Azure ML workspace, storage, and OSMO backend services (PostgreSQL, Redis).
4+
5+
## Prerequisites
6+
7+
- Azure CLI authenticated (`az login`)
8+
- Terraform 1.5+ (`terraform version`)
9+
- GPU VM quota in target region (e.g., `Standard_NV36ads_A10_v5`)
10+
- Subscription initialized (`source ../000-prerequisites/az-sub-init.sh`)
11+
12+
## Quick Start
13+
14+
```bash
15+
cd deploy/001-iac
16+
17+
# Initialize subscription
18+
source ../000-prerequisites/az-sub-init.sh
19+
20+
# Configure (edit values as needed)
21+
cp terraform.tfvars.example terraform.tfvars
22+
23+
# Deploy
24+
terraform init && terraform apply
25+
```
26+
27+
## Configuration
28+
29+
Key variables in `terraform.tfvars`:
30+
31+
| Variable | Description | Default |
32+
|----------|-------------|---------|
33+
| `environment` | Deployment environment | - |
34+
| `resource_prefix` | Resource naming prefix | - |
35+
| `location` | Azure region | - |
36+
| `node_pools.gpu.vm_size` | GPU VM SKU | `Standard_NV36ads_A10_v5` |
37+
| `should_deploy_postgresql` | Deploy PostgreSQL for OSMO | `true` |
38+
| `should_deploy_redis` | Deploy Redis for OSMO | `true` |
39+
40+
See [variables.tf](variables.tf) for all configuration options.
41+
42+
### OSMO Workload Identity
43+
44+
Enable managed identity for OSMO services (recommended for production):
45+
46+
```hcl
47+
osmo_config = {
48+
should_enable_identity = true
49+
should_federate_identity = true
50+
control_plane_namespace = "osmo-control-plane"
51+
operator_namespace = "osmo-operator"
52+
workflows_namespace = "osmo-workflows"
53+
}
54+
```
55+
56+
## Modules
57+
58+
| Module | Purpose |
59+
|--------|---------|
60+
| [platform](modules/platform/) | Networking, storage, Key Vault, ML workspace, PostgreSQL, Redis |
61+
| [sil](modules/sil/) | AKS cluster with GPU node pools and AzureML extension |
62+
| [vpn](modules/vpn/) | VPN Gateway module (used by vpn/ standalone deployment) |
63+
64+
## Outputs
65+
66+
```bash
67+
# View all outputs
68+
terraform output
69+
70+
# Get AKS cluster name
71+
terraform output -json aks_cluster | jq -r '.name'
72+
73+
# OSMO connection details (PostgreSQL, Redis)
74+
terraform output postgresql_connection_info
75+
terraform output managed_redis_connection_info
76+
```
77+
78+
## Optional Components
79+
80+
These standalone deployments extend the base infrastructure.
81+
82+
### VPN Gateway
83+
84+
Point-to-Site VPN for secure remote access to private endpoints:
85+
86+
```bash
87+
cd vpn
88+
cp terraform.tfvars.example terraform.tfvars
89+
terraform init && terraform apply
90+
```
91+
92+
See [vpn/README.md](vpn/README.md) for client setup and AAD authentication.
93+
94+
### Private DNS for OSMO UI
95+
96+
Configure DNS resolution for the OSMO UI LoadBalancer (requires VPN):
97+
98+
```bash
99+
cd dns
100+
terraform apply -var="osmo_loadbalancer_ip=10.0.x.x"
101+
```
102+
103+
See [dns/README.md](dns/README.md) for details.
104+
105+
### Automation Account
106+
107+
Azure Automation resources for scheduled operations:
108+
109+
```bash
110+
cd automation
111+
terraform init && terraform apply
112+
```
113+
114+
See [automation/README.md](automation/README.md) for runbook configuration.
115+
116+
## Directory Structure
117+
118+
```
119+
001-iac/
120+
├── main.tf # Module composition
121+
├── variables.tf # Input variables
122+
├── outputs.tf # Output values
123+
├── versions.tf # Provider versions
124+
├── terraform.tfvars # Configuration (gitignored)
125+
├── modules/
126+
│ ├── platform/ # Shared Azure services
127+
│ ├── sil/ # AKS + ML extension
128+
│ └── vpn/ # VPN Gateway module
129+
├── vpn/ # Standalone VPN deployment
130+
├── dns/ # OSMO UI DNS configuration
131+
└── automation/ # Automation account
132+
```
Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# Automation
2+
3+
Azure Automation account for scheduled infrastructure operations.
4+
5+
## Purpose
6+
7+
Runs scheduled PowerShell runbooks to manage infrastructure resources, such as starting PostgreSQL and AKS at the beginning of business hours to reduce costs.
8+
9+
## Prerequisites
10+
11+
- Platform infrastructure deployed (`cd .. && terraform apply`)
12+
- Core variables matching parent deployment
13+
14+
## Usage
15+
16+
```bash
17+
cd deploy/001-iac/automation
18+
19+
# Configure schedule and resources
20+
# Edit terraform.tfvars with your schedule
21+
22+
terraform init && terraform apply
23+
```
24+
25+
## Configuration
26+
27+
Example `terraform.tfvars`:
28+
29+
```hcl
30+
environment = "dev"
31+
location = "westus3"
32+
resource_prefix = "rob"
33+
instance = "001"
34+
35+
should_start_postgresql = true
36+
37+
schedule_config = {
38+
start_time = "13:00"
39+
week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
40+
timezone = "UTC"
41+
}
42+
```
43+
44+
## Resources Created
45+
46+
- Azure Automation Account with system-assigned managed identity
47+
- PowerShell 7.2 runbook for starting resources
48+
- Weekly schedule with configurable days and start time
49+
- Role assignments for AKS and PostgreSQL management
50+
51+
## Related
52+
53+
- [Parent README](../README.md) - Main infrastructure documentation

0 commit comments

Comments
 (0)