Skip to content
This repository was archived by the owner on Mar 11, 2026. It is now read-only.

Commit 7baf903

Browse files
authored
docs: enhance README with architecture diagram and deployment documentation (#33)
1 parent 864006b commit 7baf903

4 files changed

Lines changed: 104 additions & 12 deletions

File tree

README.md

Lines changed: 74 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,56 @@ Production-ready framework for orchestrating robotics and AI workloads on [Azure
1818

1919
The infrastructure deploys an AKS cluster with GPU node pools running the NVIDIA GPU Operator and KAI Scheduler. Training workloads can be submitted via OSMO workflows (control plane and backend operator) and AzureML jobs (ML extension). Both platforms share common infrastructure: Azure Storage for checkpoints and data, Key Vault for secrets, and Azure Container Registry for container images. OSMO additionally uses PostgreSQL for workflow state and Redis for caching.
2020

21+
```text
22+
+==========================================================================+
23+
| Resource Group |
24+
| |
25+
| :--- Virtual Network (10.0.0.0/16) ---------------------------------: |
26+
| : : |
27+
| : +-------------+ +------------------+ +---------------+ : |
28+
| : | NAT Gateway |---->| AKS Cluster |<--->| ACR | : |
29+
| : +------+------+ +--------+---------+ +-------+-------+ : |
30+
| : | | | : |
31+
| : v v | : |
32+
| : +-------------+ +------------------+ | : |
33+
| : | GPU Node | | AzureML Extension|-------------+ : |
34+
| : | Pool (A10) | | KAI Scheduler | : |
35+
| : +-------------+ | GPU Operator | : |
36+
| : | OSMO Backend | : |
37+
| : +------------------+ : |
38+
| : | : |
39+
| : +------------------+ | +------------------+ : |
40+
| : | PostgreSQL |<-------+------->| Azure Redis | : |
41+
| : | Flexible Server | | (Enterprise) | : |
42+
| : +------------------+ +------------------+ : |
43+
| : : |
44+
| : +-- Private Endpoint Subnet --------------------------------+ : |
45+
| : | PE-KeyVault PE-Storage PE-ACR PE-AzureML PE-Monitor | : |
46+
| : +-----------------------------------------------------------+ : |
47+
| : : |
48+
| :-------------------------------------------------------------------: |
49+
| |
50+
| +------------------+ +------------------+ |
51+
| | Key Vault | | Storage Account | |
52+
| | (RBAC-enabled) | | - ml-workspace | |
53+
| +------------------+ | - osmo | |
54+
| | - datasets | |
55+
| +------------------+ |
56+
| |
57+
| +------------------+ +------------------+ |
58+
| | AzureML Workspace|------->| Log Analytics | |
59+
| | + App Insights | | + Grafana | |
60+
| +------------------+ | + Monitor WS | |
61+
| +------------------+ |
62+
| |
63+
| +------------------+ +------------------+ |
64+
| | Managed Identity | | Managed Identity | |
65+
| | (ML Workloads) | | (OSMO Workloads) | |
66+
| +------------------+ +------------------+ |
67+
| |
68+
+==========================================================================+
69+
```
70+
2171
**Azure Infrastructure** (deployed by [Terraform](deploy/001-iac/)):
2272

2373
| Component | Purpose |
@@ -67,21 +117,19 @@ OSMO orchestration on Azure enables production-scale robotics training across in
67117
| Tool | Version | Installation |
68118
|------|---------|--------------|
69119
| [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) | 2.50+ | `brew install azure-cli` |
70-
| [Terraform](https://www.terraform.io/downloads) | 1.5+ | `brew install terraform` |
120+
| [Terraform](https://www.terraform.io/downloads) | 1.9.8+ | `brew install terraform` |
71121
| [kubectl](https://kubernetes.io/docs/tasks/tools/) | 1.28+ | `brew install kubectl` |
72122
| [Helm](https://helm.sh/docs/intro/install/) | 3.x | `brew install helm` |
73123
| [jq](https://stedolan.github.io/jq/) | latest | `brew install jq` |
74124
| [OSMO CLI](https://developer.nvidia.com/osmo) | latest | See NVIDIA docs |
75125

76126
### Azure Requirements
77127

78-
- Azure subscription with **Contributor** access
128+
- Azure subscription with **Contributor** + **Role Based Access Control Administrator**
129+
- Scope: Subscription (if creating new resource group) or Resource Group (if using existing)
130+
- Terraform creates role assignments for managed identities
131+
- Alternative: **Owner** (grants more permissions than required)
79132
- GPU VM quota for your target region (e.g., `Standard_NV36ads_A10_v5`)
80-
- Permissions to create: Resource Groups, AKS, Storage, Key Vault, AzureML Workspace
81-
82-
### NVIDIA Requirements
83-
84-
- [NVIDIA Developer](https://developer.nvidia.com/) account with OSMO access
85133

86134
## 🏃 Quick Start
87135

@@ -202,6 +250,25 @@ See [002-setup/README.md](deploy/002-setup/README.md) for detailed instructions.
202250
| [Workflows](workflows/README.md) | Job and workflow templates |
203251
| [MLflow Integration](docs/mlflow-integration.md) | Experiment tracking setup |
204252

253+
## 💰 Cost Estimation
254+
255+
Use the [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator/) to estimate costs. Add these services based on the architecture:
256+
257+
| Service | Configuration | Notes |
258+
|---------|---------------|-------|
259+
| Azure Kubernetes Service (AKS) | System pool: Standard_D4s_v3 (3 nodes) | Always-on control plane |
260+
| Virtual Machines (Spot) | Standard_NV36ads_A10_v5 or NC-series | GPU nodes scale to zero when idle |
261+
| Azure Database for PostgreSQL | Flexible Server, Burstable B1ms | OSMO workflow state |
262+
| Azure Cache for Redis | Basic C0 or Standard C1 | OSMO job queue |
263+
| Azure Machine Learning | Basic workspace | No additional compute costs (uses AKS) |
264+
| Storage Account | Standard LRS, ~100GB | Checkpoints and datasets |
265+
| Container Registry | Basic or Standard | Image storage |
266+
| Log Analytics | ~5GB/day ingestion | Monitoring data |
267+
| Azure Managed Grafana | Essential tier | Dashboards (optional) |
268+
| VPN Gateway | VpnGw1 | Point-to-site access (optional) |
269+
270+
GPU Spot VMs provide significant savings (60-90%) compared to on-demand pricing. Actual costs depend on training frequency, job duration, and data volumes.
271+
205272
## 🪪 License
206273

207274
MIT License. See [LICENSE.md](LICENSE.md).

deploy/001-iac/README.md

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,20 @@ Terraform configuration for the robotics reference architecture. Deploys Azure r
1010
| Terraform | 1.5+ | `terraform version` |
1111
| GPU VM quota | Region-specific | e.g., `Standard_NV36ads_A10_v5` |
1212

13+
### Azure RBAC Permissions
14+
15+
| Role | Scope |
16+
|------|-------|
17+
| Contributor | Subscription (new RG) or Resource Group (existing RG) |
18+
| Role Based Access Control Administrator | Subscription (new RG) or Resource Group (existing RG) |
19+
20+
Terraform creates role assignments for managed identities, requiring `Microsoft.Authorization/roleAssignments/write` permission. The Contributor role explicitly blocks this action; the RBAC Administrator role provides it.
21+
22+
> [!NOTE]
23+
> Use subscription scope if creating a new resource group (`should_create_resource_group = true`). Use resource group scope if the resource group already exists.
24+
25+
**Alternative**: Owner role (grants more permissions than required).
26+
1327
## 🚀 Quick Start
1428

1529
```bash
@@ -260,7 +274,7 @@ Issues and resolutions encountered during infrastructure deployment and teardown
260274

261275
### Destroy Takes a Long Time
262276

263-
Terraform destroy removes resources in dependency order. Private Endpoints, AKS clusters, and PostgreSQL servers commonly take 10-15 minutes each.
277+
Terraform destroy removes resources in dependency order. Private Endpoints, AKS clusters, and PostgreSQL servers commonly take 5-10 minutes each. Full destruction typically takes 20-30 minutes.
264278

265279
Monitor remaining resources during destruction:
266280

deploy/002-setup/README.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,17 @@ AKS cluster configuration for robotics workloads with AzureML and NVIDIA OSMO.
99
- kubectl, Helm 3.x, jq installed
1010
- OSMO CLI (`osmo`) for backend deployment
1111

12+
### Azure RBAC Permissions
13+
14+
For least-privilege access:
15+
16+
| Role | Scope | Purpose |
17+
|------|-------|---------|
18+
| Azure Kubernetes Service Cluster User Role | AKS Cluster | Get cluster credentials |
19+
| Contributor | Resource Group | Extension and FIC creation |
20+
| Key Vault Secrets User | Key Vault | Read PostgreSQL/Redis credentials |
21+
| Storage Blob Data Contributor | Storage Account | Create workflow containers |
22+
1223
## 🚀 Quick Start
1324

1425
```bash

deploy/README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,8 @@ Infrastructure deployment and cluster configuration for the robotics reference a
77
| Step | Folder | Description | Time |
88
|:----:|--------|-------------|------|
99
| 1 | [000-prerequisites](000-prerequisites/) | Azure CLI login, subscription setup | 2 min |
10-
| 2 | [001-iac](001-iac/) | Terraform: AKS, ML workspace, storage, PostgreSQL, Redis | 15-20 min |
11-
| 3 | [002-setup](002-setup/) | Cluster config: GPU Operator, OSMO, AzureML extension | 10-15 min |
10+
| 2 | [001-iac](001-iac/) | Terraform: AKS, ML workspace, storage, PostgreSQL, Redis | 30-40 min |
11+
| 3 | [002-setup](002-setup/) | Cluster config: GPU Operator, OSMO, AzureML extension | 30 min |
1212

1313
## 🚀 Quick Path
1414

@@ -50,8 +50,8 @@ Remove deployed components in reverse order. Cluster components must be removed
5050

5151
| Step | Folder | Description | Time |
5252
|:----:|--------|-------------|------|
53-
| 1 | [002-setup/cleanup](002-setup/cleanup/) | Uninstall Helm charts, extensions, namespaces | 5-10 min |
54-
| 2 | [001-iac](001-iac/) | Terraform destroy or resource group deletion | 10-15 min |
53+
| 1 | [002-setup/cleanup](002-setup/cleanup/) | Uninstall Helm charts, extensions, namespaces | 10-15 min |
54+
| 2 | [001-iac](001-iac/) | Terraform destroy or resource group deletion | 20-30 min |
5555

5656
### Partial Cleanup (Cluster Components Only)
5757

0 commit comments

Comments
 (0)