| 1 | +--- |
| 2 | +title: "Quickstart: Clone to First Training Job" |
| 3 | +description: Deploy infrastructure and submit your first robotics training job in 8 steps |
| 4 | +author: Microsoft Robotics-AI Team |
| 5 | +ms.date: 2026-02-22 |
| 6 | +ms.topic: tutorial |
| 7 | +keywords: |
| 8 | + - quickstart |
| 9 | + - deployment |
| 10 | + - training |
| 11 | + - tutorial |
| 12 | +--- |
| 13 | + |
| 14 | +Deploy the full Azure NVIDIA Robotics stack and submit a training job in ~1.5-2 hours. This guide uses full-public networking and Access Keys authentication for the simplest path. |
| 15 | + |
| 16 | +> [!NOTE] |
| 17 | +> This guide expands on the [Getting Started hub](README.md). |
| 18 | + |
| 19 | +## Prerequisites |
| 20 | + |
| 21 | +| Requirement | Details | |
| 22 | +| --- | --- | |
| 23 | +| Azure subscription | Contributor + User Access Administrator roles | |
| 24 | +| GPU quota | `Standard_NC24ads_A100_v4` in target region | |
| 25 | +| NVIDIA NGC account | Sign up at <https://ngc.nvidia.com/> for API key | |
| 26 | +| Development environment | Devcontainer (recommended) or local tools | |
| 27 | + |
| 28 | +See [Prerequisites](../contributing/prerequisites.md) for installation commands and version requirements. |
| 29 | + |
| 30 | +## Step 1: Clone and Set Up Environment |
| 31 | + |
| 32 | +Clone the repository and initialize the development environment. |
| 33 | + |
| 34 | +```bash |
| 35 | +git clone https://github.com/Azure-Samples/azure-nvidia-robotics-reference-architecture.git |
| 36 | +cd azure-nvidia-robotics-reference-architecture |
| 37 | +``` |
| 38 | + |
| 39 | +Use the devcontainer (recommended) or run local setup: |
| 40 | + |
| 41 | +```bash |
| 42 | +./setup-dev.sh |
| 43 | +``` |
| 44 | + |
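
Before continuing, it can help to confirm that the command-line tools used in the remaining steps are on your `PATH`. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the tool list (`git`, `az`, `terraform`, `kubectl`) is inferred from the commands in this guide rather than from the prerequisites page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given CLI that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools invoked later in this guide (assumed list).
missing="$(missing_tools git az terraform kubectl)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

If the check reports nothing, every listed tool was found. It does not verify versions.
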
| 45 | +## Step 2: Configure Azure Subscription |
| 46 | + |
| 47 | +Authenticate with Azure and register required resource providers. |
| 48 | + |
| 49 | +```bash |
| 50 | +source deploy/000-prerequisites/az-sub-init.sh |
| 51 | +bash deploy/000-prerequisites/register-azure-providers.sh |
| 52 | +``` |
| 53 | + |
| 54 | +Verify your subscription: |
| 55 | + |
| 56 | +```bash |
| 57 | +az account show --query "{name:name, id:id}" -o table |
| 58 | +``` |
| 59 | + |
| 60 | +## Step 3: Configure Terraform Variables |
| 61 | + |
| 62 | +Create a Terraform variables file for the full-public deployment path. From the repository root: |
| 63 | + |
| 64 | +```bash |
| 65 | +cd deploy/001-iac |
| 66 | +cp terraform.tfvars.example terraform.tfvars |
| 67 | +``` |
| 68 | + |
| 69 | +Edit `terraform.tfvars` with these values: |
| 70 | + |
| 71 | +```hcl |
| 72 | +project_name = "robotics" |
| 73 | +environment = "dev" |
| 74 | +location = "eastus" |
| 75 | +gpu_vm_size = "Standard_NC24ads_A100_v4" |
| 76 | + |
| 77 | +enable_azure_ml = true |
| 78 | +enable_osmo = true |
| 79 | +enable_vpn_gateway = false |
| 80 | +enable_private_dns = false |
| 81 | +``` |
| 82 | + |
| 83 | +> [!TIP] |
| 84 | +> For private networking, set `enable_vpn_gateway = true` and `enable_private_dns = true`. See the [Infrastructure Guide](../../deploy/001-iac/README.md) for details. |
| 85 | + |
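
A quick check that the edited file still defines every setting listed above can catch typos before `terraform plan`. This is a minimal sketch under stated assumptions: `check_tfvars` is a hypothetical helper, and it only tests that each key appears as an assignment, not that its value is valid.

```bash
# Hypothetical helper: report any expected setting missing from a tfvars file.
# Returns non-zero if at least one key is absent.
check_tfvars() {
  local file="$1" key status=0
  for key in project_name environment location gpu_vm_size \
             enable_azure_ml enable_osmo enable_vpn_gateway enable_private_dns; do
    # Match "key = value" at the start of a line, allowing leading whitespace.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
      echo "missing: ${key}"
      status=1
    fi
  done
  return "$status"
}

# Usage (from deploy/001-iac):
#   check_tfvars terraform.tfvars && echo "all expected keys present"
```
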
| 86 | +## Step 4: Deploy Infrastructure |
| 87 | + |
| 88 | +Initialize and apply the Terraform configuration from `deploy/001-iac`, the directory you entered in Step 3. This step takes ~30-40 minutes. |
| 89 | + |
| 90 | +```bash |
| 91 | +terraform init |
| 92 | +terraform plan -out=tfplan |
| 93 | +terraform apply tfplan |
| 94 | +``` |
| 95 | + |
| 96 | +Verify deployment: |
| 97 | + |
| 98 | +```bash |
| 99 | +terraform output |
| 100 | +``` |
| 101 | + |
| 102 | +Connect to the AKS cluster: |
| 103 | + |
| 104 | +```bash |
| 105 | +az aks get-credentials \ |
| 106 | + --resource-group "$(terraform output -raw resource_group_name)" \ |
| 107 | + --name "$(terraform output -raw aks_cluster_name)" |
| 108 | +``` |
| 109 | + |
| 110 | +## Step 5: Configure AKS Cluster |
| 111 | + |
| 112 | +Deploy GPU Operator, KAI Scheduler, and the AzureML extension. From the repository root: |
| 113 | + |
| 114 | +```bash |
| 115 | +cd deploy/002-setup |
| 116 | +bash 01-deploy-robotics-charts.sh |
| 117 | +bash 02-deploy-azureml-extension.sh |
| 118 | +``` |
| 119 | + |
| 120 | +Verify GPU operator pods: |
| 121 | + |
| 122 | +```bash |
| 123 | +kubectl get pods -n gpu-operator |
| 124 | +``` |
| 125 | + |
| 126 | +## Step 6: Deploy OSMO Components |
| 127 | + |
| 128 | +Deploy the OSMO control plane and backend using Access Keys authentication. Run these scripts from `deploy/002-setup`, the directory you entered in Step 5. |
| 129 | + |
| 130 | +```bash |
| 131 | +bash 03-deploy-osmo-control-plane.sh |
| 132 | +bash 04-deploy-osmo-backend.sh --use-access-keys |
| 133 | +``` |
| 134 | + |
| 135 | +Verify OSMO pods: |
| 136 | + |
| 137 | +```bash |
| 138 | +kubectl get pods -n osmo-control-plane |
| 139 | +``` |
| 140 | + |
| 141 | +## Step 7: Submit First Training Job |
| 142 | + |
| 143 | +Navigate to the scripts directory and submit a training job. From the repository root: |
| 144 | + |
| 145 | +```bash |
| 146 | +cd scripts |
| 147 | +bash submit-osmo-training.sh |
| 148 | +``` |
| 149 | + |
| 150 | +Scripts auto-detect configuration from Terraform outputs. Override values with CLI arguments or environment variables as needed. See [Scripts](../../scripts/README.md) for all submission options. |
| 151 | + |
| 152 | +## Step 8: Verify Results |
| 153 | + |
| 154 | +Confirm the training job is running: |
| 155 | + |
| 156 | +```bash |
| 157 | +kubectl get pods -n osmo-control-plane --watch |
| 158 | +``` |
| 159 | + |
| 160 | +Check OSMO training status through the OSMO web UI or query pod logs: |
| 161 | + |
| 162 | +```bash |
| 163 | +kubectl logs -n osmo-control-plane -l app=osmo-training --tail=50 |
| 164 | +``` |
| 165 | + |
| 166 | +## Cleanup |
| 167 | + |
| 168 | +Destroy all infrastructure when finished to stop incurring costs. From the repository root: |
| 169 | + |
| 170 | +```bash |
| 171 | +cd deploy/001-iac |
| 172 | +terraform destroy |
| 173 | +``` |
| 174 | + |
| 175 | +See [Cost Considerations](../contributing/cost-considerations.md) for detailed pricing. |
| 176 | + |
| 177 | +## Next Steps |
| 178 | + |
| 179 | +| Resource | Description | |
| 180 | +| --- | --- | |
| 181 | +| [LeRobot Inference](../lerobot-inference.md) | Run inference with trained LeRobot models | |
| 182 | +| [MLflow Integration](../mlflow-integration.md) | Track experiments with MLflow | |
| 183 | +| [Deployment Guide](../../deploy/README.md) | Full deployment reference and options | |
| 184 | +| [Contributing Guide](../contributing/README.md) | Development workflow and code standards | |