## Feature Proposal
Full design proposal: https://gist.github.com/niksacdev/53a256771240bf338ac135011da840e5
### Problem Statement
The reference architecture provides strong training and inference capabilities. ROVE extends this with a structured evaluation layer — enabling teams to validate whether their complete agent stacks (perception, planning, control, verification) solve domain-specific tasks end-to-end, rather than only benchmarking individual models.
For example, in automotive parts kitting, teams need to evaluate whether their chosen VLM can distinguish M6 from M8 fasteners under factory lighting, whether the policy grasps correctly, and whether the full pipeline places parts accurately — including cost-accuracy tradeoffs (e.g., GPT-4.1-mini vs. Phi-4-Multimodal).
### Proposed Solution
ROVE (Robot Observation & Vision Evaluation) — a domain evaluation module for complete agent stacks.
#### Core Design
- Deterministic pipeline: perceive → plan → act → verify, with clear failure attribution (agent failure vs. orchestrator failure)
- Adapter Protocol pattern: typed Protocols for each component (VLM, policy, simulator), enabling model substitution via YAML config
- VLM-as-Judge calibration: measures VLM judge agreement with physics ground truth — critical for production where simulation isn't available
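The Adapter Protocol pattern above can be sketched with `typing.Protocol`, which uses structural typing so any conforming class (real or mock) can be substituted without inheritance. The adapter names and method signatures here are illustrative placeholders, not ROVE's actual interfaces:

```python
from typing import Protocol


class VLMAdapter(Protocol):
    """Perception component: answers a question about an image."""
    def query(self, image: bytes, prompt: str) -> str: ...


class PolicyAdapter(Protocol):
    """Control component: maps an observation to an action vector."""
    def act(self, observation: dict) -> list[float]: ...


class MockVLM:
    """Dependency-free stand-in; satisfies VLMAdapter structurally."""
    def query(self, image: bytes, prompt: str) -> str:
        return "M6"


def run_perceive_step(vlm: VLMAdapter, image: bytes) -> str:
    # Any object matching the Protocol can be swapped in via config.
    return vlm.query(image, "Which fastener size is shown?")


print(run_perceive_step(MockVLM(), b""))  # prints "M6"
```

A YAML config would then only need to name which adapter class to instantiate for each slot, keeping the pipeline code unchanged when models are swapped.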
#### Three Evaluation Workflows
- Single task — Does this stack solve the task?
- VLM domain comparison — Which VLM understands this workspace best?
- Variation study — How robust is the stack to instruction rephrasing?
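The VLM-as-Judge calibration mentioned above reduces to a simple agreement metric: how often does the judge's success verdict match the physics-simulator ground truth? A minimal sketch, with illustrative data and function names:

```python
def judge_agreement(judge_verdicts: list[bool], ground_truth: list[bool]) -> float:
    """Fraction of episodes where the VLM judge's verdict matches the
    physics ground truth. 1.0 means a perfectly calibrated judge."""
    assert len(judge_verdicts) == len(ground_truth)
    matches = sum(j == g for j, g in zip(judge_verdicts, ground_truth))
    return matches / len(ground_truth)


# Illustrative data: the judge disagrees with physics on 1 of 4 episodes.
rate = judge_agreement([True, True, False, True], [True, False, False, True])
print(rate)  # 0.75
```

A high agreement rate justifies trusting the VLM judge in production settings where no simulator ground truth exists.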
#### Azure AI Foundry Integration
- Models deployed through Foundry endpoints (GPT-4.1, o4-mini, Phi-4-Multimodal)
- Custom evaluators registered with the `azure-ai-evaluation` SDK
- Results surface in the Foundry dashboard
- Foundry Local support for edge evaluation (Phi-4-Mini)
- JSONL export compatible with Foundry's evaluation system
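JSONL export is one record per line, each a standalone JSON object. A minimal sketch of the writer; the field names below are illustrative placeholders, not the exact schema expected by Foundry's evaluation system:

```python
import json


def export_jsonl(results: list[dict], path: str) -> None:
    """Write one evaluation record per line (JSON Lines format)."""
    with open(path, "w") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")


# Illustrative records; keys are placeholders, not the Foundry schema.
results = [
    {"task": "kitting-m6", "model": "gpt-4.1-mini", "success": True, "steps": 12},
    {"task": "kitting-m6", "model": "phi-4-multimodal", "success": False, "steps": 20},
]
export_jsonl(results, "eval_results.jsonl")
```

Because each line is independent, the file can be streamed, appended to across runs, and uploaded to an evaluation dashboard without re-parsing the whole document.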
### Implementation Phases
| Phase | Focus | Success Metric |
|---|---|---|
| 1 | Mock-first foundation, CLI | <5s completion, no external deps |
| 2 | Foundry VLM integration, custom evaluators | Results visible in Foundry dashboard |
| 3 | Variation studies, fine-tuning feedback | Per-phrasing success rates |
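The Phase 3 success metric (per-phrasing success rates) can be sketched as a small aggregation loop. The `rollout` callable and phrasings below are illustrative stubs, not the real stack:

```python
def variation_study(rollout, phrasings: list[str], trials: int = 10) -> dict[str, float]:
    """Run `trials` rollouts per instruction phrasing and report the
    success rate for each, exposing sensitivity to rephrasing."""
    rates = {}
    for phrasing in phrasings:
        successes = sum(rollout(phrasing) for _ in range(trials))
        rates[phrasing] = successes / trials
    return rates


# Illustrative stub: pretend the stack only handles imperative "Place..." phrasings.
def fake_rollout(phrasing: str) -> bool:
    return phrasing.startswith("Place")


rates = variation_study(
    fake_rollout,
    ["Place the M6 bolt in bin A", "Put the bolt away"],
    trials=4,
)
print(rates)  # {'Place the M6 bolt in bin A': 1.0, 'Put the bolt away': 0.0}
```

Phrasings with low rates become natural candidates for the fine-tuning feedback loop named in the table.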
### Alternatives Considered
- Existing benchmarks (LIBERO, etc.) — evaluate individual models, not composed agent stacks on domain-specific tasks
- LLM-routed orchestration — rejected in favor of deterministic pipeline for clear failure attribution
### Additional Context
Integrates with existing repo infrastructure (`src/training/`, `src/inference/`). Phase 1 can proceed with no Azure dependencies (mock adapters only).
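The mock-first Phase 1 idea can be sketched as a deterministic four-stage pipeline over pure functions, with no Azure dependencies. The stage implementations and return shapes below are illustrative, not ROVE's actual API:

```python
def run_pipeline(stages, obs):
    """Run perceive → plan → act → verify in fixed order.
    On failure, report which stage raised, giving clear attribution."""
    for name, stage in stages:
        try:
            obs = stage(obs)
        except Exception as exc:
            return {"success": False, "failed_stage": name, "error": str(exc)}
    return {"success": bool(obs), "failed_stage": None, "result": obs}


# Phase 1 mock stages: pure functions, runnable with no external services.
mock_stages = [
    ("perceive", lambda img: "M6"),
    ("plan", lambda label: ["grasp", "place"]),
    ("act", lambda steps: {"placed": True}),
    ("verify", lambda state: state["placed"]),
]

print(run_pipeline(mock_stages, b"")["success"])  # prints True
```

Swapping any mock for a Foundry-backed adapter in Phase 2 changes only the `stages` list, so the <5s, dependency-free Phase 1 tests keep working unchanged.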