feat(evaluation): add ROVE task-level domain evaluation module #328

@niksacdev

Feature Proposal

Full design proposal: https://gist.github.com/niksacdev/53a256771240bf338ac135011da840e5

Problem Statement

The reference architecture provides strong training and inference capabilities. ROVE extends this with a structured evaluation layer — enabling teams to validate whether their complete agent stacks (perception, planning, control, verification) solve domain-specific tasks end-to-end, rather than only benchmarking individual models.

For example, in automotive parts kitting, teams need to evaluate whether their chosen VLM can distinguish M6 from M8 fasteners under factory lighting, whether the policy grasps correctly, and whether the full pipeline places parts accurately — including cost-accuracy tradeoffs (e.g., GPT-4.1-mini vs. Phi-4-Multimodal).

Proposed Solution

ROVE (Robot Observation & Vision Evaluation) — a domain evaluation module for complete agent stacks.

Core Design

  • Deterministic pipeline: perceive → plan → act → verify — enables clear failure attribution (agent failure vs. orchestrator failure)
  • Adapter Protocol pattern: typed Protocols for each component (VLM, policy, simulator), enabling model substitution via YAML config
  • VLM-as-Judge calibration: measures VLM judge agreement with physics ground truth — critical for production where simulation isn't available
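The adapter Protocol pattern above can be sketched in a few lines. This is a minimal illustration, not the proposal's actual interface: the names `VLMAdapter`, `MockVLM`, `build_vlm`, and the config shape are all hypothetical.

```python
from typing import Protocol


class VLMAdapter(Protocol):
    """Hypothetical typed Protocol for the VLM component; ROVE's real interface may differ."""

    def perceive(self, image_path: str, prompt: str) -> str:
        """Return the VLM's textual observation of the scene."""
        ...


class MockVLM:
    """Deterministic stand-in, in the spirit of the Phase 1 mock-first foundation."""

    def perceive(self, image_path: str, prompt: str) -> str:
        return f"mock observation for: {prompt}"


# Registry keyed by the model name a YAML config would supply.
ADAPTERS: dict[str, type] = {"mock": MockVLM}


def build_vlm(config: dict) -> VLMAdapter:
    # e.g. config parsed from YAML: {"vlm": {"name": "mock"}}
    return ADAPTERS[config["vlm"]["name"]]()


vlm = build_vlm({"vlm": {"name": "mock"}})
print(vlm.perceive("scene.png", "identify the fastener"))
```

Because `Protocol` uses structural typing, any class with a matching `perceive` signature satisfies `VLMAdapter` without inheriting from it, which is what makes model substitution via config possible.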

Three Evaluation Workflows

  1. Single task — Does this stack solve the task?
  2. VLM domain comparison — Which VLM understands this workspace best?
  3. Variation study — How robust is the stack to instruction rephrasing?
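The VLM-as-Judge calibration mentioned under Core Design reduces to a simple agreement rate between judge verdicts and physics ground truth. A sketch, assuming a hypothetical per-episode record format (the field names are illustrative, not ROVE's schema):

```python
def judge_agreement(records: list[dict]) -> float:
    """Fraction of episodes where the VLM judge's verdict matches physics ground truth."""
    matches = sum(1 for r in records if r["judge_success"] == r["physics_success"])
    return matches / len(records)


# Illustrative data only.
records = [
    {"judge_success": True, "physics_success": True},
    {"judge_success": True, "physics_success": False},  # judge over-credits the stack
    {"judge_success": False, "physics_success": False},
    {"judge_success": True, "physics_success": True},
]
print(judge_agreement(records))  # 0.75
```

A low agreement rate signals the judge cannot be trusted as a proxy in production settings where no simulator is available.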

Azure AI Foundry Integration

  • Models deployed through Foundry endpoints (GPT-4.1, o4-mini, Phi-4-Multimodal)
  • Custom evaluators registered with azure-ai-evaluation SDK
  • Results surface in Foundry dashboard
  • Foundry Local support for edge evaluation (Phi-4-Mini)
  • JSONL export compatible with Foundry's evaluation system
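JSONL export is just one JSON object per line, so results stay independently parseable and streamable. A sketch with a hypothetical row schema (the field names here are assumptions, not the fields Foundry or ROVE will actually use):

```python
import json
from pathlib import Path

# Illustrative result rows; real runs would produce these from the pipeline.
results = [
    {"task": "kit_m6_bolt", "vlm": "gpt-4.1-mini", "success": True, "stage_failed": None},
    {"task": "kit_m8_bolt", "vlm": "gpt-4.1-mini", "success": False, "stage_failed": "perceive"},
]

out = Path("rove_results.jsonl")
with out.open("w") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")  # exactly one object per line

# Each line parses on its own, with no surrounding array:
lines = out.read_text().splitlines()
print(len(lines))  # 2
```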

Implementation Phases

Phase  Focus                                        Success Metric
1      Mock-first foundation, CLI                   <5s completion, no external deps
2      Foundry VLM integration, custom evaluators   Results visible in Foundry dashboard
3      Variation studies, fine-tuning feedback      Per-phrasing success rates
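The Phase 3 success metric (per-phrasing success rates) is a straightforward group-by. A sketch over illustrative episode data — the tuples below are made up, not evaluation output:

```python
from collections import defaultdict

# (instruction phrasing, episode succeeded) — illustrative data only.
episodes = [
    ("pick up the M6 bolt", True),
    ("pick up the M6 bolt", True),
    ("grab the small bolt", False),
    ("grab the small bolt", True),
]

# phrasing -> [successes, trials]
totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
for phrasing, ok in episodes:
    totals[phrasing][0] += int(ok)
    totals[phrasing][1] += 1

rates = {p: s / n for p, (s, n) in totals.items()}
print(rates)
```

A spread in `rates` across semantically equivalent phrasings quantifies how sensitive the stack is to instruction rephrasing, which is exactly what the variation study is meant to surface.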

Alternatives Considered

  • Existing benchmarks (LIBERO, etc.) — evaluate individual models, not composed agent stacks on domain-specific tasks
  • LLM-routed orchestration — rejected in favor of deterministic pipeline for clear failure attribution

Additional Context

Integrates with existing repo infrastructure (src/training/, src/inference/). Phase 1 can proceed with no Azure dependencies (mock adapters only).
