feat(evaluation): add ROVE task-level domain evaluation module #328

@niksacdev

Feature Proposal

Full design proposal: https://gist.github.com/niksacdev/53a256771240bf338ac135011da840e5

Problem Statement

The reference architecture provides strong training and inference capabilities. ROVE extends this with a structured evaluation layer — enabling teams to validate whether their complete agent stacks (perception, planning, control, verification) solve domain-specific tasks end-to-end, rather than only benchmarking individual models.

For example, in automotive parts kitting, teams need to evaluate whether their chosen VLM can distinguish M6 from M8 fasteners under factory lighting, whether the policy grasps correctly, and whether the full pipeline places parts accurately — including cost-accuracy tradeoffs (e.g., GPT-4.1-mini vs. Phi-4-Multimodal).

Proposed Solution

ROVE (Robot Observation & Vision Evaluation) — a domain evaluation module for complete agent stacks.

Core Design

  • Deterministic pipeline: perceive → plan → act → verify — enables clear failure attribution (agent failure vs. orchestrator failure)
  • Adapter Protocol pattern: typed Protocols for each component (VLM, policy, simulator), enabling model substitution via YAML config
  • VLM-as-Judge calibration: measures VLM judge agreement with physics ground truth — critical for production where simulation isn't available
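The adapter Protocol pattern above can be sketched in a few lines. This is a minimal illustration, not the proposal's actual interface: the names `VLMAdapter`, `MockVLM`, `build_vlm`, and the config shape are all hypothetical.

```python
from typing import Protocol


class VLMAdapter(Protocol):
    """Hypothetical typed Protocol for the VLM component; ROVE's real interface may differ."""

    def perceive(self, image_path: str, prompt: str) -> str:
        """Return the VLM's textual observation of the scene."""
        ...


class MockVLM:
    """Deterministic stand-in, in the spirit of the Phase 1 mock-first foundation."""

    def perceive(self, image_path: str, prompt: str) -> str:
        return f"mock observation for: {prompt}"


# Registry keyed by the model name a YAML config would supply.
ADAPTERS: dict[str, type] = {"mock": MockVLM}


def build_vlm(config: dict) -> VLMAdapter:
    # e.g. config parsed from YAML: {"vlm": {"name": "mock"}}
    return ADAPTERS[config["vlm"]["name"]]()


vlm = build_vlm({"vlm": {"name": "mock"}})
print(vlm.perceive("scene.png", "identify the fastener"))
```

Because `Protocol` uses structural typing, any class with a matching `perceive` signature satisfies `VLMAdapter` without inheriting from it, which is what makes model substitution via config possible.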

Three Evaluation Workflows

  1. Single task — Does this stack solve the task?
  2. VLM domain comparison — Which VLM understands this workspace best?
  3. Variation study — How robust is the stack to instruction rephrasing?
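The VLM-as-Judge calibration mentioned under Core Design reduces to a simple agreement rate between judge verdicts and physics ground truth. A sketch, assuming a hypothetical per-episode record format (the field names are illustrative, not ROVE's schema):

```python
def judge_agreement(records: list[dict]) -> float:
    """Fraction of episodes where the VLM judge's verdict matches physics ground truth."""
    matches = sum(1 for r in records if r["judge_success"] == r["physics_success"])
    return matches / len(records)


# Illustrative data only.
records = [
    {"judge_success": True, "physics_success": True},
    {"judge_success": True, "physics_success": False},  # judge over-credits the stack
    {"judge_success": False, "physics_success": False},
    {"judge_success": True, "physics_success": True},
]
print(judge_agreement(records))  # 0.75
```

A low agreement rate signals the judge cannot be trusted as a proxy in production settings where no simulator is available.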

Azure AI Foundry Integration

  • Models deployed through Foundry endpoints (GPT-4.1, o4-mini, Phi-4-Multimodal)
  • Custom evaluators registered with azure-ai-evaluation SDK
  • Results surface in Foundry dashboard
  • Foundry Local support for edge evaluation (Phi-4-Mini)
  • JSONL export compatible with Foundry's evaluation system
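JSONL export is just one JSON object per line, so results stay independently parseable and streamable. A sketch with a hypothetical row schema (the field names here are assumptions, not the fields Foundry or ROVE will actually use):

```python
import json
from pathlib import Path

# Illustrative result rows; real runs would produce these from the pipeline.
results = [
    {"task": "kit_m6_bolt", "vlm": "gpt-4.1-mini", "success": True, "stage_failed": None},
    {"task": "kit_m8_bolt", "vlm": "gpt-4.1-mini", "success": False, "stage_failed": "perceive"},
]

out = Path("rove_results.jsonl")
with out.open("w") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")  # exactly one object per line

# Each line parses on its own, with no surrounding array:
lines = out.read_text().splitlines()
print(len(lines))  # 2
```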

Implementation Phases

Phase  Focus                                        Success Metric
1      Mock-first foundation, CLI                   <5s completion, no external deps
2      Foundry VLM integration, custom evaluators   Results visible in Foundry dashboard
3      Variation studies, fine-tuning feedback      Per-phrasing success rates
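The Phase 3 success metric (per-phrasing success rates) is a straightforward group-by. A sketch over illustrative episode data — the tuples below are made up, not evaluation output:

```python
from collections import defaultdict

# (instruction phrasing, episode succeeded) — illustrative data only.
episodes = [
    ("pick up the M6 bolt", True),
    ("pick up the M6 bolt", True),
    ("grab the small bolt", False),
    ("grab the small bolt", True),
]

# phrasing -> [successes, trials]
totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
for phrasing, ok in episodes:
    totals[phrasing][0] += int(ok)
    totals[phrasing][1] += 1

rates = {p: s / n for p, (s, n) in totals.items()}
print(rates)
```

A spread in `rates` across semantically equivalent phrasings quantifies how sensitive the stack is to instruction rephrasing, which is exactly what the variation study is meant to surface.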

Alternatives Considered

  • Existing benchmarks (LIBERO, etc.) — evaluate individual models, not composed agent stacks on domain-specific tasks
  • LLM-routed orchestration — rejected in favor of deterministic pipeline for clear failure attribution

Additional Context

Integrates with existing repo infrastructure (src/training/, src/inference/). Phase 1 can proceed with no Azure dependencies (mock adapters only).
