feat(evaluation): policy rollout engine for offline evaluation #260

@mayurpatel312

Description

Summary

Build the core evaluation engine that loads a LeRobot dataset and a trained PyTorch policy checkpoint, replays each episode's observations through the policy network, and compares predicted actions against the ground-truth actions. Supports the Diffusion Policy, ACT, and behavior cloning (BC) architectures. Outputs evaluation results conforming to the metrics schema.
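The per-episode replay described above could look roughly like the sketch below. Names such as `rollout_episode` and the episode iteration shape are illustrative assumptions, not the actual `robo_il` or LeRobot API.

```python
# Hypothetical per-episode rollout loop: feed each recorded observation
# through the policy and collect predictions alongside ground truth.
import torch

def rollout_episode(policy, episode):
    """Replay one episode's observations through the policy offline.

    `episode` is assumed to yield (observation, ground_truth_action)
    tensor pairs; this is a placeholder for the real dataset interface.
    """
    predicted, ground_truth = [], []
    with torch.no_grad():  # inference only, no gradients needed
        for obs, action in episode:
            pred = policy(obs.unsqueeze(0))  # add batch dim of 1
            predicted.append(pred.squeeze(0))
            ground_truth.append(action)
    return torch.stack(predicted), torch.stack(ground_truth)
```

In practice the acceptance criteria call for batched GPU inference, so observations would be stacked and pushed through the policy in chunks rather than one timestep at a time.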

Deliverables

  • Rollout engine module (src/robo_il/evaluation/rollout_engine.py)
  • Dataset and checkpoint loader (src/robo_il/evaluation/loader.py)
  • Architecture adapter layer for Diffusion Policy, ACT, BC
  • CLI: robo-il eval run --dataset URL --checkpoint URL
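The adapter layer deliverable could be sketched as a small abstract interface that each architecture wraps. Class names here (`PolicyAdapter`, `BCAdapter`) are assumptions for illustration, not the actual module contents.

```python
# Hypothetical adapter layer normalizing per-architecture inference
# behind a single predict() call.
from abc import ABC, abstractmethod

import torch

class PolicyAdapter(ABC):
    """Common inference interface over the supported architectures."""

    @abstractmethod
    def predict(self, observations: torch.Tensor) -> torch.Tensor:
        """Map a batch of observations to a batch of predicted actions."""

class BCAdapter(PolicyAdapter):
    """BC is direct regression: one forward pass per batch."""

    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()

    def predict(self, observations: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.model(observations)

# Diffusion Policy and ACT adapters would hide their own sampling and
# chunked-action inference behind the same predict() signature.
```

Keeping the rollout engine coded against `PolicyAdapter` means adding a fourth architecture later touches only the adapter layer.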

Acceptance Criteria

  • Load LeRobot dataset + PyTorch policy checkpoint
  • For each episode: feed observations to policy, collect predicted actions
  • Compute action MSE between predicted and ground truth (per timestep, per joint)
  • Support architectures: Diffusion Policy, ACT, BC
  • Output: evaluation results JSON per episode
  • Batched inference on GPU for throughput
  • Fail fast if observation dimensions do not match the policy's input shape
  • Results cached to blob storage keyed by hash(dataset_id + checkpoint_id)
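Two of the criteria above can be made concrete with a short sketch: the per-timestep, per-joint MSE, and the cache key. Using `hashlib.sha256` for `hash(dataset_id + checkpoint_id)` is an assumption about how the key might be realized, not a settled decision.

```python
# Sketch of the per-timestep, per-joint action error and the
# blob-storage cache key; both are illustrative, not final APIs.
import hashlib

import torch

def action_mse(predicted: torch.Tensor, ground_truth: torch.Tensor) -> torch.Tensor:
    """Squared error with shape (timesteps, num_joints).

    Fails fast on a shape mismatch, per the acceptance criteria.
    """
    if predicted.shape != ground_truth.shape:
        raise ValueError(
            f"shape mismatch: {tuple(predicted.shape)} vs {tuple(ground_truth.shape)}"
        )
    return (predicted - ground_truth) ** 2

def cache_key(dataset_id: str, checkpoint_id: str) -> str:
    """Deterministic key for cached results (sha256 is an assumption)."""
    return hashlib.sha256(f"{dataset_id}:{checkpoint_id}".encode()).hexdigest()
```

Keeping the error unreduced (per timestep, per joint) lets downstream reporting aggregate however the metrics schema requires, e.g. mean over time, max over joints, or both.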
