Open
Description
Summary
Build the core evaluation engine. It loads a LeRobot dataset and a trained PyTorch policy checkpoint, replays each episode's observations through the policy network, and compares the predicted actions against the ground-truth actions recorded in the dataset. It must support the Diffusion Policy, ACT (Action Chunking with Transformers), and BC (behavior cloning) architectures, and its output must conform to the metrics schema.
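As a rough illustration of the replay-and-compare step, here is a minimal sketch. It assumes the policy is a stateless `nn.Module` that maps a batch of observations of shape (T, obs_dim) directly to actions of shape (T, n_joints), which is only plausible for a simple BC model; ACT and Diffusion Policy would need an adapter in front of it. The function name `replay_episode`, its parameters, and the returned keys are hypothetical and not taken from the repository.

```python
import torch
from torch import nn


def replay_episode(policy: nn.Module, observations: torch.Tensor,
                   actions: torch.Tensor, expected_obs_dim: int,
                   batch_size: int = 256) -> dict:
    """Replay one episode and compare predictions to ground truth.

    observations: (T, obs_dim); actions: (T, n_joints).
    Hypothetical helper: assumes a stateless obs -> action policy.
    """
    # Fail fast, before any inference, if the observation width is wrong.
    if observations.shape[-1] != expected_obs_dim:
        raise ValueError(
            f"observation dim {observations.shape[-1]} does not match "
            f"policy input dim {expected_obs_dim}"
        )

    # Run inference on whatever device the policy already lives on.
    device = next(policy.parameters()).device
    policy.eval()
    predictions = []
    with torch.no_grad():
        for start in range(0, observations.shape[0], batch_size):
            batch = observations[start:start + batch_size].to(device)
            predictions.append(policy(batch).cpu())
    predicted = torch.cat(predictions)

    # Squared error for every (timestep, joint) pair, then its reductions.
    squared_error = (predicted - actions) ** 2
    return {
        "squared_error": squared_error,                    # (T, n_joints)
        "mse_per_timestep": squared_error.mean(dim=1),     # (T,)
        "mse_per_joint": squared_error.mean(dim=0),        # (n_joints,)
        "mse": squared_error.mean().item(),                # scalar
    }
```

Batching over timesteps keeps GPU memory bounded for long episodes, and moving the policy to a GPU before calling the function is enough to run inference there.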
Deliverables
- Rollout engine module (src/robo_il/evaluation/rollout_engine.py)
- Dataset and checkpoint loader (src/robo_il/evaluation/loader.py)
- Architecture adapter layer for Diffusion Policy, ACT, BC
- CLI: robo-il eval run --dataset URL --checkpoint URL
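The command shape in the last deliverable could be wired up roughly as follows. This is a sketch using the standard-library `argparse`; the issue does not say which CLI framework the project uses, so that choice and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `robo-il eval run` (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="robo-il")
    commands = parser.add_subparsers(dest="command", required=True)

    # `robo-il eval ...` groups the evaluation subcommands.
    eval_parser = commands.add_parser("eval", help="evaluation commands")
    eval_commands = eval_parser.add_subparsers(dest="eval_command", required=True)

    # `robo-il eval run --dataset URL --checkpoint URL`
    run = eval_commands.add_parser("run", help="evaluate a checkpoint on a dataset")
    run.add_argument("--dataset", required=True, metavar="URL",
                     help="location of the LeRobot dataset")
    run.add_argument("--checkpoint", required=True, metavar="URL",
                     help="location of the PyTorch policy checkpoint")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dataset, args.checkpoint)
```

Marking both options as required makes the CLI exit with a usage error before any dataset or checkpoint download starts.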
Acceptance Criteria
- Load LeRobot dataset + PyTorch policy checkpoint
- For each episode: feed observations to policy, collect predicted actions
- Compute action MSE between predicted and ground truth (per timestep, per joint)
- Support architectures: Diffusion Policy, ACT, BC
- Output: evaluation results JSON per episode
- Batched inference on GPU for throughput
- Fail fast if observation dimensions mismatch policy input shape
- Results cached to blob storage keyed by hash(dataset_id + checkpoint_id)
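One way the cache key in the last criterion might be derived is sketched below. Two details are assumptions rather than part of the issue: the key is a SHA-256 digest, because Python's built-in `hash()` on strings is randomized per process and so is unsuitable for a persistent cache, and the two IDs are JSON-encoded instead of concatenated, because plain concatenation would give ("ab", "c") and ("a", "bc") the same key. The `cache_key` name is hypothetical.

```python
import hashlib
import json


def cache_key(dataset_id: str, checkpoint_id: str) -> str:
    """Return a stable hex key for a (dataset, checkpoint) pair."""
    # JSON keeps the two fields unambiguous; sort_keys fixes the byte order.
    payload = json.dumps(
        {"dataset_id": dataset_id, "checkpoint_id": checkpoint_id},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The resulting 64-character hex string can be used directly as a blob name, and it stays the same across processes and machines for the same pair of IDs.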