feat(evaluation): policy rollout engine for offline evaluation #260

@mayurpatel312

Description

Summary

Build the core evaluation engine that loads a LeRobot dataset and a trained PyTorch policy checkpoint, replays each episode's observations through the policy network, and compares predicted actions against the ground-truth actions. Supports the Diffusion Policy, ACT, and behavior cloning (BC) architectures. Outputs evaluation results conforming to the metrics schema.
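The per-episode replay described above could look roughly like the sketch below. Names such as `rollout_episode` and the episode iteration shape are illustrative assumptions, not the actual `robo_il` or LeRobot API.

```python
# Hypothetical per-episode rollout loop: feed each recorded observation
# through the policy and collect predictions alongside ground truth.
import torch

def rollout_episode(policy, episode):
    """Replay one episode's observations through the policy offline.

    `episode` is assumed to yield (observation, ground_truth_action)
    tensor pairs; this is a placeholder for the real dataset interface.
    """
    predicted, ground_truth = [], []
    with torch.no_grad():  # inference only, no gradients needed
        for obs, action in episode:
            pred = policy(obs.unsqueeze(0))  # add batch dim of 1
            predicted.append(pred.squeeze(0))
            ground_truth.append(action)
    return torch.stack(predicted), torch.stack(ground_truth)
```

In practice the acceptance criteria call for batched GPU inference, so observations would be stacked and pushed through the policy in chunks rather than one timestep at a time.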

Deliverables

  • Rollout engine module (src/robo_il/evaluation/rollout_engine.py)
  • Dataset and checkpoint loader (src/robo_il/evaluation/loader.py)
  • Architecture adapter layer for Diffusion Policy, ACT, BC
  • CLI: robo-il eval run --dataset URL --checkpoint URL
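The adapter layer deliverable could be sketched as a small abstract interface that each architecture wraps. Class names here (`PolicyAdapter`, `BCAdapter`) are assumptions for illustration, not the actual module contents.

```python
# Hypothetical adapter layer normalizing per-architecture inference
# behind a single predict() call.
from abc import ABC, abstractmethod

import torch

class PolicyAdapter(ABC):
    """Common inference interface over the supported architectures."""

    @abstractmethod
    def predict(self, observations: torch.Tensor) -> torch.Tensor:
        """Map a batch of observations to a batch of predicted actions."""

class BCAdapter(PolicyAdapter):
    """BC is direct regression: one forward pass per batch."""

    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()

    def predict(self, observations: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.model(observations)

# Diffusion Policy and ACT adapters would hide their own sampling and
# chunked-action inference behind the same predict() signature.
```

Keeping the rollout engine coded against `PolicyAdapter` means adding a fourth architecture later touches only the adapter layer.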

Acceptance Criteria

  • Load LeRobot dataset + PyTorch policy checkpoint
  • For each episode: feed observations to policy, collect predicted actions
  • Compute action MSE between predicted and ground truth (per timestep, per joint)
  • Support architectures: Diffusion Policy, ACT, BC
  • Output: evaluation results JSON per episode
  • Batched inference on GPU for throughput
  • Fail fast if observation dimensions do not match the policy's input shape
  • Results cached to blob storage keyed by hash(dataset_id + checkpoint_id)
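Two of the criteria above can be made concrete with a short sketch: the per-timestep, per-joint MSE, and the cache key. Using `hashlib.sha256` for `hash(dataset_id + checkpoint_id)` is an assumption about how the key might be realized, not a settled decision.

```python
# Sketch of the per-timestep, per-joint action error and the
# blob-storage cache key; both are illustrative, not final APIs.
import hashlib

import torch

def action_mse(predicted: torch.Tensor, ground_truth: torch.Tensor) -> torch.Tensor:
    """Squared error with shape (timesteps, num_joints).

    Fails fast on a shape mismatch, per the acceptance criteria.
    """
    if predicted.shape != ground_truth.shape:
        raise ValueError(
            f"shape mismatch: {tuple(predicted.shape)} vs {tuple(ground_truth.shape)}"
        )
    return (predicted - ground_truth) ** 2

def cache_key(dataset_id: str, checkpoint_id: str) -> str:
    """Deterministic key for cached results (sha256 is an assumption)."""
    return hashlib.sha256(f"{dataset_id}:{checkpoint_id}".encode()).hexdigest()
```

Keeping the error unreduced (per timestep, per joint) lets downstream reporting aggregate however the metrics schema requires, e.g. mean over time, max over joints, or both.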
