Evolve experiment plan schema: YAML frontmatter for machine-readable metadata #590
Open
Labels
recipe:research (Research recipe improvements), staged (Implementation staged and waiting for promotion to main)
Description
Summary
Update plan-experiment to produce a YAML frontmatter block at the top of experiment plan files. This structured metadata enables automated design review (review-design) to parse experiment type, hypotheses, metrics, baselines, and statistical plan without LLM extraction.
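As a rough illustration of what "parse without LLM extraction" could look like on the consumer side, the sketch below separates a leading frontmatter block from the prose plan using only the standard library. The function name `split_frontmatter` is hypothetical and not part of any existing skill; the returned frontmatter text would still need to be handed to a YAML parser.

```python
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str]:
    """Return (frontmatter_text, body); frontmatter_text is None when absent.

    Hypothetical helper: assumes the block starts on the very first line
    and is closed by the next line that consists only of '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return frontmatter, body
    # Opening delimiter without a closing one: treat as a plan without frontmatter.
    return None, text
```

Returning `None` rather than raising keeps plans without frontmatter usable, which matches the backward-compatibility goal stated later in this issue.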
Changes
plan-experiment skill update
- Add Step 3a after the prose plan is drafted: extract structured info into YAML frontmatter between `---` delimiters before the `# Experiment Plan:` heading
- Accept an optional second positional argument, `revision_guidance` (path to revision feedback from review-design). When present, read it and incorporate the feedback. When absent or empty, proceed normally (first pass).
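The optional-argument behavior could be sketched as follows. This is only an assumption about how the skill's positional arguments might be handled: `load_revision_guidance` is a hypothetical name, the first positional argument is left unspecified, and "empty" is interpreted here as either an empty argument or a file with only whitespace.

```python
from pathlib import Path
from typing import List, Optional


def load_revision_guidance(args: List[str]) -> Optional[str]:
    """Return revision feedback text, or None to signal a first pass.

    args[0] is whatever the first positional argument already is;
    args[1], when given, is taken to be the revision_guidance path.
    """
    if len(args) < 2 or not args[1].strip():
        return None  # argument absent or empty: proceed normally
    text = Path(args[1]).read_text(encoding="utf-8")
    # A file containing only whitespace is treated like a missing argument.
    return text if text.strip() else None
```

A path that does not exist raises `FileNotFoundError` in this sketch; whether the skill should fail or fall back to a first pass in that case is not stated in this issue.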
Frontmatter schema
---
experiment_type: benchmark
# REQUIRED. One of: benchmark, configuration_study, causal_inference, robustness_audit, exploratory
estimand:
  # RECOMMENDED. Required when experiment_type = causal_inference.
  treatment: "{the intervention or manipulation}"
  outcome: "{the measured effect}"
  population: "{scope of units/datasets/contexts}"
  contrast: "{A vs B vs C comparison}"
hypothesis_h0: "{null hypothesis with measurable threshold}"  # REQUIRED
hypothesis_h1: "{alt hypothesis with measurable threshold}"  # REQUIRED
metrics:
  # REQUIRED, min 1
  - name: "{metric_name}"
    unit: "{unit of measurement}"
    canonical_name: "{src/metrics.rs entry or NEW}"
    collection_method: "{exact command or code path}"
    threshold: "{success threshold}"
    direction: "higher_is_better"  # optional: higher_is_better | lower_is_better | target_value
    primary: true  # optional: true for the one metric H1 references
baselines:
  # REQUIRED for benchmark/causal_inference
  - name: "{comparator name}"
    version: "{package==version or git SHA}"
    tuning_budget: "{what tuning was done, or 'default'}"
statistical_plan:
  # REQUIRED unless experiment_type = exploratory
  test: "{primary statistical test name}"
  alpha: 0.05
  power_target: 0.80
  correction_method: "Holm-Bonferroni"  # null | Bonferroni | Holm-Bonferroni | BH
  sample_size_justification: "{why N is sufficient}"
  min_detectable_effect: "{MDE in metric units}"  # optional
environment:
  # REQUIRED
  type: "custom"  # standard | custom
  spec_path: "research/{slug}/environment.yml"  # required when type=custom
success_criteria:
  # REQUIRED, all three sub-fields
  conclusive_positive: "{conditions supporting H1, referencing metrics}"
  conclusive_negative: "{conditions supporting H0}"
  inconclusive: "{conditions where no conclusion can be drawn}"
experiment_slug: "{YYYY-MM-DD-slug}"  # optional
---
Field requirements by experiment type
| Field | benchmark | configuration_study | causal_inference | robustness_audit | exploratory |
|---|---|---|---|---|---|
| experiment_type | required | required | required | required | required |
| estimand | recommended | recommended | required (with contrast) | recommended | optional |
| hypothesis_h0/h1 | required | required | required | required | required |
| metrics | required | required | required | required | required |
| baselines | required | optional | required | optional | optional |
| statistical_plan | required | required | required | required | waived |
| environment | required | required | required | required | required |
| success_criteria | required | required | required | required | required |
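The table above can be encoded as a small lookup so that tooling does not need to hard-code it in several places. The following is a minimal sketch; `requirement`, `FIELDS`, and `_OVERRIDES` are hypothetical names, `hypothesis_h0` and `hypothesis_h1` are listed as separate fields, and the second table column is assumed to correspond to the `configuration_study` enum value.

```python
EXPERIMENT_TYPES = (
    "benchmark", "configuration_study", "causal_inference",
    "robustness_audit", "exploratory",
)

FIELDS = (
    "experiment_type", "estimand", "hypothesis_h0", "hypothesis_h1",
    "metrics", "baselines", "statistical_plan", "environment",
    "success_criteria",
)

# Only fields whose level varies by type are listed; "*" is the default
# for that field. Every other field is "required" for all types.
_OVERRIDES = {
    "estimand": {"causal_inference": "required", "exploratory": "optional",
                 "*": "recommended"},
    "baselines": {"benchmark": "required", "causal_inference": "required",
                  "*": "optional"},
    "statistical_plan": {"exploratory": "waived", "*": "required"},
}


def requirement(field: str, experiment_type: str) -> str:
    """Return 'required', 'recommended', 'optional', or 'waived'."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field}")
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError(f"unknown experiment_type: {experiment_type}")
    rules = _OVERRIDES.get(field, {})
    return rules.get(experiment_type, rules.get("*", "required"))
```

For example, `requirement("baselines", "robustness_audit")` evaluates to `"optional"`, matching the corresponding table cell.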
Validation rules (applied before writing frontmatter)
V1: benchmark/causal_inference → len(baselines) >= 1 AND each baseline.version not empty
ERROR: "Benchmark/causal_inference experiments require at least one named baseline with a version"
V2: causal_inference → estimand.contrast is not null
ERROR: "causal_inference requires estimand with treatment, outcome, and contrast fields"
V3: !exploratory → statistical_plan present AND test not null
ERROR: "Non-exploratory experiments require a statistical_plan; use {test: 'none'} to waive"
V4: environment.type=custom → spec_path not null
ERROR: "Custom environment requires spec_path pointing to environment.yml"
V5: len(metrics) >= 2 → exactly one metric has primary: true
WARNING: "Multiple metrics but no primary designated; H1 threshold ambiguous"
V6: any metric.canonical_name = "NEW"
WARNING: "Plan includes NEW metrics not yet in src/metrics.rs"
V7: hypothesis_h1 has no numeric threshold
WARNING: "H1 should include a measurable numeric threshold"
V8: success_criteria.conclusive_positive should reference at least one metric.name
WARNING: "Success criteria does not reference any declared metric"
Log warnings as YAML comments (# WARNING: ...) in the frontmatter block.
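One way the rules V1 through V8 might be applied to an already-parsed frontmatter mapping is sketched below. This is an illustration under assumptions, not the skill's actual implementation: `validate` and `warning_comments` are hypothetical names, V2 checks only `contrast` as the rule is written, V7 uses "contains a digit" as a stand-in for "has a numeric threshold", and V8 uses a plain substring match on metric names.

```python
import re
from typing import Any, Dict, List, Tuple


def validate(fm: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Apply V1-V8 to a parsed frontmatter dict; return (errors, warnings)."""
    errors: List[str] = []
    warnings: List[str] = []
    etype = fm.get("experiment_type")
    baselines = fm.get("baselines") or []
    metrics = fm.get("metrics") or []
    estimand = fm.get("estimand") or {}
    plan = fm.get("statistical_plan") or {}
    env = fm.get("environment") or {}
    criteria = fm.get("success_criteria") or {}

    # V1: at least one baseline, each with a non-empty version.
    if etype in ("benchmark", "causal_inference"):
        if not baselines or any(not b.get("version") for b in baselines):
            errors.append("Benchmark/causal_inference experiments require "
                          "at least one named baseline with a version")
    # V2: causal_inference needs estimand.contrast.
    if etype == "causal_inference" and not estimand.get("contrast"):
        errors.append("causal_inference requires estimand with treatment, "
                      "outcome, and contrast fields")
    # V3: non-exploratory plans need statistical_plan.test.
    if etype != "exploratory" and not plan.get("test"):
        errors.append("Non-exploratory experiments require a "
                      "statistical_plan; use {test: 'none'} to waive")
    # V4: custom environments need a spec_path.
    if env.get("type") == "custom" and not env.get("spec_path"):
        errors.append("Custom environment requires spec_path pointing "
                      "to environment.yml")
    # V5: with two or more metrics, exactly one must be primary.
    primary_count = sum(1 for m in metrics if m.get("primary") is True)
    if len(metrics) >= 2 and primary_count != 1:
        warnings.append("Multiple metrics but no primary designated; "
                        "H1 threshold ambiguous")
    # V6: metrics not yet registered.
    if any(m.get("canonical_name") == "NEW" for m in metrics):
        warnings.append("Plan includes NEW metrics not yet in src/metrics.rs")
    # V7: heuristic, any digit counts as a numeric threshold.
    if not re.search(r"\d", fm.get("hypothesis_h1") or ""):
        warnings.append("H1 should include a measurable numeric threshold")
    # V8: conclusive_positive should mention a declared metric name.
    positive = criteria.get("conclusive_positive") or ""
    if not any(m.get("name") and m["name"] in positive for m in metrics):
        warnings.append("Success criteria does not reference any "
                        "declared metric")
    return errors, warnings


def warning_comments(warnings: List[str]) -> List[str]:
    """Render warnings as YAML comment lines for the frontmatter block."""
    return ["# WARNING: " + w for w in warnings]
```

Keeping errors and warnings in separate lists lets the caller refuse to write frontmatter when any error is present while still emitting the warnings as comments.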
Prose section ↔ frontmatter mapping
| Prose Section | Frontmatter Field(s) |
|---|---|
| ## Hypothesis (H0/H1 bold labels) | hypothesis_h0, hypothesis_h1, estimand |
| ## Independent Variables table | estimand.contrast, baselines[] |
| ## Dependent Variables (Metrics) table | metrics[] |
| ## Environment | environment |
| ## Analysis Plan | statistical_plan |
| ## Success Criteria | success_criteria |
| ## Experiment Directory Layout | experiment_slug |
Backward compatibility
- Plans without frontmatter must still be consumable by all downstream skills (scope, review-design, implement-experiment, run-experiment, write-report)
- Frontmatter is additive — all existing prose sections remain unchanged
- review-design handles missing frontmatter via targeted LLM extraction fallback (per-field, not whole-plan)
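The per-field fallback could look roughly like the sketch below. How review-design actually performs extraction is not specified in this issue, so the extractor is passed in as a callable; `resolve_fields`, `FALLBACK_FIELDS`, and the `extract_field(field, prose)` signature are all hypothetical.

```python
from typing import Any, Callable, Dict, Optional

# Top-level fields a reviewer would want; experiment_slug is left out
# because the schema marks it optional.
FALLBACK_FIELDS = (
    "experiment_type", "estimand", "hypothesis_h0", "hypothesis_h1",
    "metrics", "baselines", "statistical_plan", "environment",
    "success_criteria",
)


def resolve_fields(
    frontmatter: Optional[Dict[str, Any]],
    prose: str,
    extract_field: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Use frontmatter values where present; extract only the missing fields."""
    resolved = dict(frontmatter or {})
    for field in FALLBACK_FIELDS:
        if resolved.get(field) is None:
            # Targeted fallback: one extraction call per missing field,
            # rather than re-extracting the whole plan.
            resolved[field] = extract_field(field, prose)
    return resolved
```

With this shape, a plan that has no frontmatter at all triggers one extraction call per field, while a plan with complete frontmatter triggers none.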
Tests
- plan-experiment output has valid YAML frontmatter with at least experiment_type, hypothesis_h0, hypothesis_h1, metrics
- All 8 validation rules apply correctly for each experiment type
- Plans without frontmatter are handled gracefully by downstream skills
- plan-experiment accepts and uses optional revision_guidance second argument
- plan-experiment without second argument works identically to current behavior
Dependencies
Depends on #589 (recipe simplification — plan-experiment needs to accept revision_guidance arg)