
Evolve experiment plan schema: YAML frontmatter for machine-readable metadata #590

@Trecek

Summary

Update plan-experiment to produce a YAML frontmatter block at the top of experiment plan files. This structured metadata enables automated design review (review-design) to parse experiment type, hypotheses, metrics, baselines, and statistical plan without LLM extraction.

Changes

plan-experiment skill update

  • Add Step 3a after the prose plan is drafted: extract the structured information into a YAML frontmatter block, delimited by --- lines, placed before the # Experiment Plan: heading
  • Accept optional second positional argument revision_guidance (path to revision feedback from review-design). When present, read it and incorporate the feedback. When absent or empty, proceed normally (first pass).
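The optional-argument behavior above could be handled along these lines. This is a minimal sketch, not the skill's actual implementation: the function name load_revision_guidance, the shape of argv (positional arguments only), and the choice to treat a whitespace-only file as "empty" are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def load_revision_guidance(argv: list[str]) -> Optional[str]:
    """Return revision feedback text, or None on a first pass.

    Assumption: argv holds positional arguments only, and argv[1]
    (when present) is the revision_guidance path.
    """
    if len(argv) < 2 or not argv[1].strip():
        return None  # argument absent or empty: proceed as a first pass
    # A path that does not exist raises FileNotFoundError here.
    text = Path(argv[1]).read_text(encoding="utf-8")
    # Assumption: a file with no real content also counts as "empty".
    return text if text.strip() else None
```

A caller would draft the plan as usual when this returns None and fold the returned feedback into the revision otherwise.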

Frontmatter schema

---
experiment_type: benchmark
# REQUIRED. One of: benchmark, configuration_study, causal_inference, robustness_audit, exploratory

estimand:
  # RECOMMENDED. Required when experiment_type = causal_inference.
  treatment: "{the intervention or manipulation}"
  outcome: "{the measured effect}"
  population: "{scope of units/datasets/contexts}"
  contrast: "{A vs B vs C comparison}"

hypothesis_h0: "{null hypothesis with measurable threshold}"   # REQUIRED
hypothesis_h1: "{alt hypothesis with measurable threshold}"    # REQUIRED

metrics:
  # REQUIRED, min 1
  - name: "{metric_name}"
    unit: "{unit of measurement}"
    canonical_name: "{src/metrics.rs entry or NEW}"
    collection_method: "{exact command or code path}"
    threshold: "{success threshold}"
    direction: "higher_is_better"    # optional: higher_is_better | lower_is_better | target_value
    primary: true                     # optional: true for the one metric H1 references

baselines:
  # REQUIRED for benchmark/causal_inference
  - name: "{comparator name}"
    version: "{package==version or git SHA}"
    tuning_budget: "{what tuning was done, or 'default'}"

statistical_plan:
  # REQUIRED unless experiment_type = exploratory
  test: "{primary statistical test name}"
  alpha: 0.05
  power_target: 0.80
  correction_method: "Holm-Bonferroni"   # null | Bonferroni | Holm-Bonferroni | BH
  sample_size_justification: "{why N is sufficient}"
  min_detectable_effect: "{MDE in metric units}"   # optional

environment:
  # REQUIRED
  type: "custom"   # standard | custom
  spec_path: "research/{slug}/environment.yml"   # required when type=custom

success_criteria:
  # REQUIRED, all three sub-fields
  conclusive_positive: "{conditions supporting H1, referencing metrics}"
  conclusive_negative: "{conditions supporting H0}"
  inconclusive: "{conditions where no conclusion can be drawn}"

experiment_slug: "{YYYY-MM-DD-slug}"   # optional
---
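A consumer such as review-design first has to separate the frontmatter from the prose. The sketch below shows one way to do that with the standard library only. The function name split_frontmatter is hypothetical, and the decision to treat an unterminated block as "no frontmatter" is an assumption rather than something this issue specifies.

```python
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str]:
    """Split a plan file into (frontmatter_yaml, prose_body).

    Returns (None, text) when the file does not open with a '---' line
    or when no closing '---' line is found.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two delimiter lines is YAML.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return None, text  # unterminated block: treat as no frontmatter
```

The returned YAML string can then be handed to a YAML parser (for example PyYAML's yaml.safe_load, if that library is available) to obtain a dictionary of fields.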

Field requirements by experiment type

| Field | benchmark | configuration_study | causal_inference | robustness_audit | exploratory |
|---|---|---|---|---|---|
| experiment_type | required | required | required | required | required |
| estimand | recommended | recommended | required (with contrast) | recommended | optional |
| hypothesis_h0/h1 | required | required | required | required | required |
| metrics | required | required | required | required | required |
| baselines | required | optional | required | optional | optional |
| statistical_plan | required | required | required | required | waived |
| environment | required | required | required | required | required |
| success_criteria | required | required | required | required | required |
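The requirement matrix above lends itself to a lookup table. The following is an illustrative encoding, assuming the column keys follow the experiment_type enum from the schema. The names TYPES, REQUIREMENTS, and required_fields are hypothetical.

```python
TYPES = ("benchmark", "configuration_study", "causal_inference",
         "robustness_audit", "exploratory")

# One requirement level per experiment type, in the order of TYPES.
REQUIREMENTS = {
    "experiment_type":  ("required",) * 5,
    "estimand":         ("recommended", "recommended", "required",
                         "recommended", "optional"),
    "hypothesis_h0":    ("required",) * 5,
    "hypothesis_h1":    ("required",) * 5,
    "metrics":          ("required",) * 5,
    "baselines":        ("required", "optional", "required",
                         "optional", "optional"),
    "statistical_plan": ("required", "required", "required",
                         "required", "waived"),
    "environment":      ("required",) * 5,
    "success_criteria": ("required",) * 5,
}


def required_fields(experiment_type: str) -> list[str]:
    """List the fields marked 'required' for one experiment type."""
    col = TYPES.index(experiment_type)  # ValueError on an unknown type
    return [name for name, levels in REQUIREMENTS.items()
            if levels[col] == "required"]
```

For example, required_fields("exploratory") omits estimand, baselines, and statistical_plan, matching the last column of the table.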

Validation rules (applied before writing frontmatter)

V1: benchmark/causal_inference → len(baselines) >= 1 AND each baseline.version not empty
    ERROR: "Benchmark/causal_inference experiments require at least one named baseline with a version"

V2: causal_inference → estimand.contrast is not null
    ERROR: "causal_inference requires estimand with treatment, outcome, and contrast fields"

V3: !exploratory → statistical_plan present AND test not null
    ERROR: "Non-exploratory experiments require a statistical_plan; use {test: 'none'} to waive"

V4: environment.type=custom → spec_path not null
    ERROR: "Custom environment requires spec_path pointing to environment.yml"

V5: len(metrics) >= 2 → exactly one metric has primary: true
    WARNING: "Multiple metrics but not exactly one designated primary; H1 threshold ambiguous"

V6: any metric.canonical_name = "NEW"
    WARNING: "Plan includes NEW metrics not yet in src/metrics.rs"

V7: hypothesis_h1 has no numeric threshold
    WARNING: "H1 should include a measurable numeric threshold"

V8: success_criteria.conclusive_positive references no metric.name
    WARNING: "Success criteria do not reference any declared metric"

Log warnings as YAML comments (# WARNING: ...) in the frontmatter block.
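The rules V1 through V8 can be sketched as a single function over the parsed frontmatter. This is an illustration under stated assumptions, not the skill's implementation: the frontmatter is assumed to be already parsed into a plain dict, the function returns rule IDs rather than the full messages, V7 uses a crude "contains any digit" heuristic, and V8 uses plain substring matching.

```python
import re
from typing import Any


def validate(fm: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (error_rule_ids, warning_rule_ids) for a frontmatter dict."""
    errors: list[str] = []
    warnings: list[str] = []
    etype = fm.get("experiment_type")
    metrics = fm.get("metrics") or []
    baselines = fm.get("baselines") or []

    # V1: benchmark/causal_inference need >= 1 baseline, each with a version.
    if etype in ("benchmark", "causal_inference"):
        if not baselines or not all(b.get("version") for b in baselines):
            errors.append("V1")
    # V2: causal_inference needs estimand.contrast.
    if etype == "causal_inference" and not (fm.get("estimand") or {}).get("contrast"):
        errors.append("V2")
    # V3: non-exploratory plans need statistical_plan.test ('none' waives).
    if etype != "exploratory" and not (fm.get("statistical_plan") or {}).get("test"):
        errors.append("V3")
    # V4: custom environments need spec_path.
    env = fm.get("environment") or {}
    if env.get("type") == "custom" and not env.get("spec_path"):
        errors.append("V4")
    # V5: with two or more metrics, exactly one must be primary.
    if len(metrics) >= 2 and sum(bool(m.get("primary")) for m in metrics) != 1:
        warnings.append("V5")
    # V6: any metric still marked NEW.
    if any(m.get("canonical_name") == "NEW" for m in metrics):
        warnings.append("V6")
    # V7 (heuristic, assumption): H1 must contain at least one digit.
    if not re.search(r"\d", str(fm.get("hypothesis_h1") or "")):
        warnings.append("V7")
    # V8 (assumption): substring match of a metric name in conclusive_positive.
    positive = str((fm.get("success_criteria") or {}).get("conclusive_positive") or "")
    if not any(m.get("name") and m["name"] in positive for m in metrics):
        warnings.append("V8")
    return errors, warnings
```

A writer step could refuse to emit frontmatter when the first list is non-empty and render each entry of the second list as a # WARNING: comment line.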

Prose section ↔ frontmatter mapping

| Prose Section | Frontmatter Field(s) |
|---|---|
| ## Hypothesis (H0/H1 bold labels) | hypothesis_h0, hypothesis_h1, estimand |
| ## Independent Variables table | estimand.contrast, baselines[] |
| ## Dependent Variables (Metrics) table | metrics[] |
| ## Environment | environment |
| ## Analysis Plan | statistical_plan |
| ## Success Criteria | success_criteria |
| ## Experiment Directory Layout | experiment_slug |

Backward compatibility

  • Plans without frontmatter must still be consumable by all downstream skills (scope, review-design, implement-experiment, run-experiment, write-report)
  • Frontmatter is additive — all existing prose sections remain unchanged
  • review-design handles missing frontmatter via targeted LLM extraction fallback (per-field, not whole-plan)
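The per-field fallback described above might look like the following sketch. Everything here is hypothetical: the resolve_fields name, the FIELDS tuple, and the extract callback (standing in for whatever targeted LLM extraction review-design performs) are illustrative assumptions.

```python
from typing import Any, Callable, Optional

FIELDS = ("experiment_type", "estimand", "hypothesis_h0", "hypothesis_h1",
          "metrics", "baselines", "statistical_plan", "environment",
          "success_criteria")


def resolve_fields(
    frontmatter: Optional[dict[str, Any]],
    prose: str,
    extract: Callable[[str, str], Any],
) -> dict[str, Any]:
    """Prefer frontmatter values; call extract(field, prose) only for gaps."""
    fm = frontmatter or {}  # None means the plan has no frontmatter at all
    return {
        name: fm[name] if fm.get(name) is not None else extract(name, prose)
        for name in FIELDS
    }
```

With this shape, a plan that has complete frontmatter never triggers extraction, a legacy plan without frontmatter triggers it once per field, and a partially filled block triggers it only for the missing fields.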

Tests

  • plan-experiment output has valid YAML frontmatter with at least experiment_type, hypothesis_h0, hypothesis_h1, metrics
  • All 8 validation rules apply correctly for each experiment type
  • Plans without frontmatter are handled gracefully by downstream skills
  • plan-experiment accepts and uses optional revision_guidance second argument
  • plan-experiment without second argument works identically to current behavior

Dependencies

Depends on #589 (recipe simplification — plan-experiment needs to accept revision_guidance arg)

    Labels

    recipe:research (Research recipe improvements)
    staged (Implementation staged and waiting for promotion to main)
