TalonT-Org · Trecek · Apr 4, 2026 · Apr 4, 2026 · Apr 4, 2026 · Apr 4, 2026
diff --git a/src/autoskillit/skills_extended/review-design/SKILL.md b/src/autoskillit/skills_extended/review-design/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: review-design
 categories: [research]
-description: Validate an experiment plan before execution. Emits verdict (GO/REVISE/STOP), experiment_type, evaluation_dashboard, and revision_guidance.
+description: Validate an experiment plan before execution using triage-first, fail-fast dimensional analysis with an adversarial red-team. Emits GO/REVISE/STOP verdict.
 hooks:
   PreToolUse:
     - matcher: "*"
@@ -14,13 +14,8 @@ hooks:
 # Review Design Skill
 
 Validate the quality and feasibility of an experiment plan before compute is spent.
-Returns a structured verdict to drive the research recipe's design review loop.
-
-## Arguments
-
-`/autoskillit:review-design {experiment_plan_path}`
-
-- **experiment_plan_path** — Absolute path to the experiment plan file
+Runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents
+and an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict.
 
 ## When to Use
 
@@ -29,24 +24,290 @@ recipe calls this skill after `plan_experiment` to gate execution on a quality c
 This skill is bounded by `retries: 2` — on exhaustion the recipe proceeds with the
 best available plan.
 
+## Arguments
+
+`/autoskillit:review-design {experiment_plan_path}`
+
+- **experiment_plan_path** — Absolute path to the experiment plan file produced by
+  `/autoskillit:plan-experiment`. Scan tokens after the skill name for the first
+  path-like token (starts with `/`, `./`, or `.autoskillit/`).
+
 ## Critical Constraints
 
 **NEVER:**
-- Write output outside `.autoskillit/temp/review-design/`
+- Modify the plan file, any source code, or any file outside `.autoskillit/temp/review-design/`
 - Halt the pipeline for a REVISE verdict — emit the verdict and let the recipe route
+- Proceed to Level 2, 3, or 4 analysis when any Level 1 finding is critical (halt
+  at fail-fast gate)
+- Spawn SILENT (S) dimension subagents — they are not run and not mentioned in output
+- Exit non-zero — GO, REVISE, and STOP are all normal outcomes (exit 0 in all cases)
 
 **ALWAYS:**
-- Emit `verdict = GO`, `verdict = REVISE`, or `verdict = STOP` on the final output line
-- Write `evaluation_dashboard` and `revision_guidance` as absolute paths when present
+- Use `model: "sonnet"` when spawning all subagents via the Task tool
+- Write output to `.autoskillit/temp/review-design/` (relative to the current working directory)
+- After writing output files, emit the **absolute paths** as structured output tokens
+  immediately before `%%ORDER_UP%%`. Resolve relative save paths to absolute by prepending
+  the full CWD:
+    verdict = GO|REVISE|STOP
+    experiment_type = {type}
+    evaluation_dashboard = /absolute/cwd/.autoskillit/temp/review-design/{filename}.md
+    revision_guidance = /absolute/cwd/.autoskillit/temp/review-design/{filename}.md   (REVISE only)
+    %%ORDER_UP%%
+- `revision_guidance` is written and emitted ONLY when verdict = REVISE
+- `evaluation_dashboard` is ALWAYS written and emitted
+- Red-team agent always sets `requires_human: true` on all its findings
+- Halt at Level 1 fail-fast gate: if any Level 1 finding is critical, emit STOP immediately —
+  do NOT proceed to Level 2, 3, or 4
 
-## Output
+## Workflow
+
+### Step 0: Read Plan & Setup
+
+1. Create `.autoskillit/temp/review-design/` if absent.
+2. Extract `experiment_plan_path` from arguments (first path-like token starting with `/`,
+   `./`, or `.autoskillit/`).
+3. Read the plan file. Parse YAML frontmatter using the **backward-compatible two-level
+   fallback**:
+   - **Level 1 (frontmatter)**: Read YAML frontmatter between `---` delimiters directly
+     (zero LLM tokens). Return present fields and note which are missing.
+     Record `source: frontmatter` for each extracted field.
+   - **Level 2 (LLM extraction)**: For each missing field, launch a targeted LLM
+     extraction subagent against the corresponding prose section. All extractions are
+     independent and run in parallel. Record `source: extracted` for each field from
+     this path.
+   - Fields: `experiment_type`, `hypothesis_h0/h1`, `estimand`, `metrics`, `baselines`,
+     `statistical_plan`, `success_criteria`
+   - Missing-field to prose-target mapping:
+
+     | Missing Field | Prose Target | Extraction Prompt |
+     |---|---|---|
+     | experiment_type | Full plan | "Classify: benchmark, configuration_study, causal_inference, robustness_audit, exploratory" |
+     | hypothesis_h0/h1 | ## Hypothesis | "Extract the null/alternative hypothesis" |
+     | estimand | ## Hypothesis + ## Independent Variables | "Extract: treatment, outcome, population, contrast" |
+     | metrics | ## Dependent Variables table | "Extract each row as structured object" |
+     | baselines | ## Independent Variables | "Extract comparators: name, version, tuning" |
+     | statistical_plan | ## Analysis Plan | "Extract: test, alpha, power, correction, sample size" |
+     | success_criteria | ## Success Criteria | "Extract three criteria" |
+
+   This two-level approach ensures backward compatibility with plans that lack frontmatter:
+   the provenance (`source: frontmatter` or `source: extracted`) is tracked for each field
+   and included in the evaluation dashboard.
+
+### Step 1: Triage Dispatcher
+
+Launch one subagent. Receives full plan text plus parsed fields. Returns:
+- `experiment_type`: one of `benchmark | configuration_study | causal_inference |
+  robustness_audit | exploratory`
+- `dimension_weights`: the complete weight matrix for this plan (H/M/L/S per dimension)
+- `secondary_modifiers`: list of active modifiers with their effects on weights
+
+**Triage classification rules (first-match):**
+
+| Rule | Type | Trigger |
+|---|---|---|
+| 1 | benchmark | IVs are system/method names, DVs are performance metrics, multiple comparators |
+| 2 | configuration_study | IVs are numeric parameters of one system, grid/sweep structure |
+| 3 | causal_inference | Causal language ("causes", "effect of"), confounders in threats |
+| 4 | robustness_audit | Tests generalization/stability, deliberately varied conditions |
+| 5 | exploratory | Default — no prior rule fires, or hypothesis absent |
+
+**Secondary modifiers** (additive, increase dimension weights):
+- `+causal`: mechanism claim in non-causal type → causal_structure weight +1 tier
+- `+high_cost`: resources > 4 GPU-hours → resource_proportionality L→M
+- `+deployment`: motivation references production/users → ecological_validity floor = M
+- `+multi_metric`: ≥3 DVs → statistical_corrections weight +1 tier
+
+**Full dimension-to-weight matrix** (W = weight per experiment type):
+
+| Dimension | benchmark | config_study | causal_inf | robust_audit | exploratory |
+|---|---|---|---|---|---|
+| causal_structure | S | S | H | M | L |
+| variance_protocol | H | H | L | M | L |
+| statistical_corrections | M | H | H | S | S |
+| ecological_validity | M | L | L | H | M |
+| measurement_alignment | M | M | M | H | M |
+| resource_proportionality | L | L | L | L | L |
+
+Weight tiers: H (High), M (Medium), L (Low), S (SILENT — dimension not spawned, not mentioned).
+
+### Step 2: Level 1 Analysis — Fail-Fast (parallel)
+
+Two subagents run in parallel. Both are always H-weight regardless of triage output.
+
+- `estimand_clarity` agent: "Can the claim be written as a formal contrast (A vs B on Y in Z)?"
+  Reference the exp-lens-estimand-clarity philosophical mode as guidance (do NOT invoke
+  the skill — reference its lens question only in the subagent prompt).
+- `hypothesis_falsifiability` agent: "What result would cause the author to conclude H0?"
+
+Each subagent returns findings in the standard JSON structure (see Finding Format below).
+
+**FAIL-FAST GATE**: After both Level 1 subagents complete, check for critical findings.
+If ANY Level 1 subagent returns a finding with `"severity": "critical"`:
+- Collect these as `stop_triggers`
+- Do not proceed to Level 2, 3, or 4 analysis
+- Do not start the red-team agent
+- Skip directly to Step 7 (Synthesis) with only L1 findings
+- The verdict logic will produce STOP; halt and emit tokens
+
+### Step 3: Level 2 + Red-Team (concurrent)
+
+When the L1 gate passes (no critical L1 findings), launch 2–3 Level 2 subagents AND the
+red-team agent concurrently — all at the same time without waiting for each other.
+
+**Level 2 subagents** (parallel, weights from the matrix):
+- `baseline_fairness`: "Are all compared systems given symmetric resources and tuning effort?"
+- `causal_structure`: weight from matrix (S for benchmark/config_study, H for causal_inference).
+  Only spawn when weight ≥ L.
+- `unit_interference`: "Can treatments spill over between experimental units?"
 
-Emit on the final line of output:
+**Red-team agent** (concurrent with L2 and L4 — does NOT block L3):
+- Receives full plan text and `experiment_type`
+- Five universal challenges (challenge every plan regardless of type):
+  1. **Goodhart exploitation** — cheapest way to score well without solving the research question
+  2. **Data leakage** — test-set info contaminating training/hyperparameter selection
+  3. **Asymmetric tuning** — proposed method tuned against eval while baselines use defaults
+  4. **Survivorship bias** — cherry-picking best run from multiple seeds
+  5. **Evaluation collision** — same infrastructure in both treatment and measurement
+- Type-specific focus per experiment type:
+  - benchmark → asymmetric effort
+  - configuration_study → overfitting to held-out set
+  - causal_inference → unblocked backdoor path
+  - robustness_audit → unrealistic threat distribution
+  - exploratory → HARKing vulnerability
+- ALL red-team findings must set `"requires_human": true` and `"dimension": "red_team"`
+
+### Step 4: Level 3 (parallel)
+
+Run after Level 2 completes (Level 2 findings may inform statistical planning context).
+Do not wait for the red-team agent before starting Level 3.
+
+Three subagents run in parallel:
+- `error_budget`: "Is power analysis present? Are error rates (Type I / Type II) acknowledged?"
+- `statistical_corrections`: "Are multiple comparisons corrections pre-specified for all DVs?"
+- `variance_protocol`: "Are seeds fixed? Is run-to-run variance addressed?"
+  NOTE: absent seeds IS a valid finding for this dimension at H-weight — do not suppress
+  via foothold validation.
+
+### Step 5: Level 4 (parallel, gated by triage)
+
+2–4 subagents. Only spawn subagents for dimensions with weight ≥ L in the matrix.
+SILENT (S) dimensions are NOT spawned and NOT mentioned in output.
+
+Level 4 dimensions (spawn when not SILENT):
+- `benchmark_representativeness`: "Does this generalize beyond the specific test bed?"
+- `ecological_validity`: "Do test conditions match the intended deployment context?"
+- `measurement_alignment`: "Do the metrics actually measure what the research question claims?"
+- `reproducibility_spec`: "Could an independent party reproduce this experiment?"
+
+Level 3 and Level 4 may run concurrently with the red-team agent (do not block on red-team).
+
+**Three-layer silencing** (prevents orphan warnings):
+1. **Static SILENT** from matrix — dimensions not spawned, not mentioned in output.
+2. **Foothold validation** — before spawning M/L dimensions, check plan text has relevant
+   content. If absent: M→L, L→S. Exception: absent seeds IS a finding for
+   `variance_protocol` at H-weight.
+3. **Finding-count suppression** — L-weight dimensions with zero findings: omit from output
+   entirely. H/M dimensions with zero findings: emit "No issues identified" (deliberate
+   clean bill of health).
+
+### Step 6: Wait for Red-Team
+
+After Levels 3 and 4 complete, wait for the red-team agent if still running.
+All red-team findings are merged into the finding pool with their
+`"requires_human": true` flag preserved.
+
+### Step 7: Synthesis
+
+One synthesis pass (no subagent — orchestrator synthesizes directly):
+
+1. **Merge all findings** from L1, L2, L3, L4, and red-team into a single list.
+2. **Deduplicate** by `(dimension, section, message)` — identical findings from parallel
+   agents are collapsed into one entry.
+3. **Apply verdict logic**:
+   ```python
+   stop_triggers = [f for f in critical if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
+   stop_triggers += [f for f in critical if f.dimension == "red_team"]
+
+   if stop_triggers:
+       verdict = "STOP"
+   elif critical_findings or len(warning_findings) >= 3:
+       verdict = "REVISE"
+   else:
+       verdict = "GO"
+   ```
+4. **Write `evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md`** — always written.
+   Must include:
+   - Verdict banner and classification summary
+   - Dimension scorecard table (dimension → weight → findings count → severity summary)
+   - Adversarial findings section (red-team findings, each marked `requires_human: true`)
+   - **Cannot Assess** section with at least 2 items (dimensions where evaluation was
+     impossible due to absent plan content; minimum 2 entries, e.g.,
+     "Randomization mechanism not described — cannot assess unit interference risk",
+     "No resource budgets stated — cannot assess resource_proportionality")
+   - Mechanizable check log (binary checks that could be automated in future)
+   - Machine-readable YAML summary block at end:
+     ```yaml
+     # --- review-design machine summary ---
+     verdict: GO|REVISE|STOP
+     experiment_type: {type}
+     critical_count: {n}
+     warning_count: {n}
+     red_team_count: {n}
+     ```
+   The YAML summary block enables downstream recipe steps to parse verdict and counts
+   without re-reading prose.
+5. **Write `revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md`** — written ONLY when
+   verdict = REVISE. Must include:
+   - Required revisions: critical findings with concrete, actionable fix descriptions
+   - Recommended revisions: warning findings
+   - Red-team findings with mitigation options
+   The `revision_guidance` path is passed back to `plan-experiment` in the recipe loop.
+
+### Step 8: Emit Output Tokens
+
+Emit these lines immediately before `%%ORDER_UP%%`:
 
 ```
-verdict = {GO|REVISE|STOP}
-experiment_type = {string}
-evaluation_dashboard = {absolute_path}
-revision_guidance = {absolute_path}
+verdict = GO|REVISE|STOP
+experiment_type = {experiment_type}
+evaluation_dashboard = /absolute/path/.autoskillit/temp/review-design/evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md
+revision_guidance = /absolute/path/.autoskillit/temp/review-design/revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md
 %%ORDER_UP%%
 ```
+
+`revision_guidance` line is emitted ONLY when verdict = REVISE. When verdict is GO or STOP,
+omit the `revision_guidance` line entirely.
+
+## Finding Format
+
+All subagents must return findings in this JSON structure:
+
+```json
+{
+  "section": "## Hypothesis",
+  "dimension": "estimand_clarity",
+  "level": 1,
+  "severity": "critical | warning | info",
+  "message": "{clear, actionable description of the issue}",
+  "requires_human": false
+}
+```
+
+Red-team findings: always `"requires_human": true`, `"dimension": "red_team"`.
+
+## Output
+
+```
+.autoskillit/temp/review-design/
+├── evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md   (always written)
+└── revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md      (REVISE only)
+```
+
+Emit structured output tokens (absolute paths) immediately before `%%ORDER_UP%%`.
+
+## Related Skills
+
+- `/autoskillit:plan-experiment` — produces the plan this skill validates
+- `/autoskillit:scope` — first step in the research recipe chain
+- `/autoskillit:implement-experiment` — consumes this skill's GO output (via recipe routing)
+- exp-lens skills — philosophical modes referenced in dimension prompts (not invoked directly)