diff --git a/src/autoskillit/skills_extended/review-design/SKILL.md b/src/autoskillit/skills_extended/review-design/SKILL.md
index 1d7c54e7..9dd675e7 100644
--- a/src/autoskillit/skills_extended/review-design/SKILL.md
+++ b/src/autoskillit/skills_extended/review-design/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: review-design
 categories: [research]
-description: Validate an experiment plan before execution. Emits verdict (GO/REVISE/STOP), experiment_type, evaluation_dashboard, and revision_guidance.
+description: Validate an experiment plan before execution using triage-first, fail-fast dimensional analysis with an adversarial red-team. Emits verdict (GO/REVISE/STOP), experiment_type, evaluation_dashboard, and revision_guidance.
 hooks:
   PreToolUse:
     - matcher: "*"
@@ -14,13 +14,8 @@ hooks:
 # Review Design Skill
 
 Validate the quality and feasibility of an experiment plan before compute is spent.
-Returns a structured verdict to drive the research recipe's design review loop.
-
-## Arguments
-
-`/autoskillit:review-design {experiment_plan_path}`
-
-- **experiment_plan_path** — Absolute path to the experiment plan file
+Runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents
+and an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict.
 
 ## When to Use
 
@@ -29,24 +24,310 @@ recipe calls this skill after `plan_experiment` to gate execution on a quality c
 This skill is bounded by `retries: 2` — on exhaustion the recipe proceeds with the
 best available plan.
 
+## Arguments
+
+`/autoskillit:review-design {experiment_plan_path}`
+
+- **experiment_plan_path** — Absolute path to the experiment plan file produced by
+  `/autoskillit:plan-experiment`. Scan tokens after the skill name for the first
+  path-like token (starts with `/`, `./`, or `.autoskillit/`).
+
 ## Critical Constraints
 
 **NEVER:**
-- Write output outside `.autoskillit/temp/review-design/`
+- Modify the plan file, any source code, or any file outside `.autoskillit/temp/review-design/`
 - Halt the pipeline for a REVISE verdict — emit the verdict and let the recipe route
+- Proceed to Level 2, 3, or 4 analysis when any Level 1 finding is critical (halt
+  at fail-fast gate)
+- Spawn SILENT (S) dimension subagents — they are not run and not mentioned in output
+- Exit non-zero — GO, REVISE, and STOP are all normal outcomes (exit 0 in all cases)
 
 **ALWAYS:**
-- Emit `verdict = GO`, `verdict = REVISE`, or `verdict = STOP` on the final output line
-- Write `evaluation_dashboard` and `revision_guidance` as absolute paths when present
+- Use `model: "sonnet"` when spawning all subagents via the Task tool
+- Write output to `.autoskillit/temp/review-design/` (relative to the current working directory)
+- After writing output files, emit the **absolute paths** as structured output tokens
+  immediately before `%%ORDER_UP%%`. Resolve relative save paths to absolute by prepending
+  the full CWD:
+    verdict = GO|REVISE|STOP
+    experiment_type = {type}
+    evaluation_dashboard = /absolute/cwd/.autoskillit/temp/review-design/{filename}.md
+    revision_guidance = /absolute/cwd/.autoskillit/temp/review-design/{filename}.md   (REVISE only)
+    %%ORDER_UP%%
+- `revision_guidance` is written and emitted ONLY when verdict = REVISE
+- `evaluation_dashboard` is ALWAYS written and emitted
+- Red-team agent always sets `requires_decision: true` on all its findings
+- Halt at Level 1 fail-fast gate: if any Level 1 finding is critical, emit STOP immediately —
+  do NOT proceed to Level 2, 3, or 4
 
-## Output
+## Workflow
+
+### Step 0: Read Plan & Setup
+
+1. Create `.autoskillit/temp/review-design/` if absent.
+2. Extract `experiment_plan_path` from arguments (first path-like token starting with `/`,
+   `./`, or `.autoskillit/`).
+   **Error handling:** If no path-like token is found in the arguments, emit
+   `verdict = STOP` with message "No experiment_plan_path provided" and exit 0 (per
+   the NEVER exit-non-zero constraint).
+3. Read the plan file.
+   **Error handling:** If the file does not exist or is unreadable at the resolved path,
+   emit `verdict = STOP` with message "Plan file not found: {path}" and exit 0.
+   Parse YAML frontmatter using the **backward-compatible two-level fallback**:
+   - **Level 1 (frontmatter)**: Read YAML frontmatter between `---` delimiters directly
+     (zero LLM tokens). Return present fields and note which are missing.
+     Record `source: frontmatter` for each extracted field.
+     **Error handling:** If the YAML between `---` delimiters is malformed, treat all
+     fields as missing and fall through to Level 2 for all fields (graceful degradation).
+   - **Level 2 (LLM extraction)**: For each missing field, launch a targeted LLM
+     extraction subagent against the corresponding prose section. All extractions are
+     independent and run in parallel. Record `source: extracted` for each field from
+     this path.
+   - Fields: `experiment_type`, `hypothesis_h0/h1`, `estimand`, `metrics`, `baselines`,
+     `statistical_plan`, `success_criteria`
+   - Missing-field to prose-target mapping:
+
+     | Missing Field | Prose Target | Extraction Prompt |
+     |---|---|---|
+     | experiment_type | Full plan | "Classify: benchmark, configuration_study, causal_inference, robustness_audit, exploratory" |
+     | hypothesis_h0/h1 | ## Hypothesis | "Extract the null/alternative hypothesis" |
+     | estimand | ## Hypothesis + ## Independent Variables | "Extract: treatment, outcome, population, contrast" |
+     | metrics | ## Dependent Variables table | "Extract each row as structured object" |
+     | baselines | ## Independent Variables | "Extract comparators: name, version, tuning" |
+     | statistical_plan | ## Analysis Plan | "Extract: test, alpha, power, correction, sample size" |
+     | success_criteria | ## Success Criteria | "Extract three criteria" |
+
+   This two-level approach ensures backward compatibility with plans that lack frontmatter:
+   the provenance (`source: frontmatter` or `source: extracted`) is tracked for each field
+   and included in the evaluation dashboard.
+
+### Step 1: Triage Dispatcher
+
+Launch one subagent. Receives full plan text plus parsed fields. Returns:
+- `experiment_type`: one of `benchmark | configuration_study | causal_inference |
+  robustness_audit | exploratory`
+- `dimension_weights`: the complete weight matrix for this plan (H/M/L/S per dimension)
+- `secondary_modifiers`: list of active modifiers with their effects on weights
+
+**Schema validation:** After the subagent returns, verify that `experiment_type` is one of
+the five enumerated values above. If the returned value is unrecognized, default to
+`exploratory` and log a warning — do not silently pass an invalid type into the weight
+matrix lookup, as this would corrupt all subsequent spawning decisions.
+
+**Triage classification rules (first-match):**
+
+| Rule | Type | Trigger |
+|---|---|---|
+| 1 | benchmark | IVs are system/method names, DVs are performance metrics, multiple comparators |
+| 2 | configuration_study | IVs are numeric parameters of one system, grid/sweep structure |
+| 3 | causal_inference | Causal language ("causes", "effect of"), confounders in threats |
+| 4 | robustness_audit | Tests generalization/stability, deliberately varied conditions |
+| 5 | exploratory | Default — no prior rule fires, or hypothesis absent |
+
+**Secondary modifiers** (additive, increase dimension weights):
+- `+causal`: mechanism claim in non-causal type → causal_structure weight +1 tier
+- `+high_cost`: resources > 4 GPU-hours → resource_proportionality L→M
+- `+deployment`: motivation references production/users → ecological_validity floor = M
+- `+multi_metric`: ≥3 DVs → statistical_corrections weight +1 tier
+
+**Full dimension-to-weight matrix** (W = weight per experiment type):
+
+| Dimension | benchmark | config_study | causal_inf | robust_audit | exploratory |
+|---|---|---|---|---|---|
+| causal_structure | S | S | H | M | L |
+| variance_protocol | H | H | L | M | L |
+| statistical_corrections | M | H | H | S | S |
+| ecological_validity | M | L | L | H | M |
+| measurement_alignment | M | M | M | H | M |
+| resource_proportionality | L | L | L | L | L |
+
+Weight tiers: H (High), M (Medium), L (Low), S (SILENT — dimension not spawned, not mentioned).
+
+### Step 2: Level 1 Analysis — Fail-Fast (parallel)
+
+Two subagents run in parallel. Both are always H-weight regardless of triage output.
+
+- `estimand_clarity` agent: "Can the claim be written as a formal contrast (A vs B on Y in Z)?"
+  Reference the exp-lens-estimand-clarity philosophical mode as guidance (do NOT invoke
+  the skill — reference its lens question only in the subagent prompt).
+- `hypothesis_falsifiability` agent: "What result would cause the author to conclude H0?"
+
+Each subagent returns findings in the standard JSON structure (see Finding Format below).
+
+**FAIL-FAST GATE**: After both Level 1 subagents complete, check for critical findings.
+If ANY Level 1 subagent returns a finding with `"severity": "critical"`:
+- Collect these as `stop_triggers`
+- Do not proceed to Level 2, 3, or 4 analysis
+- Do not start the red-team agent
+- Skip directly to Step 7 (Synthesis) with only L1 findings
+- The verdict logic will produce STOP; halt and emit tokens
+
+**Subagent parse failure:** If a Level 1 subagent returns unparseable output (malformed
+JSON, empty response, token-limit truncation), treat it as if it returned one critical
+finding with `message: "L1 subagent did not return parseable findings"`. This ensures
+parse failures trigger the fail-fast gate rather than silently passing it.
+
+### Step 3: Level 2 + Red-Team (concurrent)
+
+When the L1 gate passes (no critical L1 findings), launch 2–3 Level 2 subagents AND the
+red-team agent concurrently — all at the same time without waiting for each other.
 
-Emit on the final line of output:
+**Level 2 subagents** (parallel, weights from the matrix):
+- `baseline_fairness`: "Are all compared systems given symmetric resources and tuning effort?"
+- `causal_structure`: weight from matrix (S for benchmark/config_study, H for causal_inference).
+  Only spawn when weight ≥ L.
+- `unit_interference`: "Can treatments spill over between experimental units?"
+
+**Red-team agent** (concurrent with L2 and L4 — does NOT block L3):
+- Receives full plan text and `experiment_type`
+- Five universal challenges (challenge every plan regardless of type):
+  1. **Goodhart exploitation** — cheapest way to score well without solving the research question
+  2. **Data leakage** — test-set info contaminating training/hyperparameter selection
+  3. **Asymmetric tuning** — proposed method tuned against eval while baselines use defaults
+  4. **Survivorship bias** — cherry-picking best run from multiple seeds
+  5. **Evaluation collision** — same infrastructure in both treatment and measurement
+- Type-specific focus per experiment type:
+  - benchmark → asymmetric effort
+  - configuration_study → overfitting to held-out set
+  - causal_inference → unblocked backdoor path
+  - robustness_audit → unrealistic threat distribution
+  - exploratory → HARKing vulnerability
+- ALL red-team findings must set `"requires_decision": true` and `"dimension": "red_team"`
+
+### Step 4: Level 3 (parallel)
+
+Run after Level 2 completes (Level 2 findings may inform statistical planning context).
+Do not wait for the red-team agent before starting Level 3.
+
+Three subagents run in parallel:
+- `error_budget`: "Is power analysis present? Are error rates (Type I / Type II) acknowledged?"
+- `statistical_corrections`: "Are multiple comparisons corrections pre-specified for all DVs?"
+- `variance_protocol`: "Are seeds fixed? Is run-to-run variance addressed?"
+  NOTE: absent seeds IS a valid finding for this dimension at H-weight — do not suppress
+  via foothold validation.
+
+### Step 5: Level 4 (parallel, gated by triage)
+
+2–4 subagents. Only spawn subagents for dimensions with weight ≥ L in the matrix.
+SILENT (S) dimensions are NOT spawned and NOT mentioned in output.
+
+Level 4 dimensions (spawn when not SILENT):
+- `benchmark_representativeness`: "Does this generalize beyond the specific test bed?"
+- `ecological_validity`: "Do test conditions match the intended deployment context?"
+- `measurement_alignment`: "Do the metrics actually measure what the research question claims?"
+- `reproducibility_spec`: "Could an independent party reproduce this experiment?"
+
+Level 3 and Level 4 may run concurrently with the red-team agent (do not block on red-team).
+
+**Three-layer silencing** (prevents orphan warnings):
+1. **Static SILENT** from matrix — dimensions not spawned, not mentioned in output.
+2. **Foothold validation** — before spawning M/L dimensions, check plan text has relevant
+   content. If absent: M→L, L→S. Exception: absent seeds IS a finding for
+   `variance_protocol` at H-weight.
+3. **Finding-count suppression** — L-weight dimensions with zero findings: omit from output
+   entirely. H/M dimensions with zero findings: emit "No issues identified" (deliberate
+   clean bill of health).
+
+### Step 6: Wait for Red-Team
+
+After Levels 3 and 4 complete, wait for the red-team agent if still running.
+All red-team findings are merged into the finding pool with their
+`"requires_decision": true` flag preserved.
+
+### Step 7: Synthesis
+
+One synthesis pass (no subagent — orchestrator synthesizes directly):
+
+1. **Merge all findings** from L1, L2, L3, L4, and red-team into a single list.
+2. **Deduplicate** by `(dimension, section, message)` — identical findings from parallel
+   agents are collapsed into one entry.
+3. **Apply verdict logic**:
+   ```python
+   # L1 fail-fast path: structural defects that block all further analysis
+   stop_triggers = [f for f in critical if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
+   # Red-team STOP path: adversarial critical findings after full analysis (L2–L4)
+   # These fire only when the L1 gate passed; any critical red_team finding is a STOP.
+   stop_triggers += [f for f in critical if f.dimension == "red_team"]
+
+   if stop_triggers:
+       verdict = "STOP"
+   elif critical_findings or len(warning_findings) >= 3:
+       verdict = "REVISE"
+   else:
+       verdict = "GO"
+   ```
+4. **Write `evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md`** — always written.
+   Must include:
+   - Verdict banner and classification summary
+   - Dimension scorecard table (dimension → weight → findings count → severity summary)
+   - Adversarial findings section (red-team findings, each marked `requires_decision: true`)
+   - **Cannot Assess** section with at least 2 items (dimensions where evaluation was
+     impossible due to absent plan content; minimum 2 entries, e.g.,
+     "Randomization mechanism not described — cannot assess unit interference risk",
+     "No resource budgets stated — cannot assess resource_proportionality")
+   - Mechanizable check log (binary checks that could be automated in future)
+   - Machine-readable YAML summary block at end:
+     ```yaml
+     # --- review-design machine summary ---
+     verdict: GO|REVISE|STOP
+     experiment_type: {type}
+     critical_count: {n}
+     warning_count: {n}
+     red_team_count: {n}
+     ```
+   The YAML summary block enables downstream recipe steps to parse verdict and counts
+   without re-reading prose.
+5. **Write `revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md`** — written ONLY when
+   verdict = REVISE. Must include:
+   - Required revisions: critical findings with concrete, actionable fix descriptions
+   - Recommended revisions: warning findings
+   - Red-team findings with mitigation options
+   The `revision_guidance` path is passed back to `plan-experiment` in the recipe loop.
+
+### Step 8: Emit Output Tokens
+
+Emit these lines immediately before `%%ORDER_UP%%`:
 
 ```
-verdict = {GO|REVISE|STOP}
-experiment_type = {string}
-evaluation_dashboard = {absolute_path}
-revision_guidance = {absolute_path}
+verdict = GO|REVISE|STOP
+experiment_type = {experiment_type}
+evaluation_dashboard = /absolute/path/.autoskillit/temp/review-design/evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md
+revision_guidance = /absolute/path/.autoskillit/temp/review-design/revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md
 %%ORDER_UP%%
 ```
+
+`revision_guidance` line is emitted ONLY when verdict = REVISE. When verdict is GO or STOP,
+omit the `revision_guidance` line entirely.
+
+## Finding Format
+
+All subagents must return findings in this JSON structure:
+
+```json
+{
+  "section": "## Hypothesis",
+  "dimension": "estimand_clarity",
+  "level": 1,
+  "severity": "critical | warning | info",
+  "message": "{clear, actionable description of the issue}",
+  "requires_decision": false
+}
+```
+
+Red-team findings: always `"requires_decision": true`, `"dimension": "red_team"`.
+
+## Output
+
+```
+.autoskillit/temp/review-design/
+├── evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md   (always written)
+└── revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md      (REVISE only)
+```
+
+Emit structured output tokens (absolute paths) immediately before `%%ORDER_UP%%`.
+
+## Related Skills
+
+- `/autoskillit:plan-experiment` — produces the plan this skill validates
+- `/autoskillit:scope` — first step in the research recipe chain
+- `/autoskillit:implement-experiment` — consumes this skill's GO output (via recipe routing)
+- exp-lens skills — philosophical modes referenced in dimension prompts (not invoked directly)
diff --git a/tests/skills/test_review_design_contracts.py b/tests/skills/test_review_design_contracts.py
new file mode 100644
index 00000000..75726dac
--- /dev/null
+++ b/tests/skills/test_review_design_contracts.py
@@ -0,0 +1,183 @@
+"""Contract tests for review-design SKILL.md behavioral encoding."""
+
+from pathlib import Path
+
+import pytest
+
+SKILL_MD = Path(__file__).parents[2] / "src/autoskillit/skills_extended/review-design/SKILL.md"
+
+
+@pytest.fixture(scope="module")
+def skill_text() -> str:
+    return SKILL_MD.read_text()
+
+
+# ── Triage classification ──────────────────────────────────────────────────
+
+
+def test_triage_lists_all_five_experiment_types(skill_text):
+    """All five first-match triage classes must be named in the SKILL.md."""
+    for name in [
+        "benchmark",
+        "configuration_study",
+        "causal_inference",
+        "robustness_audit",
+        "exploratory",
+    ]:
+        assert name in skill_text, f"Missing triage type: {name!r}"
+
+
+# ── Dimension weight matrix ────────────────────────────────────────────────
+
+
+def test_dimension_weight_tiers_defined(skill_text):
+    """H/M/L/S weight tiers must be defined in the SKILL.md."""
+    for tier in ["H", "M", "L", "S"]:
+        assert f"weight={tier}" in skill_text or f"| {tier} " in skill_text, (
+            f"Weight tier {tier!r} not found in weight matrix table or explicit weight= notation"
+        )
+
+
+def test_silent_tier_produces_no_output_contract(skill_text):
+    """SILENT (S) dimensions must be explicitly contracted to produce no output."""
+    assert "SILENT" in skill_text, "SILENT tier label not found in SKILL.md"
+    assert "not spawned" in skill_text.lower(), (
+        "Behavioral contract 'not spawned' not found — SILENT dimensions must be "
+        "explicitly documented as non-spawning"
+    )
+
+
+def test_universal_dimensions_always_run(skill_text):
+    """estimand_clarity and hypothesis_falsifiability must be listed as always-run L1."""
+    assert "estimand_clarity" in skill_text
+    assert "hypothesis_falsifiability" in skill_text
+    # Both must be documented as always H-weight (not gated by triage like L4 dimensions)
+    assert "always H-weight" in skill_text, (
+        "L1 dimensions must be explicitly documented as always H-weight regardless of triage"
+    )
+
+
+# ── Fail-fast gate ──────────────────────────────────────────────────────────
+
+
+def test_l1_fail_fast_gate_present(skill_text):
+    """SKILL.md must encode the L1 fail-fast gate: halt on L1 critical."""
+    text_lower = skill_text.lower()
+    assert "fail-fast" in text_lower or "fail fast" in text_lower, (
+        "L1 fail-fast gate not found in SKILL.md"
+    )
+    # Must assert that L2+ do NOT run when L1 is critical — use text_lower only
+    assert "do not proceed" in text_lower or "halt" in text_lower, (
+        "Fail-fast gate must document that L2+ analysis does not run on L1 critical findings"
+    )
+
+
+# ── Red-team agent ──────────────────────────────────────────────────────────
+
+
+def test_red_team_requires_decision_contract(skill_text):
+    """Red-team findings must always carry requires_decision: true (project-wide convention)."""
+    assert "requires_decision" in skill_text
+    # The contract must state true, not just mention the field
+    assert "requires_decision: true" in skill_text or '"requires_decision": true' in skill_text
+
+
+def test_red_team_universal_challenges_present(skill_text):
+    """All five universal red-team challenges must be named."""
+    for challenge in [
+        "Goodhart",
+        "leakage",
+        "tuning",
+        "Survivorship",
+        "collision",
+    ]:
+        assert challenge.lower() in skill_text.lower(), (
+            f"Red-team challenge {challenge!r} not found in SKILL.md"
+        )
+
+
+# ── Backward-compatible parsing ─────────────────────────────────────────────
+
+
+def test_frontmatter_fallback_documented(skill_text):
+    """SKILL.md must document the two-level frontmatter parsing fallback."""
+    assert "frontmatter" in skill_text.lower()
+    assert "LLM" in skill_text or "extraction" in skill_text.lower()
+    assert "source: frontmatter" in skill_text or "provenance" in skill_text.lower()
+
+
+# ── Verdict logic ────────────────────────────────────────────────────────────
+
+
+def test_verdict_logic_all_three_outcomes(skill_text):
+    """Verdict logic must produce GO, REVISE, and STOP outcomes as explicit assignments."""
+    for verdict in ["GO", "REVISE", "STOP"]:
+        assert f'verdict = "{verdict}"' in skill_text, (
+            f"Verdict assignment 'verdict = \"{verdict}\"' not found in verdict logic code block"
+        )
+
+
+def test_verdict_stop_on_l1_critical(skill_text):
+    """STOP must be causally linked to L1 dimensions via the stop_triggers code block."""
+    # The stop_triggers list must explicitly name the L1 dimensions as triggers
+    assert "stop_triggers" in skill_text, "stop_triggers code block not found"
+    # Verify causal linkage: stop_triggers assignment must reference both L1 dimensions
+    stop_trigger_line = 'f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}'
+    assert stop_trigger_line in skill_text, (
+        "stop_triggers must explicitly name estimand_clarity and hypothesis_falsifiability "
+        "as the L1 STOP-triggering dimensions"
+    )
+
+
+def test_verdict_revise_threshold_defined(skill_text):
+    """REVISE threshold (≥3 warnings or any non-L1 critical) must be present."""
+    assert "REVISE" in skill_text
+    # Assert the precise threshold expression, not just the digit 3
+    assert ">= 3" in skill_text or "len(warning_findings) >= 3" in skill_text, (
+        "REVISE threshold must be encoded as '>= 3' or 'len(warning_findings) >= 3'"
+    )
+
+
+# ── Dashboard requirements ───────────────────────────────────────────────────
+
+
+def test_dashboard_cannot_assess_section(skill_text):
+    """evaluation_dashboard must include a 'Cannot Assess' section with ≥2 items."""
+    # Assert the coupled phrase that ties the section to its minimum count requirement
+    assert "Cannot Assess" in skill_text
+    assert "Cannot Assess** section with at least 2" in skill_text, (
+        "Cannot Assess section must be documented with its minimum count of 2 items"
+    )
+
+
+def test_dashboard_yaml_summary_block(skill_text):
+    """evaluation_dashboard must include a machine-readable YAML summary block."""
+    assert "# --- review-design machine summary ---" in skill_text, (
+        "Machine-readable YAML summary block header not found in SKILL.md"
+    )
+
+
+# ── Output token format ──────────────────────────────────────────────────────
+
+
+def test_output_tokens_all_four_present(skill_text):
+    """All four output tokens must be named in the SKILL.md."""
+    for token in ["verdict", "experiment_type", "evaluation_dashboard", "revision_guidance"]:
+        assert token in skill_text, f"Output token {token!r} not found"
+
+
+def test_revision_guidance_only_on_revise(skill_text):
+    """revision_guidance must be documented as written only when verdict=REVISE."""
+    assert "revision_guidance" in skill_text
+    assert "REVISE" in skill_text
+    # The file must couple revision_guidance to REVISE condition
+    lines_with_guidance = [line for line in skill_text.splitlines() if "revision_guidance" in line]
+    combined = "\n".join(lines_with_guidance)
+    assert "REVISE" in combined or "revise" in combined.lower(), (
+        "revision_guidance must be tied to REVISE verdict in its description"
+    )
+
+
+def test_order_up_terminator_present(skill_text):
+    """%%ORDER_UP%% must be the final terminal marker after token emission."""
+    assert "%%ORDER_UP%%" in skill_text
diff --git a/tests/skills/test_skill_output_compliance.py b/tests/skills/test_skill_output_compliance.py
index 545b5dc3..9aa017a8 100644
--- a/tests/skills/test_skill_output_compliance.py
+++ b/tests/skills/test_skill_output_compliance.py
@@ -284,6 +284,7 @@ def test_resolve_failures_skill_switches_code_index_to_worktree():
     "arch-lens-scenarios": ["diagram_path"],
     "arch-lens-state-lifecycle": ["diagram_path"],
     "arch-lens-deployment": ["diagram_path"],
+    "review-design": ["evaluation_dashboard", "revision_guidance"],
 }
 
 ABSOLUTE_PATH_KEYWORDS = ("absolute", "/abs", "$(pwd)", "$(cd")