Implementation Plan: Create review-design Skill#598
Merged
Trecek merged 4 commits intointegrationfrom Apr 4, 2026
Merged
Conversation
Replace minimal stub SKILL.md with complete implementation encoding triage-first, fail-fast multi-level dimensional analysis, adversarial red-team, GO/REVISE/STOP verdict logic, and evaluation dashboard spec. Add test_review_design_contracts.py (17 contract tests) and register review-design in PATH_CAPTURE_SKILLS for output compliance testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Trecek
commented
Apr 4, 2026
Collaborator
Author
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit PR Review — Verdict: changes_requested
Found 4 critical + 15 warning findings. Inline comments attached.
Trecek
commented
Apr 4, 2026
Collaborator
Author
Trecek
left a comment
There was a problem hiding this comment.
AutoSkillit review found 4 critical + 15 warning findings across arch, tests, defense, bugs, cohesion, and slop dimensions. See inline comments for details.
Critical issues:
requires_humanfield in Finding Format and red-team instructions diverges from project-widerequires_decisionconvention (review-pr, review-research-pr userequires_decision) — two critical cohesion violations at SKILL.md L177 and L292.test_dimension_weight_tiers_definedis vacuously satisfied by single-letter fallback — test provides zero signal.test_verdict_stop_on_l1_criticalis entirely subsumed by other tests — does not verify causal linkage.
Routing to resolve-review for automated remediation.
…ide cohesion Aligned requires_human → requires_decision throughout review-design SKILL.md and test to match the convention in review-pr and review-research-pr. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracts in review-design - Document STOP emission when experiment_plan_path is absent or file is missing - Document graceful YAML parse error degradation to Level 2 extraction - Document triage dispatcher schema validation with exploratory fallback - Document fail-fast gate behavior for unparseable L1 subagent responses - Clarify red_team STOP pathway in verdict logic comments - Fix description to enumerate all four output tokens Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ntracts.py - Remove vacuous `tier in skill_text` fallback from test_dimension_weight_tiers_defined - Remove weak `S (` fallback; assert behavioral contract via 'not spawned' in test_silent_tier - Add 'always H-weight' assertion to test_universal_dimensions_always_run - Remove trivially-satisfied 'stop' case-sensitive fallback from test_l1_fail_fast_gate_present - Assert specific verdict assignment syntax in test_verdict_logic_all_three_outcomes - Replace subsumed test_verdict_stop_on_l1_critical with causal stop_triggers linkage check - Assert '>= 3' expression in test_verdict_revise_threshold_defined - Assert coupled 'Cannot Assess section with at least 2' phrase in test_dashboard_cannot_assess_section - Assert specific YAML block header in test_dashboard_yaml_summary_block Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the minimal stub at
src/autoskillit/skills_extended/review-design/SKILL.mdwitha complete implementation of the automated experiment design validation skill. The skill
runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents and
an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict. Two supporting test
files are added: a static contract test file for the skill and an update to
PATH_CAPTURE_SKILLSin the existing output-compliance test.Architecture Impact
Process Flow Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; START([START]) STOP_END([STOP — emit tokens]) END([END]) subgraph Step0 ["● Step 0: Parse & Setup"] ReadPlan["● Read Plan<br/>━━━━━━━━━━<br/>YAML frontmatter<br/>+ LLM fallback extraction"] TempDir["● mkdir<br/>━━━━━━━━━━<br/>.autoskillit/temp/review-design/"] end subgraph Step1 ["● Step 1: Triage Dispatcher"] Triage["● Triage Subagent<br/>━━━━━━━━━━<br/>→ experiment_type<br/>→ dimension_weights<br/>→ secondary_modifiers"] end subgraph Step2 ["● Level 1 — Fail-Fast (parallel)"] direction LR EstimandAgent["● estimand_clarity<br/>━━━━━━━━━━<br/>H-weight always<br/>formal contrast check"] FalsifAgent["● hypothesis_falsifiability<br/>━━━━━━━━━━<br/>H-weight always<br/>falsification check"] end L1Gate{"● L1 FAIL-FAST<br/>━━━━━━━━━━<br/>any L1<br/>critical?"} subgraph Step3 ["● Level 2 + Red-Team (concurrent)"] direction TB L2Agents["● Level 2 agents (parallel)<br/>━━━━━━━━━━<br/>baseline_fairness<br/>causal_structure<br/>unit_interference"] RedTeam["● Red-Team Agent<br/>━━━━━━━━━━<br/>5 universal challenges<br/>+ type-specific focus<br/>requires_human: true"] end subgraph Step4 ["● Level 3 (parallel, after L2)"] direction LR L3Agents["● Level 3 agents<br/>━━━━━━━━━━<br/>error_budget<br/>statistical_corrections<br/>variance_protocol"] end subgraph Step5 ["● Level 4 (triage-gated)"] L4Gate{"weight ≥ L?"} L4Agents["● Level 4 agents<br/>━━━━━━━━━━<br/>benchmark_representativeness<br/>ecological_validity<br/>measurement_alignment<br/>reproducibility_spec"] end subgraph Step6 ["● Step 6: Wait Red-Team"] WaitRT["● Await red-team<br/>━━━━━━━━━━<br/>merge findings<br/>requires_human preserved"] end subgraph Step7 ["● Step 7: Synthesis"] Merge["● Merge & Deduplicate<br/>━━━━━━━━━━<br/>L1+L2+L3+L4+red-team<br/>collapse by dim/section/msg"] Verdict{"● Verdict Logic<br/>━━━━━━━━━━<br/>STOP / REVISE / GO"} Dashboard["● evaluation_dashboard<br/>━━━━━━━━━━<br/>scorecard + YAML summary<br/>Cannot Assess ≥ 2 items"] Guidance["● revision_guidance<br/>━━━━━━━━━━<br/>REVISE only<br/>required + recommended fixes"] end subgraph Step8 ["● Step 8: Emit Tokens"] EmitTokens["● verdict<br/>experiment_type<br/>evaluation_dashboard<br/>revision_guidance (REVISE)<br/>%%ORDER_UP%%"] end subgraph Tests ["★ Contract Validators (static)"] ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>23 static SKILL.md checks<br/>triage · fail-fast · red-team<br/>verdict · dashboard · tokens"] ComplianceTests["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry"] end START --> ReadPlan ReadPlan --> TempDir TempDir --> Triage Triage --> EstimandAgent & FalsifAgent EstimandAgent & FalsifAgent --> L1Gate L1Gate -->|"YES — critical found"| STOP_END L1Gate -->|"NO — clean"| L2Agents & RedTeam L2Agents --> L3Agents L2Agents -->|"triage weights"| L4Gate L4Gate -->|"weight ≥ L"| L4Agents L4Gate -->|"SILENT"| WaitRT L3Agents --> WaitRT L4Agents --> WaitRT WaitRT --> Merge Merge --> Verdict Verdict -->|"STOP"| STOP_END Verdict -->|"REVISE"| Dashboard Verdict -->|"REVISE"| Guidance Verdict -->|"GO"| Dashboard Dashboard --> EmitTokens Guidance --> EmitTokens STOP_END --> END EmitTokens --> END Dashboard -.->|"validates"| ContractTests EmitTokens -.->|"validates"| ComplianceTests %% CLASS ASSIGNMENTS %% class START,END,STOP_END terminal; class L1Gate,Verdict,L4Gate detector; class Triage,EstimandAgent,FalsifAgent,L2Agents,L3Agents,L4Agents,RedTeam,WaitRT newComponent; class ReadPlan,TempDir phase; class Merge handler; class Dashboard,Guidance,EmitTokens output; class ContractTests newComponent; class ComplianceTests handler;Color Legend:
State Lifecycle Diagram
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; subgraph Lifecycles ["● FIELD LIFECYCLE CATEGORIES (SKILL.md contracts)"] direction LR INIT_ONLY["● INIT_ONLY<br/>━━━━━━━━━━<br/>experiment_plan_path<br/>plan_text<br/>NEVER modify after parse"] INIT_PRESERVE["● INIT_PRESERVE<br/>━━━━━━━━━━<br/>experiment_type<br/>dimension_weights<br/>secondary_modifiers<br/>frozen after triage"] APPEND["● APPEND_ONLY<br/>━━━━━━━━━━<br/>findings_pool<br/>grows per agent<br/>dedup at synthesis"] DERIVED["● DERIVED<br/>━━━━━━━━━━<br/>critical_findings<br/>warning_findings<br/>stop_triggers<br/>computed, not stored"] end subgraph FindingContract ["● Finding Format Contract"] direction TB FindingSpec["● JSON finding structure<br/>━━━━━━━━━━<br/>section · dimension · level<br/>severity: critical|warning|info<br/>requires_human: bool<br/>message: actionable str"] RTContract["● Red-team invariant<br/>━━━━━━━━━━<br/>requires_human: true<br/>dimension: red_team<br/>ALWAYS — preserved through dedup"] end subgraph Gates ["● VALIDATION GATES"] direction TB L1Gate["● L1 Fail-Fast Gate<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability}?<br/>→ STOP, skip L2-L4"] SilenceGate["● Three-Layer Silence Gate<br/>━━━━━━━━━━<br/>1. SILENT from matrix → don't spawn<br/>2. Foothold validation → M/L → S<br/>3. L-weight zero findings → suppress"] DedupeGate["● Deduplication Gate<br/>━━━━━━━━━━<br/>key=(dimension, section, msg)<br/>identical findings collapsed"] end subgraph VerdictGates ["● VERDICT OUTPUT GATES"] direction TB VerdictLogic["● Verdict Logic (precedence)<br/>━━━━━━━━━━<br/>stop_triggers? → STOP<br/>critical or warnings≥3? → REVISE<br/>else → GO"] OutputGuard["● Output Emission Guard<br/>━━━━━━━━━━<br/>evaluation_dashboard → ALWAYS<br/>revision_guidance → REVISE only<br/>exit 0 for all verdicts"] end subgraph Validators ["★ Contract Validators"] ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>static SKILL.md checks:<br/>· requires_human: true present<br/>· STOP/REVISE/GO all named<br/>· Cannot Assess ≥2 items<br/>· YAML summary block present<br/>· %%ORDER_UP%% terminal marker"] ComplianceTest["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry<br/>tokens: evaluation_dashboard<br/> revision_guidance"] end INIT_ONLY -->|"read-only input to"| FindingSpec INIT_PRESERVE -->|"weights control"| SilenceGate FindingSpec --> L1Gate RTContract -->|"merges into"| APPEND L1Gate -->|"STOP path"| VerdictLogic L1Gate -->|"clean: spawn L2-L4"| SilenceGate SilenceGate -->|"filtered agents append to"| APPEND APPEND --> DedupeGate DedupeGate -->|"deduplicated pool"| DERIVED DERIVED --> VerdictLogic VerdictLogic --> OutputGuard OutputGuard -.->|"validates"| ContractTests OutputGuard -.->|"validates tokens"| ComplianceTest %% CLASS ASSIGNMENTS %% class INIT_ONLY detector; class INIT_PRESERVE gap; class APPEND handler; class DERIVED phase; class FindingSpec,RTContract stateNode; class L1Gate,SilenceGate,DedupeGate detector; class VerdictLogic,OutputGuard output; class ContractTests newComponent; class ComplianceTest handler;Color Legend:
Closes #591
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/impl-20260403-190640-751319/.autoskillit/temp/make-plan/review_design_skill_plan_2026-04-03_120000.md🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary