Implementation Plan: Create review-design Skill by Trecek · Pull Request #598 · TalonT-Org/AutoSkillit

Trecek · 2026-04-04T02:51:01Z

Summary

Replace the minimal stub at src/autoskillit/skills_extended/review-design/SKILL.md with
a complete implementation of the automated experiment design validation skill. The skill
runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents and
an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict. Two supporting test
files are added: a static contract test file for the skill and an update to
PATH_CAPTURE_SKILLS in the existing output-compliance test.

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([START])
    STOP_END([STOP — emit tokens])
    END([END])

    subgraph Step0 ["● Step 0: Parse & Setup"]
        ReadPlan["● Read Plan<br/>━━━━━━━━━━<br/>YAML frontmatter<br/>+ LLM fallback extraction"]
        TempDir["● mkdir<br/>━━━━━━━━━━<br/>.autoskillit/temp/review-design/"]
    end

    subgraph Step1 ["● Step 1: Triage Dispatcher"]
        Triage["● Triage Subagent<br/>━━━━━━━━━━<br/>→ experiment_type<br/>→ dimension_weights<br/>→ secondary_modifiers"]
    end

    subgraph Step2 ["● Level 1 — Fail-Fast (parallel)"]
        direction LR
        EstimandAgent["● estimand_clarity<br/>━━━━━━━━━━<br/>H-weight always<br/>formal contrast check"]
        FalsifAgent["● hypothesis_falsifiability<br/>━━━━━━━━━━<br/>H-weight always<br/>falsification check"]
    end

    L1Gate{"● L1 FAIL-FAST<br/>━━━━━━━━━━<br/>any L1<br/>critical?"}

    subgraph Step3 ["● Level 2 + Red-Team (concurrent)"]
        direction TB
        L2Agents["● Level 2 agents (parallel)<br/>━━━━━━━━━━<br/>baseline_fairness<br/>causal_structure<br/>unit_interference"]
        RedTeam["● Red-Team Agent<br/>━━━━━━━━━━<br/>5 universal challenges<br/>+ type-specific focus<br/>requires_human: true"]
    end

    subgraph Step4 ["● Level 3 (parallel, after L2)"]
        direction LR
        L3Agents["● Level 3 agents<br/>━━━━━━━━━━<br/>error_budget<br/>statistical_corrections<br/>variance_protocol"]
    end

    subgraph Step5 ["● Level 4 (triage-gated)"]
        L4Gate{"weight ≥ L?"}
        L4Agents["● Level 4 agents<br/>━━━━━━━━━━<br/>benchmark_representativeness<br/>ecological_validity<br/>measurement_alignment<br/>reproducibility_spec"]
    end

    subgraph Step6 ["● Step 6: Wait Red-Team"]
        WaitRT["● Await red-team<br/>━━━━━━━━━━<br/>merge findings<br/>requires_human preserved"]
    end

    subgraph Step7 ["● Step 7: Synthesis"]
        Merge["● Merge & Deduplicate<br/>━━━━━━━━━━<br/>L1+L2+L3+L4+red-team<br/>collapse by dim/section/msg"]
        Verdict{"● Verdict Logic<br/>━━━━━━━━━━<br/>STOP / REVISE / GO"}
        Dashboard["● evaluation_dashboard<br/>━━━━━━━━━━<br/>scorecard + YAML summary<br/>Cannot Assess ≥ 2 items"]
        Guidance["● revision_guidance<br/>━━━━━━━━━━<br/>REVISE only<br/>required + recommended fixes"]
    end

    subgraph Step8 ["● Step 8: Emit Tokens"]
        EmitTokens["● verdict<br/>experiment_type<br/>evaluation_dashboard<br/>revision_guidance (REVISE)<br/>%%ORDER_UP%%"]
    end

    subgraph Tests ["★ Contract Validators (static)"]
        ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>23 static SKILL.md checks<br/>triage · fail-fast · red-team<br/>verdict · dashboard · tokens"]
        ComplianceTests["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry"]
    end

    START --> ReadPlan
    ReadPlan --> TempDir
    TempDir --> Triage
    Triage --> EstimandAgent & FalsifAgent
    EstimandAgent & FalsifAgent --> L1Gate
    L1Gate -->|"YES — critical found"| STOP_END
    L1Gate -->|"NO — clean"| L2Agents & RedTeam
    L2Agents --> L3Agents
    L2Agents -->|"triage weights"| L4Gate
    L4Gate -->|"weight ≥ L"| L4Agents
    L4Gate -->|"SILENT"| WaitRT
    L3Agents --> WaitRT
    L4Agents --> WaitRT
    WaitRT --> Merge
    Merge --> Verdict
    Verdict -->|"STOP"| STOP_END
    Verdict -->|"REVISE"| Dashboard
    Verdict -->|"REVISE"| Guidance
    Verdict -->|"GO"| Dashboard
    Dashboard --> EmitTokens
    Guidance --> EmitTokens
    STOP_END --> END
    EmitTokens --> END
    Dashboard -.->|"validates"| ContractTests
    EmitTokens -.->|"validates"| ComplianceTests

    %% CLASS ASSIGNMENTS %%
    class START,END,STOP_END terminal;
    class L1Gate,Verdict,L4Gate detector;
    class Triage,EstimandAgent,FalsifAgent,L2Agents,L3Agents,L4Agents,RedTeam,WaitRT newComponent;
    class ReadPlan,TempDir phase;
    class Merge handler;
    class Dashboard,Guidance,EmitTokens output;
    class ContractTests newComponent;
    class ComplianceTests handler;

Color Legend:

Color	Category	Description
Dark Blue	Terminal	START, END, and STOP-verdict states
Red	Detector	L1 fail-fast gate, verdict logic, L4 weight gate
Green	New/Modified	All ● modified skill agents; ★ new contract tests
Purple	Phase	Parse/setup control nodes
Orange	Handler	Merge, deduplication, compliance test update
Dark Teal	Output	Dashboard, guidance, token emission

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph Lifecycles ["● FIELD LIFECYCLE CATEGORIES (SKILL.md contracts)"]
        direction LR
        INIT_ONLY["● INIT_ONLY<br/>━━━━━━━━━━<br/>experiment_plan_path<br/>plan_text<br/>NEVER modify after parse"]
        INIT_PRESERVE["● INIT_PRESERVE<br/>━━━━━━━━━━<br/>experiment_type<br/>dimension_weights<br/>secondary_modifiers<br/>frozen after triage"]
        APPEND["● APPEND_ONLY<br/>━━━━━━━━━━<br/>findings_pool<br/>grows per agent<br/>dedup at synthesis"]
        DERIVED["● DERIVED<br/>━━━━━━━━━━<br/>critical_findings<br/>warning_findings<br/>stop_triggers<br/>computed, not stored"]
    end

    subgraph FindingContract ["● Finding Format Contract"]
        direction TB
        FindingSpec["● JSON finding structure<br/>━━━━━━━━━━<br/>section · dimension · level<br/>severity: critical|warning|info<br/>requires_human: bool<br/>message: actionable str"]
        RTContract["● Red-team invariant<br/>━━━━━━━━━━<br/>requires_human: true<br/>dimension: red_team<br/>ALWAYS — preserved through dedup"]
    end

    subgraph Gates ["● VALIDATION GATES"]
        direction TB
        L1Gate["● L1 Fail-Fast Gate<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability}?<br/>→ STOP, skip L2-L4"]
        SilenceGate["● Three-Layer Silence Gate<br/>━━━━━━━━━━<br/>1. SILENT from matrix → don't spawn<br/>2. Foothold validation → M/L → S<br/>3. L-weight zero findings → suppress"]
        DedupeGate["● Deduplication Gate<br/>━━━━━━━━━━<br/>key=(dimension, section, msg)<br/>identical findings collapsed"]
    end

    subgraph VerdictGates ["● VERDICT OUTPUT GATES"]
        direction TB
        VerdictLogic["● Verdict Logic (precedence)<br/>━━━━━━━━━━<br/>stop_triggers? → STOP<br/>critical or warnings≥3? → REVISE<br/>else → GO"]
        OutputGuard["● Output Emission Guard<br/>━━━━━━━━━━<br/>evaluation_dashboard → ALWAYS<br/>revision_guidance → REVISE only<br/>exit 0 for all verdicts"]
    end

    subgraph Validators ["★ Contract Validators"]
        ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>static SKILL.md checks:<br/>· requires_human: true present<br/>· STOP/REVISE/GO all named<br/>· Cannot Assess ≥2 items<br/>· YAML summary block present<br/>· %%ORDER_UP%% terminal marker"]
        ComplianceTest["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry<br/>tokens: evaluation_dashboard<br/>         revision_guidance"]
    end

    INIT_ONLY -->|"read-only input to"| FindingSpec
    INIT_PRESERVE -->|"weights control"| SilenceGate
    FindingSpec --> L1Gate
    RTContract -->|"merges into"| APPEND
    L1Gate -->|"STOP path"| VerdictLogic
    L1Gate -->|"clean: spawn L2-L4"| SilenceGate
    SilenceGate -->|"filtered agents append to"| APPEND
    APPEND --> DedupeGate
    DedupeGate -->|"deduplicated pool"| DERIVED
    DERIVED --> VerdictLogic
    VerdictLogic --> OutputGuard
    OutputGuard -.->|"validates"| ContractTests
    OutputGuard -.->|"validates tokens"| ComplianceTest

    %% CLASS ASSIGNMENTS %%
    class INIT_ONLY detector;
    class INIT_PRESERVE gap;
    class APPEND handler;
    class DERIVED phase;
    class FindingSpec,RTContract stateNode;
    class L1Gate,SilenceGate,DedupeGate detector;
    class VerdictLogic,OutputGuard output;
    class ContractTests newComponent;
    class ComplianceTest handler;

Color Legend:

Color	Category	Description
Red	INIT_ONLY / Gates	Never-modify fields; validation gate nodes
Yellow	INIT_PRESERVE	Triage-frozen fields (experiment_type, weights)
Orange	APPEND_ONLY	findings_pool — grows through agent phases
Purple	DERIVED	Computed views over findings_pool at synthesis
Teal	Finding contracts	JSON structure + red-team invariant
Dark Teal	Output gates	Verdict logic + conditional emission guards
Green	★ New validators	Contract test file newly added by this PR

Closes #591

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-20260403-190640-751319/.autoskillit/temp/make-plan/review_design_skill_plan_2026-04-03_120000.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

Step	input	output	cached	count	time
plan	58	20.2k	2.5M	1	19m 44s
verify	22	21.0k	1.0M	1	7m 14s
implement	29	16.7k	1.2M	1	6m 4s
audit_impl	15	12.1k	344.4k	1	4m 12s
open_pr	28	16.9k	961.9k	1	5m 8s
Total	152	86.9k	6.0M		42m 23s

Replace minimal stub SKILL.md with complete implementation encoding triage-first, fail-fast multi-level dimensional analysis, adversarial red-team, GO/REVISE/STOP verdict logic, and evaluation dashboard spec. Add test_review_design_contracts.py (17 contract tests) and register review-design in PATH_CAPTURE_SKILLS for output compliance testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Trecek

AutoSkillit PR Review — Verdict: changes_requested

Found 4 critical + 15 warning findings. Inline comments attached.

src/autoskillit/skills_extended/review-design/SKILL.md

tests/skills/test_review_design_contracts.py

Trecek

AutoSkillit review found 4 critical + 15 warning findings across arch, tests, defense, bugs, cohesion, and slop dimensions. See inline comments for details.

Critical issues:

requires_human field in Finding Format and red-team instructions diverges from project-wide requires_decision convention (review-pr, review-research-pr use requires_decision) — two critical cohesion violations at SKILL.md L177 and L292.
test_dimension_weight_tiers_defined is vacuously satisfied by single-letter fallback — test provides zero signal.
test_verdict_stop_on_l1_critical is entirely subsumed by other tests — does not verify causal linkage.

Routing to resolve-review for automated remediation.

…ide cohesion Aligned requires_human → requires_decision throughout review-design SKILL.md and test to match the convention in review-pr and review-research-pr. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tracts in review-design - Document STOP emission when experiment_plan_path is absent or file is missing - Document graceful YAML parse error degradation to Level 2 extraction - Document triage dispatcher schema validation with exploratory fallback - Document fail-fast gate behavior for unparseable L1 subagent responses - Clarify red_team STOP pathway in verdict logic comments - Fix description to enumerate all four output tokens Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ntracts.py - Remove vacuous `tier in skill_text` fallback from test_dimension_weight_tiers_defined - Remove weak `S (` fallback; assert behavioral contract via 'not spawned' in test_silent_tier - Add 'always H-weight' assertion to test_universal_dimensions_always_run - Remove trivially-satisfied 'stop' case-sensitive fallback from test_l1_fail_fast_gate_present - Assert specific verdict assignment syntax in test_verdict_logic_all_three_outcomes - Replace subsumed test_verdict_stop_on_l1_critical with causal stop_triggers linkage check - Assert '>= 3' expression in test_verdict_revise_threshold_defined - Assert coupled 'Cannot Assess section with at least 2' phrase in test_dashboard_cannot_assess_section - Assert specific YAML block header in test_dashboard_yaml_summary_block Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Trecek commented Apr 4, 2026

View reviewed changes

Trecek and others added 3 commits April 3, 2026 20:14

Trecek added this pull request to the merge queue Apr 4, 2026

Merged via the queue into integration with commit 0119549 Apr 4, 2026
2 checks passed

Trecek deleted the create-review-design-skill-automated-experiment-design-valid/591 branch April 4, 2026 03:26

This was referenced Apr 4, 2026

review-design L1 subagents lack experiment-type severity calibration, causing false STOP on benchmarks #599

Open

Token summary table shows misleading column categories — split into 4 API token fields and rename "input" to "uncached" #604

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation Plan: Create review-design Skill#598

Implementation Plan: Create review-design Skill#598
Trecek merged 4 commits intointegrationfrom
create-review-design-skill-automated-experiment-design-valid/591

Trecek commented Apr 4, 2026 •

edited

Loading

Uh oh!

Trecek left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Trecek left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Trecek commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture Impact

Process Flow Diagram

State Lifecycle Diagram

Implementation Plan

Token Usage Summary

Uh oh!

Trecek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Trecek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Trecek commented Apr 4, 2026 •

edited

Loading