Implementation Plan: Create review-design Skill #598

Merged
Trecek merged 4 commits into integration from create-review-design-skill-automated-experiment-design-valid/591
Apr 4, 2026

@Trecek Trecek commented Apr 4, 2026

Summary

Replace the minimal stub at src/autoskillit/skills_extended/review-design/SKILL.md with
a complete implementation of the automated experiment design validation skill. The skill
runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents and
an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict. Two test changes
accompany it: a new static contract test file for the skill, and a review-design entry
added to PATH_CAPTURE_SKILLS in the existing output-compliance test.

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([START])
    STOP_END([STOP — emit tokens])
    END([END])

    subgraph Step0 ["● Step 0: Parse & Setup"]
        ReadPlan["● Read Plan<br/>━━━━━━━━━━<br/>YAML frontmatter<br/>+ LLM fallback extraction"]
        TempDir["● mkdir<br/>━━━━━━━━━━<br/>.autoskillit/temp/review-design/"]
    end

    subgraph Step1 ["● Step 1: Triage Dispatcher"]
        Triage["● Triage Subagent<br/>━━━━━━━━━━<br/>→ experiment_type<br/>→ dimension_weights<br/>→ secondary_modifiers"]
    end

    subgraph Step2 ["● Level 1 — Fail-Fast (parallel)"]
        direction LR
        EstimandAgent["● estimand_clarity<br/>━━━━━━━━━━<br/>H-weight always<br/>formal contrast check"]
        FalsifAgent["● hypothesis_falsifiability<br/>━━━━━━━━━━<br/>H-weight always<br/>falsification check"]
    end

    L1Gate{"● L1 FAIL-FAST<br/>━━━━━━━━━━<br/>any L1<br/>critical?"}

    subgraph Step3 ["● Level 2 + Red-Team (concurrent)"]
        direction TB
        L2Agents["● Level 2 agents (parallel)<br/>━━━━━━━━━━<br/>baseline_fairness<br/>causal_structure<br/>unit_interference"]
        RedTeam["● Red-Team Agent<br/>━━━━━━━━━━<br/>5 universal challenges<br/>+ type-specific focus<br/>requires_human: true"]
    end

    subgraph Step4 ["● Level 3 (parallel, after L2)"]
        direction LR
        L3Agents["● Level 3 agents<br/>━━━━━━━━━━<br/>error_budget<br/>statistical_corrections<br/>variance_protocol"]
    end

    subgraph Step5 ["● Level 4 (triage-gated)"]
        L4Gate{"weight ≥ L?"}
        L4Agents["● Level 4 agents<br/>━━━━━━━━━━<br/>benchmark_representativeness<br/>ecological_validity<br/>measurement_alignment<br/>reproducibility_spec"]
    end

    subgraph Step6 ["● Step 6: Wait Red-Team"]
        WaitRT["● Await red-team<br/>━━━━━━━━━━<br/>merge findings<br/>requires_human preserved"]
    end

    subgraph Step7 ["● Step 7: Synthesis"]
        Merge["● Merge & Deduplicate<br/>━━━━━━━━━━<br/>L1+L2+L3+L4+red-team<br/>collapse by dim/section/msg"]
        Verdict{"● Verdict Logic<br/>━━━━━━━━━━<br/>STOP / REVISE / GO"}
        Dashboard["● evaluation_dashboard<br/>━━━━━━━━━━<br/>scorecard + YAML summary<br/>Cannot Assess ≥ 2 items"]
        Guidance["● revision_guidance<br/>━━━━━━━━━━<br/>REVISE only<br/>required + recommended fixes"]
    end

    subgraph Step8 ["● Step 8: Emit Tokens"]
        EmitTokens["● verdict<br/>experiment_type<br/>evaluation_dashboard<br/>revision_guidance (REVISE)<br/>%%ORDER_UP%%"]
    end

    subgraph Tests ["★ Contract Validators (static)"]
        ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>23 static SKILL.md checks<br/>triage · fail-fast · red-team<br/>verdict · dashboard · tokens"]
        ComplianceTests["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry"]
    end

    START --> ReadPlan
    ReadPlan --> TempDir
    TempDir --> Triage
    Triage --> EstimandAgent & FalsifAgent
    EstimandAgent & FalsifAgent --> L1Gate
    L1Gate -->|"YES — critical found"| STOP_END
    L1Gate -->|"NO — clean"| L2Agents & RedTeam
    L2Agents --> L3Agents
    L2Agents -->|"triage weights"| L4Gate
    L4Gate -->|"weight ≥ L"| L4Agents
    L4Gate -->|"SILENT"| WaitRT
    L3Agents --> WaitRT
    L4Agents --> WaitRT
    WaitRT --> Merge
    Merge --> Verdict
    Verdict -->|"STOP"| STOP_END
    Verdict -->|"REVISE"| Dashboard
    Verdict -->|"REVISE"| Guidance
    Verdict -->|"GO"| Dashboard
    Dashboard --> EmitTokens
    Guidance --> EmitTokens
    STOP_END --> END
    EmitTokens --> END
    Dashboard -.->|"validates"| ContractTests
    EmitTokens -.->|"validates"| ComplianceTests

    %% CLASS ASSIGNMENTS %%
    class START,END,STOP_END terminal;
    class L1Gate,Verdict,L4Gate detector;
    class Triage,EstimandAgent,FalsifAgent,L2Agents,L3Agents,L4Agents,RedTeam,WaitRT newComponent;
    class ReadPlan,TempDir phase;
    class Merge handler;
    class Dashboard,Guidance,EmitTokens output;
    class ContractTests newComponent;
    class ComplianceTests handler;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Dark Blue | Terminal | START, END, and STOP-verdict states |
| Red | Detector | L1 fail-fast gate, verdict logic, L4 weight gate |
| Green | New/Modified | All ● modified skill agents; ★ new contract tests |
| Purple | Phase | Parse/setup control nodes |
| Orange | Handler | Merge, deduplication, compliance test update |
| Dark Teal | Output | Dashboard, guidance, token emission |
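
The two gates in the diagram can be read as simple predicates. The skill itself is a SKILL.md prompt rather than Python, so the following is only a sketch of the gating rules under stated assumptions: the names `Finding`, `l1_fail_fast`, `should_spawn`, and `TIER_RANK` are hypothetical, and the tier ordering S < L < M < H is inferred from the "weight ≥ L" label rather than confirmed by the source.

```python
from dataclasses import dataclass

# Level 1 dimensions named in the diagram; a critical finding in either stops the run.
L1_DIMENSIONS = {"estimand_clarity", "hypothesis_falsifiability"}

# Assumed ordering of triage weight tiers: S (silent) < L < M < H.
TIER_RANK = {"S": 0, "L": 1, "M": 2, "H": 3}


@dataclass(frozen=True)
class Finding:
    dimension: str
    severity: str  # "critical" | "warning" | "info"


def l1_fail_fast(findings: list[Finding]) -> bool:
    """Return True when any Level 1 dimension reported a critical finding."""
    return any(
        f.dimension in L1_DIMENSIONS and f.severity == "critical" for f in findings
    )


def should_spawn(weight: str) -> bool:
    """Level 4 gate: spawn an agent only when its triage weight is at least L."""
    return TIER_RANK[weight] >= TIER_RANK["L"]


l1_results = [
    Finding("estimand_clarity", "warning"),
    Finding("hypothesis_falsifiability", "critical"),
]
print(l1_fail_fast(l1_results))                 # True: skip L2-L4 and emit STOP
print([w for w in "HMLS" if should_spawn(w)])   # ['H', 'M', 'L']
```

In this reading, a SILENT (S) dimension never reaches an agent at all, which matches the "SILENT" edge that bypasses the Level 4 agents.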

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    subgraph Lifecycles ["● FIELD LIFECYCLE CATEGORIES (SKILL.md contracts)"]
        direction LR
        INIT_ONLY["● INIT_ONLY<br/>━━━━━━━━━━<br/>experiment_plan_path<br/>plan_text<br/>NEVER modify after parse"]
        INIT_PRESERVE["● INIT_PRESERVE<br/>━━━━━━━━━━<br/>experiment_type<br/>dimension_weights<br/>secondary_modifiers<br/>frozen after triage"]
        APPEND["● APPEND_ONLY<br/>━━━━━━━━━━<br/>findings_pool<br/>grows per agent<br/>dedup at synthesis"]
        DERIVED["● DERIVED<br/>━━━━━━━━━━<br/>critical_findings<br/>warning_findings<br/>stop_triggers<br/>computed, not stored"]
    end

    subgraph FindingContract ["● Finding Format Contract"]
        direction TB
        FindingSpec["● JSON finding structure<br/>━━━━━━━━━━<br/>section · dimension · level<br/>severity: critical|warning|info<br/>requires_human: bool<br/>message: actionable str"]
        RTContract["● Red-team invariant<br/>━━━━━━━━━━<br/>requires_human: true<br/>dimension: red_team<br/>ALWAYS — preserved through dedup"]
    end

    subgraph Gates ["● VALIDATION GATES"]
        direction TB
        L1Gate["● L1 Fail-Fast Gate<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability}?<br/>→ STOP, skip L2-L4"]
        SilenceGate["● Three-Layer Silence Gate<br/>━━━━━━━━━━<br/>1. SILENT from matrix → don't spawn<br/>2. Foothold validation → M/L → S<br/>3. L-weight zero findings → suppress"]
        DedupeGate["● Deduplication Gate<br/>━━━━━━━━━━<br/>key=(dimension, section, msg)<br/>identical findings collapsed"]
    end

    subgraph VerdictGates ["● VERDICT OUTPUT GATES"]
        direction TB
        VerdictLogic["● Verdict Logic (precedence)<br/>━━━━━━━━━━<br/>stop_triggers? → STOP<br/>critical or warnings≥3? → REVISE<br/>else → GO"]
        OutputGuard["● Output Emission Guard<br/>━━━━━━━━━━<br/>evaluation_dashboard → ALWAYS<br/>revision_guidance → REVISE only<br/>exit 0 for all verdicts"]
    end

    subgraph Validators ["★ Contract Validators"]
        ContractTests["★ test_review_design_contracts.py<br/>━━━━━━━━━━<br/>static SKILL.md checks:<br/>· requires_human: true present<br/>· STOP/REVISE/GO all named<br/>· Cannot Assess ≥2 items<br/>· YAML summary block present<br/>· %%ORDER_UP%% terminal marker"]
        ComplianceTest["● test_skill_output_compliance.py<br/>━━━━━━━━━━<br/>PATH_CAPTURE_SKILLS +<br/>review-design entry<br/>tokens: evaluation_dashboard<br/>         revision_guidance"]
    end

    INIT_ONLY -->|"read-only input to"| FindingSpec
    INIT_PRESERVE -->|"weights control"| SilenceGate
    FindingSpec --> L1Gate
    RTContract -->|"merges into"| APPEND
    L1Gate -->|"STOP path"| VerdictLogic
    L1Gate -->|"clean: spawn L2-L4"| SilenceGate
    SilenceGate -->|"filtered agents append to"| APPEND
    APPEND --> DedupeGate
    DedupeGate -->|"deduplicated pool"| DERIVED
    DERIVED --> VerdictLogic
    VerdictLogic --> OutputGuard
    OutputGuard -.->|"validates"| ContractTests
    OutputGuard -.->|"validates tokens"| ComplianceTest

    %% CLASS ASSIGNMENTS %%
    class INIT_ONLY detector;
    class INIT_PRESERVE gap;
    class APPEND handler;
    class DERIVED phase;
    class FindingSpec,RTContract stateNode;
    class L1Gate,SilenceGate,DedupeGate detector;
    class VerdictLogic,OutputGuard output;
    class ContractTests newComponent;
    class ComplianceTest handler;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Red | INIT_ONLY / Gates | Never-modify fields; validation gate nodes |
| Yellow | INIT_PRESERVE | Triage-frozen fields (experiment_type, weights) |
| Orange | APPEND_ONLY | findings_pool, which grows through agent phases |
| Purple | DERIVED | Computed views over findings_pool at synthesis |
| Teal | Finding contracts | JSON structure + red-team invariant |
| Dark Teal | Output gates | Verdict logic + conditional emission guards |
| Green | ★ New validators | Contract test file newly added by this PR |
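
The deduplication gate and the verdict precedence shown above can be sketched in a few lines of Python. This is an illustration of the documented rules, not the skill's implementation: the function names `dedupe` and `verdict` are hypothetical, `stop_triggers` is passed in as an opaque list because the source does not spell out how it is derived, and OR-ing the `requires_human` flag on collapse is one assumed way to keep it "preserved through dedup".

```python
def dedupe(findings: list[dict]) -> list[dict]:
    """Collapse findings that share the key (dimension, section, message)."""
    merged: dict[tuple, dict] = {}
    for f in findings:
        key = (f["dimension"], f["section"], f["message"])
        if key in merged:
            # Assumption: a duplicate can never clear the human-review flag.
            merged[key]["requires_human"] |= f["requires_human"]
        else:
            merged[key] = dict(f)  # copy so the input pool is not mutated
    return list(merged.values())


def verdict(findings: list[dict], stop_triggers: list[str]) -> str:
    """Apply the precedence from the diagram: STOP, then REVISE, else GO."""
    if stop_triggers:
        return "STOP"
    critical = sum(f["severity"] == "critical" for f in findings)
    warnings = sum(f["severity"] == "warning" for f in findings)
    if critical > 0 or warnings >= 3:
        return "REVISE"
    return "GO"


red_team = {"dimension": "red_team", "section": "Design", "message": "Leakage risk",
            "severity": "warning", "requires_human": True}
pool = [
    red_team,
    dict(red_team),  # the same finding reported twice
    {"dimension": "error_budget", "section": "Analysis", "message": "No power estimate",
     "severity": "warning", "requires_human": False},
]

unique = dedupe(pool)
print(len(unique), verdict(unique, stop_triggers=[]))  # 2 GO
print(verdict(pool, stop_triggers=[]))                 # REVISE (3 raw warnings)
```

The example shows why deduplication runs before the derived counts: the raw pool crosses the three-warning threshold only because one finding was reported twice.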

Closes #591

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-20260403-190640-751319/.autoskillit/temp/make-plan/review_design_skill_plan_2026-04-03_120000.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

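| Step | input | output | cached | count | time |
|---|---|---|---|---|---|
| plan | 58 | 20.2k | 2.5M | 1 | 19m 44s |
| verify | 22 | 21.0k | 1.0M | 1 | 7m 14s |
| implement | 29 | 16.7k | 1.2M | 1 | 6m 4s |
| audit_impl | 15 | 12.1k | 344.4k | 1 | 4m 12s |
| open_pr | 28 | 16.9k | 961.9k | 1 | 5m 8s |
| Total | 152 | 86.9k | 6.0M | | 42m 23s |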

Replace minimal stub SKILL.md with complete implementation encoding
triage-first, fail-fast multi-level dimensional analysis, adversarial
red-team, GO/REVISE/STOP verdict logic, and evaluation dashboard spec.
Add test_review_design_contracts.py (17 contract tests) and register
review-design in PATH_CAPTURE_SKILLS for output compliance testing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
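
A static contract test of this kind only inspects the text of SKILL.md; it never runs the skill. The sketch below shows the general shape under stated assumptions: `SKILL_TEXT` is a hypothetical inline stand-in (the real tests would presumably read the SKILL.md file from disk), and the two test functions are illustrative rather than copies of anything in test_review_design_contracts.py.

```python
import re

# Hypothetical stand-in for the SKILL.md contents, kept inline so the
# example is self-contained.
SKILL_TEXT = """\
verdict = STOP | REVISE | GO
Red-team findings always set requires_human: true
Emit %%ORDER_UP%% as the terminal marker.
"""


def test_all_three_verdicts_named() -> None:
    # Word boundaries keep "GO" from matching inside words such as "GOAL".
    for name in ("STOP", "REVISE", "GO"):
        assert re.search(rf"\b{name}\b", SKILL_TEXT), f"missing verdict {name}"


def test_terminal_marker_present() -> None:
    assert "%%ORDER_UP%%" in SKILL_TEXT


# The functions follow pytest naming, but can also be called directly.
test_all_three_verdicts_named()
test_terminal_marker_present()
print("contract checks passed")
```

Because such checks are plain substring or regex matches, how specific each pattern is determines whether the test carries any signal.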

@Trecek Trecek left a comment

AutoSkillit PR Review — Verdict: changes_requested

Found 4 critical + 15 warning findings. Inline comments attached.


@Trecek Trecek left a comment

AutoSkillit review found 4 critical + 15 warning findings across arch, tests, defense, bugs, cohesion, and slop dimensions. See inline comments for details.

Critical issues:

  1. requires_human field in Finding Format and red-team instructions diverges from project-wide requires_decision convention (review-pr, review-research-pr use requires_decision) — two critical cohesion violations at SKILL.md L177 and L292.
  2. test_dimension_weight_tiers_defined is vacuously satisfied by single-letter fallback — test provides zero signal.
  3. test_verdict_stop_on_l1_critical is entirely subsumed by other tests — does not verify causal linkage.

Routing to resolve-review for automated remediation.
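
To see why a single-letter fallback makes a test vacuous, consider the following sketch. It is not the project's test code: `vacuous_check` reconstructs the criticized pattern from the review text, and `stricter_check` is one hypothetical tightening that assumes tiers are written as `<tier>-weight` labels (the diagrams show "H-weight" and "L-weight"; the other labels are an assumption).

```python
import re

TIERS = ("H", "M", "L", "S")


def vacuous_check(skill_text: str) -> bool:
    """Single-letter fallback: passes whenever each letter occurs anywhere."""
    return all(tier in skill_text for tier in TIERS)


def stricter_check(skill_text: str) -> bool:
    """Require each tier to appear as an explicit '<tier>-weight' label."""
    return all(re.search(rf"\b{tier}-weight\b", skill_text) for tier in TIERS)


# Text that defines no weight tiers at all, yet contains the four capitals.
unrelated = "How Much Latitude Should reviewers have?"
print(vacuous_check(unrelated), stricter_check(unrelated))  # True False

defined = "Tiers: H-weight, M-weight, L-weight, S-weight (silent)."
print(vacuous_check(defined), stricter_check(defined))      # True True
```

The first check cannot fail on any realistic document, so it gives no evidence that the tiers are actually defined.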

Trecek and others added 3 commits April 3, 2026 20:14
…ide cohesion

Aligned requires_human → requires_decision throughout review-design SKILL.md
and test to match the convention in review-pr and review-research-pr.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tracts in review-design

- Document STOP emission when experiment_plan_path is absent or file is missing
- Document graceful YAML parse error degradation to Level 2 extraction
- Document triage dispatcher schema validation with exploratory fallback
- Document fail-fast gate behavior for unparseable L1 subagent responses
- Clarify red_team STOP pathway in verdict logic comments
- Fix description to enumerate all four output tokens

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
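
The first two error contracts in this commit describe how plan loading degrades. The sketch below is a rough Python rendering under stated assumptions: `load_plan` and `parse_frontmatter` are hypothetical names, the tiny `key: value` parser stands in for a real YAML parser, and the returned dictionaries are illustrative shapes rather than the skill's actual output tokens.

```python
from pathlib import Path
from typing import Optional


def parse_frontmatter(text: str) -> dict:
    """Stand-in for a YAML parser: flat 'key: value' pairs between '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("no frontmatter block")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        meta[key.strip()] = value.strip()
    raise ValueError("unterminated frontmatter block")


def load_plan(experiment_plan_path: Optional[str]) -> dict:
    """Return a STOP result for a missing plan, else the text plus any metadata."""
    if not experiment_plan_path or not Path(experiment_plan_path).is_file():
        return {"verdict": "STOP", "reason": "experiment plan not found"}
    text = Path(experiment_plan_path).read_text(encoding="utf-8")
    try:
        metadata, extraction = parse_frontmatter(text), "frontmatter"
    except ValueError:
        # Degrade gracefully: leave metadata empty for the fallback extraction.
        metadata, extraction = {}, "fallback"
    return {"plan_text": text, "metadata": metadata, "extraction": extraction}


print(load_plan(None)["verdict"])  # STOP
```

A parse failure is deliberately not treated as a STOP condition here; only an absent path or missing file is.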
…ntracts.py

- Remove vacuous `tier in skill_text` fallback from test_dimension_weight_tiers_defined
- Remove weak `S (` fallback; assert behavioral contract via 'not spawned' in test_silent_tier
- Add 'always H-weight' assertion to test_universal_dimensions_always_run
- Remove trivially-satisfied 'stop' case-sensitive fallback from test_l1_fail_fast_gate_present
- Assert specific verdict assignment syntax in test_verdict_logic_all_three_outcomes
- Replace subsumed test_verdict_stop_on_l1_critical with causal stop_triggers linkage check
- Assert '>= 3' expression in test_verdict_revise_threshold_defined
- Assert coupled 'Cannot Assess section with at least 2' phrase in test_dashboard_cannot_assess_section
- Assert specific YAML block header in test_dashboard_yaml_summary_block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Trecek Trecek added this pull request to the merge queue Apr 4, 2026
Merged via the queue into integration with commit 0119549 Apr 4, 2026
2 checks passed
@Trecek Trecek deleted the create-review-design-skill-automated-experiment-design-valid/591 branch April 4, 2026 03:26