
Implementation Plan: Add Red-Team Severity Calibration by Experiment Type in review-design #614

Merged

Trecek merged 8 commits into integration from
add-red-team-severity-calibration-by-experiment-type-in-revi/609 on Apr 5, 2026

Conversation

@Trecek (Collaborator) commented Apr 5, 2026

Summary

The review-design skill has L1 severity calibration that correctly caps estimand_clarity and hypothesis_falsifiability by experiment_type — benchmarks can never produce L1 critical findings. But the red-team dimension has no analogous calibration, meaning any critical red-team finding triggers STOP regardless of experiment type. This creates an unresolvable loop for benchmarks: the red-team always finds new critical issues at progressively higher abstraction (the Hydra pattern), exhausting retries without ever producing GO.

The fix adds a red-team severity calibration rubric to review-design/SKILL.md (mirroring the L1 rubric), updates the verdict logic to apply the cap before building stop_triggers, and adds diminishing-return awareness to resolve-design-review/SKILL.md so it can detect goalposts-moving across rounds.
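The cap-before-verdict ordering can be sketched in a few lines of Python. All names below (`RT_MAX_SEVERITY`, `Finding`, `apply_rt_cap`, `build_stop_triggers`) are illustrative rather than the skill's actual identifiers; the ceilings follow the rubric described in the commits (benchmark caps at warning, causal_inference retains critical, exploratory caps at info):

```python
from dataclasses import dataclass, replace

SEVERITY_ORDER = ["info", "warning", "critical"]

# Illustrative ceilings per experiment type: a benchmark can never retain
# a critical red-team finding, so it can never trigger STOP on that dimension.
RT_MAX_SEVERITY = {
    "benchmark": "warning",
    "causal_inference": "critical",
    "exploratory": "info",
}

@dataclass(frozen=True)
class Finding:
    dimension: str
    severity: str
    summary: str

def apply_rt_cap(findings, experiment_type):
    """Downgrade red-team findings above the experiment type's ceiling."""
    ceiling = RT_MAX_SEVERITY.get(experiment_type, "critical")
    cap = SEVERITY_ORDER.index(ceiling)
    return [
        replace(f, severity=ceiling)
        if f.dimension == "red_team" and SEVERITY_ORDER.index(f.severity) > cap
        else f
        for f in findings
    ]

def build_stop_triggers(findings):
    # Built AFTER the cap, so only criticals that survive it can STOP.
    return [f for f in findings if f.severity == "critical"]
```

The essential constraint is the ordering: `build_stop_triggers` must consume the capped list, never the raw red-team output.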

Architecture Impact

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([Plan submitted])
    GO([GO → execute])
    REVISE_OUT([REVISE → revise_design])
    REVISED_OUT([revised → revise_design])
    FAILED_OUT([failed → design_rejected])

    subgraph ReviewDesign ["● review-design/SKILL.md"]
        direction TB
        L1["L1 Analysis<br/>━━━━━━━━━━<br/>estimand_clarity +<br/>hypothesis_falsifiability"]
        L1GATE{"L1 Fail-Fast<br/>━━━━━━━━━━<br/>Any L1 critical?"}
        PARALLEL["L2 + L3 + L4 + RT<br/>━━━━━━━━━━<br/>Parallel analysis"]
        RTCAP["● RT Severity Cap<br/>━━━━━━━━━━<br/>RT_MAX_SEVERITY[experiment_type]<br/>Downgrade if above ceiling"]
        MERGE["Merge + Dedup<br/>━━━━━━━━━━<br/>All findings pooled"]
        VERDICT{"● Verdict Logic<br/>━━━━━━━━━━<br/>stop_triggers built<br/>AFTER rt_cap applied"}
    end

    subgraph ResolveDesign ["● resolve-design-review/SKILL.md"]
        direction TB
        PARSE["Step 1: Parse Dashboard<br/>━━━━━━━━━━<br/>Extract stop-trigger findings<br/>Classify ADDRESSABLE/STRUCTURAL/DISCUSS"]
        DIMCHECK{"prior_revision_guidance<br/>━━━━━━━━━━<br/>provided?"}
        DIMRET["● Step 1.5: Diminishing-Return<br/>━━━━━━━━━━<br/>Compare ADDRESSABLE themes<br/>vs prior guidance entries"]
        GOALPOST{"goalposts_moving<br/>━━━━━━━━━━<br/>true for any finding?"}
        RECLASSIFY["● Reclassify<br/>━━━━━━━━━━<br/>ADDRESSABLE → STRUCTURAL<br/>annotate prior_theme_match"]
        RESGATE{"Any ADDRESSABLE<br/>or DISCUSS?"}
    end

    subgraph RecipeRouting ["● research.yaml — resolve_design_review step"]
        direction LR
        RECIPE["skill_command passes<br/>━━━━━━━━━━<br/>$context.revision_guidance<br/>as optional 3rd arg"]
    end

    START --> L1
    L1 --> L1GATE
    L1GATE -->|"yes (L1 critical)"| MERGE
    L1GATE -->|"no"| PARALLEL
    PARALLEL --> RTCAP
    RTCAP --> MERGE
    MERGE --> VERDICT
    VERDICT -->|"stop_triggers present"| RECIPE
    VERDICT -->|"critical or ≥3 warnings"| REVISE_OUT
    VERDICT -->|"otherwise"| GO

    RECIPE --> PARSE
    PARSE --> DIMCHECK
    DIMCHECK -->|"yes"| DIMRET
    DIMCHECK -->|"no (round 1)"| RESGATE
    DIMRET --> GOALPOST
    GOALPOST -->|"true"| RECLASSIFY
    GOALPOST -->|"false"| RESGATE
    RECLASSIFY --> RESGATE
    RESGATE -->|"yes"| REVISED_OUT
    RESGATE -->|"all STRUCTURAL"| FAILED_OUT

    class START,GO,REVISE_OUT,REVISED_OUT,FAILED_OUT terminal;
    class L1,PARALLEL handler;
    class L1GATE,VERDICT,DIMCHECK,GOALPOST,RESGATE stateNode;
    class MERGE,PARSE phase;
    class RTCAP,DIMRET,RECLASSIFY newComponent;
    class RECIPE detector;

Color Legend:

| Color | Category | Description |
| --- | --- | --- |
| Dark Blue | Terminal | Start and outcome states |
| Orange | Handler | Analysis agents (L1, parallel L2-L4+RT) |
| Teal | State | Decision points and verdict routing |
| Purple | Phase | Merge and parse aggregation steps |
| Green | Modified Component | ● Nodes changed by this PR (RT cap, diminishing-return detection, reclassify, recipe routing) |
| Red | Detector | Recipe routing gate (passes revision_guidance) |

Closes #609

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/impl-20260404-185816-184240/.autoskillit/temp/make-plan/add-red-team-severity-calibration-by-experiment-type_plan_2026-04-04_185816.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | input | output | cached | count | time |
| --- | --- | --- | --- | --- | --- |
| plan | 5.5k | 76.5k | 6.0M | 5 | 32m 41s |
| verify | 3.1k | 86.2k | 5.4M | 5 | 31m 25s |
| implement | 1.1k | 116.2k | 22.6M | 6 | 50m 55s |
| fix | 214 | 28.4k | 3.5M | 5 | 30m 58s |
| audit_impl | 137 | 58.9k | 3.1M | 5 | 19m 28s |
| open_pr | 135 | 68.4k | 5.4M | 4 | 23m 1s |
| review_pr | 31 | 22.8k | 1.2M | 1 | 5m 50s |
| Total | 10.2k | 457.5k | 47.2M | | 3h 14m |

Trecek and others added 6 commits April 4, 2026 19:35
Add experiment-type-aware severity cap for red-team findings, mirroring
the existing L1 calibration rubric. Benchmarks cap at warning (no STOP),
causal_inference retains critical, exploratory caps at info. The cap is
applied in Step 7 before verdict logic evaluates stop_triggers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add optional prior_revision_guidance_path argument and Step 1.5 that
detects goalposts-moving findings by comparing current ADDRESSABLE
findings against prior revision themes. Goalposts-moving findings are
reclassified as STRUCTURAL to terminate non-converging review cycles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
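The goalposts-moving check described in this commit might look roughly like the following sketch. Function and field names are invented for illustration, and the matching here is naive substring overlap, whereas the actual skill presumably compares themes in prose:

```python
def reclassify_goalposts_moving(findings: list[dict], prior_themes: list[str]) -> list[dict]:
    """Reclassify ADDRESSABLE findings that re-tread prior revision themes.

    If a current finding's theme overlaps an entry from the prior revision
    guidance, the review is circling (the Hydra pattern): mark it STRUCTURAL
    and record which prior theme it matched, so the recipe can route to
    design_rejected instead of burning another revision round.
    """
    out = []
    for f in findings:
        match = next(
            (t for t in prior_themes if t in f["theme"] or f["theme"] in t),
            None,
        )
        if f["class"] == "ADDRESSABLE" and match is not None:
            f = {**f, "class": "STRUCTURAL", "prior_theme_match": match}
        out.append(f)
    return out
```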
…urn detection

Add 4 tests for review-design (rubric present, cap before verdict,
benchmark cannot STOP, causal_inference can STOP) and 3 tests for
resolve-design-review (diminishing-return present, goalposts reclassified
as STRUCTURAL, revision_guidance context input).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ch recipe

Update skill_command to include context.revision_guidance as third arg
and add revision_guidance to optional_context_refs for backward compat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract column-value mapping from header+data rows instead of searching
for experiment type names in data rows (which appear only in headers).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Trecek (Collaborator, Author) left a comment:

AutoSkillit PR Review — Verdict: changes_requested (8 blocking issues found; see inline comments)

```python
def test_red_team_severity_cap_applied_before_verdict(skill_text: str) -> None:
    """Severity cap must be applied BEFORE building stop_triggers in verdict logic.

    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
```

[critical] tests: Syntax error: the docstring for test_red_team_severity_cap_applied_before_verdict is missing its opening """. Line 322 reads """Severity cap must be applied BEFORE building stop_triggers in verdict logic.""" (closing on same line), then L324 is a bare string Without this ordering, red-team criticals bypass the cap... followed by """ on L325. This makes L324 a bare expression that is NOT inside a triple-quoted string, causing a SyntaxError at import time — the entire test module fails to collect.


Investigated — this is intentional. The docstring is correctly formed: """Severity cap must be applied BEFORE building stop_triggers in verdict logic.\n\n Without this ordering...\n """ (opening triple-quote on L322, closing on L325). ast.parse() confirms no SyntaxError. The diff hunk visible to the reviewer was truncated before the closing line. No change needed.


```python
    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
    """
    step7_text = skill_text_between("### Step 7", "### Step 8", skill_text)
```

[warning] tests: skill_text_between("### Step 7", "### Step 8", skill_text) is called but skill_text_between is not imported or defined anywhere in the visible diff. If this helper is absent from the existing test file, the test will raise a NameError at runtime.


Investigated — this is intentional. skill_text_between is defined at line 190 of the same file (def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:). The function predates this PR. The reviewer's diff hunk started at line 294 and did not include the pre-existing helper definition above it.
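For context, a plausible shape for that pre-existing helper, inferred from the quoted signature (the actual implementation in the test file may differ):

```python
def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    """Return the SKILL.md slice from start_heading up to end_heading."""
    start = text.index(start_heading)
    end = text.index(end_heading, start)
    return text[start:end]
```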

@Trecek (Collaborator, Author) left a comment:

AutoSkillit review found 8 blocking issues. See inline comments.

Verdict: changes_requested

Critical (1):

  • tests/skills/test_review_design_contracts.py L324: Syntax error — missing opening """ for docstring of test_red_team_severity_cap_applied_before_verdict. Module fails to import.

Warnings (7):

  • tests/skills/test_review_design_contracts.py L326: skill_text_between called but not imported/defined — NameError at runtime
  • tests/skills/test_review_design_contracts.py L314: Arbitrary 1000-char window in rubric presence check — fragile assertion
  • tests/skills/test_review_design_contracts.py L341: Private helper _parse_rt_rubric placed mid-sequence between public tests — cohesion violation
  • tests/skills/test_review_design_contracts.py L346: Missing equal-length guard before zip(headers[1:], values[1:]) — silent truncation
  • tests/skills/test_review_design_contracts.py L347: Off-by-one risk if table_lines[1] absent — IndexError instead of informative assertion
  • tests/skills/test_review_design_contracts.py L350: Fragile [1:] slice assumes stable table structure — silent off-by-one if table reformatted
  • src/autoskillit/skills_extended/review-design/SKILL.md L315: Asymmetric naming: critical vs warning_findings — inconsistent _findings suffix

Info (2, not blocking):

  • src/autoskillit/skills_extended/review-design/SKILL.md L319: Hyphen vs en-dash in comment
  • tests/skills/test_resolve_design_review_contracts.py L93: Three-way OR in assertion is too easy to satisfy accidentally

Trecek and others added 2 commits April 4, 2026 20:47
…tion boundary, add equality assertions, drop [1:] slicing

- Move _parse_rt_rubric to top of red-team section so it precedes all tests that call it
- Replace 1000-char fixed window with next-section-heading boundary in both _parse_rt_rubric and test_red_team_severity_calibration_rubric_present
- Change len(table_lines) >= 2 to == 2 to enforce exact one-header/one-data-row structure
- Add assert len(headers) == len(values) before zip() to catch mismatched column counts
- Drop [1:] slicing; use dict(zip(headers, values)) so callers look up by name without index-based alignment assumptions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng in verdict logic

warning_findings already uses the _findings suffix; align critical to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Trecek Trecek enabled auto-merge April 5, 2026 04:00
@Trecek Trecek disabled auto-merge April 5, 2026 04:01
@Trecek Trecek added this pull request to the merge queue Apr 5, 2026
Merged via the queue into integration with commit 75eafa2 Apr 5, 2026
2 checks passed
@Trecek Trecek deleted the add-red-team-severity-calibration-by-experiment-type-in-revi/609 branch April 5, 2026 04:04
