Implementation Plan: Add Red-Team Severity Calibration by Experiment Type in review-design#614
Conversation
Add experiment-type-aware severity cap for red-team findings, mirroring the existing L1 calibration rubric. Benchmarks cap at warning (no STOP), causal_inference retains critical, exploratory caps at info. The cap is applied in Step 7 before verdict logic evaluates stop_triggers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
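A minimal Python sketch of the cap this commit describes, assuming findings are dicts with a `severity` field; `RT_MAX_SEVERITY`, `apply_rt_cap`, and the dict shape are illustrative, not the skill's actual identifiers (the real logic lives in SKILL.md prose):

```python
# Hypothetical sketch of the red-team severity cap; names are illustrative.
SEVERITY_ORDER = ["info", "warning", "critical"]

RT_MAX_SEVERITY = {
    "benchmark": "warning",          # benchmarks can never STOP on red-team findings
    "causal_inference": "critical",  # full severity range retained
    "exploratory": "info",           # red-team findings are informational only
}

def apply_rt_cap(findings: list[dict], experiment_type: str) -> list[dict]:
    """Downgrade red-team findings above the per-experiment-type ceiling."""
    ceiling = RT_MAX_SEVERITY[experiment_type]
    max_rank = SEVERITY_ORDER.index(ceiling)
    capped = []
    for finding in findings:
        if SEVERITY_ORDER.index(finding["severity"]) > max_rank:
            # Record the original severity so the dashboard can show the downgrade.
            finding = {**finding, "severity": ceiling, "capped_from": finding["severity"]}
        capped.append(finding)
    return capped
```

Applying this before verdict logic runs means a benchmark's red-team critical arrives at `stop_triggers` as a warning and can no longer trigger STOP.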
Add optional prior_revision_guidance_path argument and Step 1.5 that detects goalposts-moving findings by comparing current ADDRESSABLE findings against prior revision themes. Goalposts-moving findings are reclassified as STRUCTURAL to terminate non-converging review cycles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
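The Step 1.5 check could look roughly like the sketch below. It assumes findings carry `theme` and `classification` fields, and reads the commit as: a current ADDRESSABLE finding whose theme already appeared in the prior round's guidance indicates a non-converging cycle. All names (`reclassify_goalposts_moving`, `prior_theme_match`, `goalposts_moving`) are illustrative:

```python
# Hedged sketch of the Step 1.5 diminishing-return check; field names are illustrative.
def reclassify_goalposts_moving(findings: list[dict], prior_themes: set[str]) -> list[dict]:
    """Reclassify ADDRESSABLE findings that repeat a prior-round theme as STRUCTURAL.

    A theme that survives a dedicated revision round suggests the review is not
    converging, so the cycle terminates instead of burning another retry.
    """
    result = []
    for finding in findings:
        if finding["classification"] == "ADDRESSABLE" and finding["theme"] in prior_themes:
            finding = {
                **finding,
                "classification": "STRUCTURAL",
                "prior_theme_match": finding["theme"],
                "goalposts_moving": True,
            }
        result.append(finding)
    return result
```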
…urn detection

Add 4 tests for review-design (rubric present, cap before verdict, benchmark cannot STOP, causal_inference can STOP) and 3 tests for resolve-design-review (diminishing-return present, goalposts reclassified as STRUCTURAL, revision_guidance context input).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
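One of those contract tests could take roughly the following self-contained shape; the real test reads the actual SKILL.md through a pytest fixture, so the inline `SKILL_TEXT`, heading markers, and helper here are stand-ins:

```python
# Illustrative, self-contained version of the ordering contract test.
SKILL_TEXT = """\
### Step 7
Apply the red-team severity cap per experiment_type.
Then build stop_triggers from the capped findings.
### Step 8
"""

def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    start = text.index(start_heading)
    return text[start:text.index(end_heading, start)]

def test_red_team_cap_applied_before_verdict() -> None:
    step7 = skill_text_between("### Step 7", "### Step 8", SKILL_TEXT)
    # The cap must be described before stop_triggers is built from findings.
    assert step7.index("severity cap") < step7.index("stop_triggers")

test_red_team_cap_applied_before_verdict()
```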
…ch recipe

Update skill_command to include context.revision_guidance as third arg and add revision_guidance to optional_context_refs for backward compat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract column-value mapping from header+data rows instead of searching for experiment type names in data rows (which appear only in headers). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Trecek
left a comment
AutoSkillit PR Review — Verdict: changes_requested (8 blocking issues found; see inline comments)
```python
def test_red_team_severity_cap_applied_before_verdict(skill_text: str) -> None:
    """Severity cap must be applied BEFORE building stop_triggers in verdict logic.

    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
```
[critical] tests: Syntax error: the docstring for `test_red_team_severity_cap_applied_before_verdict` is missing its opening `"""`. Line 322 reads `"""Severity cap must be applied BEFORE building stop_triggers in verdict logic."""` (closing on the same line), then L324 is a bare string `Without this ordering, red-team criticals bypass the cap...` followed by `"""` on L325. This makes L324 a bare expression that is NOT inside a triple-quoted string, causing a SyntaxError at import time — the entire test module fails to collect.
Investigated — this is intentional. The docstring is correctly formed: `"""Severity cap must be applied BEFORE building stop_triggers in verdict logic.\n\n    Without this ordering...\n    """` (opening triple-quote on L322, closing on L325). `ast.parse()` confirms no SyntaxError. The diff hunk visible to the reviewer was truncated before the closing line. No change needed.
```python
    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
    """
    step7_text = skill_text_between("### Step 7", "### Step 8", skill_text)
```
[warning] tests: `skill_text_between("### Step 7", "### Step 8", skill_text)` is called but `skill_text_between` is not imported or defined anywhere in the visible diff. If this helper is absent from the existing test file, the test will raise a NameError at runtime.
Investigated — this is intentional. `skill_text_between` is defined at line 190 of the same file (`def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:`). The function predates this PR. The reviewer's diff hunk started at line 294 and did not include the pre-existing helper definition above it.
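For context, a helper with that signature plausibly looks like the following sketch; the repo's actual implementation may differ, e.g. in how a missing heading is reported:

```python
def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    """Return the slice of `text` from start_heading up to (not including) end_heading."""
    start = text.index(start_heading)      # raises ValueError if the heading is absent
    end = text.index(end_heading, start)
    return text[start:end]
```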
Trecek
left a comment
AutoSkillit review found 8 blocking issues. See inline comments.
Verdict: changes_requested
Critical (1):
- `tests/skills/test_review_design_contracts.py` L324: Syntax error — missing opening `"""` for docstring of `test_red_team_severity_cap_applied_before_verdict`. Module fails to import.
Warnings (7):
- `tests/skills/test_review_design_contracts.py` L326: `skill_text_between` called but not imported/defined — NameError at runtime
- `tests/skills/test_review_design_contracts.py` L314: Arbitrary 1000-char window in rubric presence check — fragile assertion
- `tests/skills/test_review_design_contracts.py` L341: Private helper `_parse_rt_rubric` placed mid-sequence between public tests — cohesion violation
- `tests/skills/test_review_design_contracts.py` L346: Missing equal-length guard before `zip(headers[1:], values[1:])` — silent truncation
- `tests/skills/test_review_design_contracts.py` L347: Off-by-one risk if `table_lines[1]` absent — IndexError instead of informative assertion
- `tests/skills/test_review_design_contracts.py` L350: Fragile `[1:]` slice assumes stable table structure — silent off-by-one if table reformatted
- `src/autoskillit/skills_extended/review-design/SKILL.md` L315: Asymmetric naming: `critical` vs `warning_findings` — inconsistent `_findings` suffix
Info (2, not blocking):
- `src/autoskillit/skills_extended/review-design/SKILL.md` L319: Hyphen vs en-dash in comment
- `tests/skills/test_resolve_design_review_contracts.py` L93: Three-way OR in assertion is too easy to satisfy accidentally
…tion boundary, add equality assertions, drop [1:] slicing

- Move `_parse_rt_rubric` to top of red-team section so it precedes all tests that call it
- Replace 1000-char fixed window with next-section-heading boundary in both `_parse_rt_rubric` and `test_red_team_severity_calibration_rubric_present`
- Change `len(table_lines) >= 2` to `== 2` to enforce exact one-header/one-data-row structure
- Add `assert len(headers) == len(values)` before `zip()` to catch mismatched column counts
- Drop `[1:]` slicing; use `dict(zip(headers, values))` so callers look up by name without index-based alignment assumptions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
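The resulting parsing approach can be sketched like this (`parse_rubric_row` is a hypothetical name, and the sketch assumes the `|---|` separator row has already been filtered out, matching the exact header-plus-one-data-row contract the commit enforces):

```python
def parse_rubric_row(table_lines: list[str]) -> dict[str, str]:
    """Parse a markdown table (header + exactly one data row) into a column->value dict."""
    assert len(table_lines) == 2, "expected exactly one header and one data row"
    headers = [cell.strip() for cell in table_lines[0].strip().strip("|").split("|")]
    values = [cell.strip() for cell in table_lines[1].strip().strip("|").split("|")]
    assert len(headers) == len(values), "mismatched column counts"
    # Name-based lookup: no [1:] slicing or positional alignment assumptions.
    return dict(zip(headers, values))
```

Callers then read `row["experiment_type"]` by column name, so reordering or adding table columns cannot silently shift values.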
…ng in verdict logic

`warning_findings` already uses the `_findings` suffix; align `critical` to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
The `review-design` skill has L1 severity calibration that correctly caps `estimand_clarity` and `hypothesis_falsifiability` by `experiment_type` — benchmarks can never produce L1 critical findings. But the red-team dimension has no analogous calibration, meaning any critical red-team finding triggers STOP regardless of experiment type. This creates an unresolvable loop for benchmarks: the red-team always finds new critical issues at progressively higher abstraction (the Hydra pattern), exhausting retries without ever producing GO.

The fix adds a red-team severity calibration rubric to `review-design/SKILL.md` (mirroring the L1 rubric), updates the verdict logic to apply the cap before building `stop_triggers`, and adds diminishing-return awareness to `resolve-design-review/SKILL.md` so it can detect goalposts-moving across rounds.

Architecture Impact
Process Flow Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([Plan submitted])
    GO([GO → execute])
    REVISE_OUT([REVISE → revise_design])
    REVISED_OUT([revised → revise_design])
    FAILED_OUT([failed → design_rejected])

    subgraph ReviewDesign ["● review-design/SKILL.md"]
        direction TB
        L1["L1 Analysis<br/>━━━━━━━━━━<br/>estimand_clarity +<br/>hypothesis_falsifiability"]
        L1GATE{"L1 Fail-Fast<br/>━━━━━━━━━━<br/>Any L1 critical?"}
        PARALLEL["L2 + L3 + L4 + RT<br/>━━━━━━━━━━<br/>Parallel analysis"]
        RTCAP["● RT Severity Cap<br/>━━━━━━━━━━<br/>RT_MAX_SEVERITY[experiment_type]<br/>Downgrade if above ceiling"]
        MERGE["Merge + Dedup<br/>━━━━━━━━━━<br/>All findings pooled"]
        VERDICT{"● Verdict Logic<br/>━━━━━━━━━━<br/>stop_triggers built<br/>AFTER rt_cap applied"}
    end

    subgraph ResolveDesign ["● resolve-design-review/SKILL.md"]
        direction TB
        PARSE["Step 1: Parse Dashboard<br/>━━━━━━━━━━<br/>Extract stop-trigger findings<br/>Classify ADDRESSABLE/STRUCTURAL/DISCUSS"]
        DIMCHECK{"prior_revision_guidance<br/>━━━━━━━━━━<br/>provided?"}
        DIMRET["● Step 1.5: Diminishing-Return<br/>━━━━━━━━━━<br/>Compare ADDRESSABLE themes<br/>vs prior guidance entries"]
        GOALPOST{"goalposts_moving<br/>━━━━━━━━━━<br/>true for any finding?"}
        RECLASSIFY["● Reclassify<br/>━━━━━━━━━━<br/>ADDRESSABLE → STRUCTURAL<br/>annotate prior_theme_match"]
        RESGATE{"Any ADDRESSABLE<br/>or DISCUSS?"}
    end

    subgraph RecipeRouting ["● research.yaml — resolve_design_review step"]
        direction LR
        RECIPE["skill_command passes<br/>━━━━━━━━━━<br/>$context.revision_guidance<br/>as optional 3rd arg"]
    end

    START --> L1
    L1 --> L1GATE
    L1GATE -->|"yes (L1 critical)"| MERGE
    L1GATE -->|"no"| PARALLEL
    PARALLEL --> RTCAP
    RTCAP --> MERGE
    MERGE --> VERDICT
    VERDICT -->|"stop_triggers present"| RECIPE
    VERDICT -->|"critical or ≥3 warnings"| REVISE_OUT
    VERDICT -->|"otherwise"| GO
    RECIPE --> PARSE
    PARSE --> DIMCHECK
    DIMCHECK -->|"yes"| DIMRET
    DIMCHECK -->|"no (round 1)"| RESGATE
    DIMRET --> GOALPOST
    GOALPOST -->|"true"| RECLASSIFY
    GOALPOST -->|"false"| RESGATE
    RECLASSIFY --> RESGATE
    RESGATE -->|"yes"| REVISED_OUT
    RESGATE -->|"all STRUCTURAL"| FAILED_OUT

    class START,GO,REVISE_OUT,REVISED_OUT,FAILED_OUT terminal;
    class L1,PARALLEL handler;
    class L1GATE,VERDICT,DIMCHECK,GOALPOST,RESGATE stateNode;
    class MERGE,PARSE phase;
    class RTCAP,DIMRET,RECLASSIFY newComponent;
    class RECIPE detector;
```

Color Legend:
Closes #609
Implementation Plan
Plan file:
`/home/talon/projects/autoskillit-runs/impl-20260404-185816-184240/.autoskillit/temp/make-plan/add-red-team-severity-calibration-by-experiment-type_plan_2026-04-04_185816.md`

🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary