diff --git a/src/autoskillit/recipes/research.yaml b/src/autoskillit/recipes/research.yaml
index 60df2484..41008ed7 100644
--- a/src/autoskillit/recipes/research.yaml
+++ b/src/autoskillit/recipes/research.yaml
@@ -127,10 +127,10 @@ steps:
   resolve_design_review:
     tool: run_skill
     with:
-      skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }}"
+      skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }} ${{ context.revision_guidance }}"
       cwd: "${{ inputs.source_dir }}"
       step_name: resolve_design_review
-      optional_context_refs: [evaluation_dashboard, experiment_plan]
+      optional_context_refs: [evaluation_dashboard, experiment_plan, revision_guidance]
     capture:
       revision_guidance: "${{ result.revision_guidance }}"
     retries: 1
diff --git a/src/autoskillit/skills_extended/resolve-design-review/SKILL.md b/src/autoskillit/skills_extended/resolve-design-review/SKILL.md
index 85c772ea..607abe04 100644
--- a/src/autoskillit/skills_extended/resolve-design-review/SKILL.md
+++ b/src/autoskillit/skills_extended/resolve-design-review/SKILL.md
@@ -25,7 +25,7 @@ loop or halt.
 
 ## Arguments
 
-`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path>`
+`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path> [prior_revision_guidance_path]`
 
 ## When to Use
 
@@ -52,7 +52,10 @@ MCP-only — not user-invocable directly.
 2. Parse two positional path arguments: `evaluation_dashboard_path`, `experiment_plan_path`
    - If missing: print `"Error: missing required argument(s) — expected <evaluation_dashboard_path> <experiment_plan_path>"`, then emit `resolution=failed`, exit 0
    - If file not found: print `"Error: file not found — {missing_path}"`, then emit `resolution=failed`, exit 0
-3. Parse stop-trigger findings from the evaluation dashboard:
+3. Parse optional third argument: `prior_revision_guidance_path`
+   - If present and file exists: read prior revision guidance for theme comparison
+   - If absent or file not found: skip diminishing-return detection (first-round behavior)
+4. Parse stop-trigger findings from the evaluation dashboard:
    - Locate machine-readable YAML block (`# --- review-design machine summary ---`)
    - Extract critical findings from L1 dimensions (estimand_clarity, hypothesis_falsifiability)
    - Extract red_team critical findings
@@ -90,6 +93,35 @@
 Triage complete (BEFORE any guidance written)
 ADDRESSABLE: N | STRUCTURAL: N | DISCUSS: N
 ```
+
+### Step 1.5: Diminishing-Return Detection (only when prior_revision_guidance_path provided)
+
+When prior revision guidance is available, compare the current ADDRESSABLE findings
+against the themes in the prior round's revision guidance to detect goalposts-moving.
+
+A finding is **goalposts-moving** when:
+- The prior guidance addressed a specific concern at scope X (e.g., "add n=100K")
+- The current finding raises the same concern at scope X+1 (e.g., "n=100K doesn't prove n=250K")
+- The pattern is: the plan improved to satisfy the prior concern, but the reviewer
+  raised the bar on the same theme
+
+Detection heuristic — launch one subagent (model: "sonnet") per ADDRESSABLE finding.
+Each subagent receives: current finding + all prior guidance entries. It returns:
+
+```json
+{
+  "goalposts_moving": true|false,
+  "prior_theme_match": "the specific prior guidance entry this finding escalates",
+  "escalation_pattern": "brief description of how the bar was raised"
+}
+```
+
+When `goalposts_moving: true`, reclassify the finding from ADDRESSABLE to STRUCTURAL
+with annotation: `"reclassified: goalposts-moving (prior theme: {prior_theme_match})"`.
+This ensures the fix-and-review cycle terminates for concerns that are not converging.
+
+Fallback: if no prior_revision_guidance_path is provided, skip this step entirely
+(preserves current first-round behavior unchanged).
 
 ### Step 2: Apply Resolution Logic
 ```
diff --git a/src/autoskillit/skills_extended/review-design/SKILL.md b/src/autoskillit/skills_extended/review-design/SKILL.md
index 70aa3069..0e1a2fb1 100644
--- a/src/autoskillit/skills_extended/review-design/SKILL.md
+++ b/src/autoskillit/skills_extended/review-design/SKILL.md
@@ -215,6 +215,22 @@ Receives: full plan text and `experiment_type` (from Step 1 triage output)
 - exploratory → HARKing vulnerability
 - ALL red-team findings must set `"requires_decision": true` and `"dimension": "red_team"`
 
+**Red-team severity calibration rubric:**
+
+| Dimension | causal_inference | benchmark | configuration_study | robustness_audit | exploratory |
+|-----------|-----------------|-----------|---------------------|------------------|-------------|
+| red_team | critical | warning | warning | warning | info |
+
+The red-team agent assigns severity based on the intrinsic quality of each finding.
+After the red-team agent returns, **cap each finding's severity** to the maximum
+allowed by the experiment type using this rubric — identical to how L1 severity
+calibration works. For `causal_inference`: critical red-team findings remain critical
+(STOP-eligible). For `benchmark`/`configuration_study`/`robustness_audit`: critical
+findings are downgraded to `warning` (REVISE-eligible but never STOP). For
+`exploratory`: all red-team findings are capped at `info` (informational only).
+
+This cap is applied in Step 7 before the verdict logic evaluates `stop_triggers`.
+
 ### Step 4: Level 3 (parallel)
 
 Run after Level 2 completes. Do not wait for the red-team agent before starting Level 3.
@@ -277,17 +293,36 @@ One synthesis pass (no subagent — orchestrator synthesizes directly):
 1. **Merge all findings** from L1, L2, L3, L4, and red-team into a single list.
 2. **Deduplicate** by `(dimension, section, message)` — identical findings from parallel agents are collapsed into one entry.
-3. **Apply verdict logic**:
+3. **Apply red-team severity cap, then verdict logic**:
 
 ```python
+# Red-team severity cap: downgrade findings above the type ceiling
+RT_MAX_SEVERITY = {
+    "causal_inference": "critical",
+    "benchmark": "warning",
+    "configuration_study": "warning",
+    "robustness_audit": "warning",
+    "exploratory": "info",
+}
+SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
+rt_cap = RT_MAX_SEVERITY[experiment_type]
+
+for f in findings:
+    if f.dimension == "red_team" and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]:
+        f.severity = rt_cap  # downgrade before verdict
+
+# Reclassify after cap
+critical_findings = [f for f in findings if f.severity == "critical"]
+warning_findings = [f for f in findings if f.severity == "warning"]
+
 # L1 fail-fast path: structural defects that block all further analysis
-stop_triggers = [f for f in critical if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
-# Red-team STOP path: adversarial critical findings after full analysis (L2–L4)
-# These fire only when the L1 gate passed; any critical red_team finding is a STOP.
-stop_triggers += [f for f in critical if f.dimension == "red_team"]
+stop_triggers = [f for f in critical_findings if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
+# Red-team STOP path: adversarial critical findings after full analysis (L2-L4)
+# These fire only when the L1 gate passed AND the severity cap still allows critical.
+stop_triggers += [f for f in critical_findings if f.dimension == "red_team"]
 
 if stop_triggers:
     verdict = "STOP"
 elif critical_findings or len(warning_findings) >= 3:
     verdict = "REVISE"
 else:
     verdict = "GO"
diff --git a/tests/skills/test_resolve_design_review_contracts.py b/tests/skills/test_resolve_design_review_contracts.py
index 9d76523b..940cb96c 100644
--- a/tests/skills/test_resolve_design_review_contracts.py
+++ b/tests/skills/test_resolve_design_review_contracts.py
@@ -82,3 +82,39 @@ def test_structured_output_tokens_present():
 def test_temp_directory_is_resolve_design_review():
     """SKILL.md must use .autoskillit/temp/resolve-design-review/ for output."""
     assert ".autoskillit/temp/resolve-design-review/" in SKILL_TEXT
+
+
+# ── Diminishing-return detection ──────────────────────────────────────────────
+
+
+def test_diminishing_return_detection_present():
+    """SKILL.md must describe diminishing-return detection for finding themes."""
+    lower = SKILL_TEXT.lower()
+    assert "diminishing" in lower or "goalposts" in lower or "theme comparison" in lower, (
+        "resolve-design-review must detect diminishing returns — repeated findings "
+        "that are higher-abstraction restatements of previously addressed concerns."
+    )
+
+
+def test_goalposts_reclassified_as_structural():
+    """Goalposts-moving findings must be reclassified as STRUCTURAL."""
+    lower = SKILL_TEXT.lower()
+    has_goalposts_structural = "goalposts" in lower and "structural" in lower
+    has_diminishing_structural = "diminishing" in lower and "structural" in lower
+    has_reclassify = "reclassif" in lower
+    assert has_goalposts_structural or has_diminishing_structural or has_reclassify, (
+        "Goalposts-moving findings must be reclassified as STRUCTURAL — "
+        "the fix-and-review cycle is not converging on that concern."
+    )
+
+
+def test_revision_guidance_context_input():
+    """SKILL.md must accept prior revision_guidance as context for theme comparison."""
+    lower = SKILL_TEXT.lower()
+    assert "prior" in lower or "previous" in lower, (
+        "resolve-design-review must reference prior revision context "
+        "for diminishing-return detection."
+    )
+    assert "[prior_revision_guidance_path]" in SKILL_TEXT or "optional" in lower, (
+        "Prior revision_guidance must be an optional argument for backward compatibility."
+    )
diff --git a/tests/skills/test_review_design_contracts.py b/tests/skills/test_review_design_contracts.py
index 5337de38..6a7f6502 100644
--- a/tests/skills/test_review_design_contracts.py
+++ b/tests/skills/test_review_design_contracts.py
@@ -294,3 +294,91 @@ def test_l3_l4_subagents_receive_experiment_type(
         f"{step_heading}: L3/L4 subagents must receive experiment_type. "
         "Type-agnostic severity calibration is a structural gap."
     )
+
+
+# ── Red-team severity calibration ─────────────────────────────────────────────
+
+
+def _parse_rt_rubric(skill_text: str) -> dict[str, str]:
+    """Parse the red-team severity calibration rubric into {experiment_type: severity}."""
+    rt_cal_idx = skill_text.lower().find("red-team severity calibration")
+    assert rt_cal_idx != -1, "Red-team severity calibration rubric not found"
+    next_section_idx = skill_text.find("\n###", rt_cal_idx)
+    rt_section = (
+        skill_text[rt_cal_idx:]
+        if next_section_idx == -1
+        else skill_text[rt_cal_idx:next_section_idx]
+    )
+    table_lines = [ln for ln in rt_section.splitlines() if "|" in ln and "---" not in ln]
+    assert len(table_lines) == 2, "Rubric must have exactly one header row and one data row"
+    headers = [c.strip().lower() for c in table_lines[0].split("|") if c.strip()]
+    values = [c.strip().lower() for c in table_lines[1].split("|") if c.strip()]
+    assert len(headers) == len(values), (
+        f"Rubric table header/value count mismatch: {len(headers)} headers vs {len(values)} values"
+    )
+    return dict(zip(headers, values))
+
+
+def test_red_team_severity_calibration_rubric_present(skill_text: str) -> None:
+    """Red-team dimension must have a severity calibration rubric by experiment type.
+
+    Without this rubric, any critical red-team finding triggers STOP regardless
+    of experiment type, creating an unresolvable loop for benchmarks.
+    """
+    rt_cal_idx = skill_text.lower().find("red-team severity calibration")
+    assert rt_cal_idx != -1, (
+        "Red-team severity calibration rubric not found in SKILL.md. "
+        "Without it, any critical red-team finding triggers STOP regardless "
+        "of experiment type."
+    )
+    next_section_idx = skill_text.find("\n###", rt_cal_idx)
+    rt_section = (
+        skill_text[rt_cal_idx:]
+        if next_section_idx == -1
+        else skill_text[rt_cal_idx:next_section_idx]
+    )
+    for exp_type in ["causal_inference", "benchmark", "exploratory"]:
+        assert exp_type in rt_section, (
+            f"Red-team calibration rubric must specify {exp_type} severity cap."
+        )
+
+
+def test_red_team_severity_cap_applied_before_verdict(skill_text: str) -> None:
+    """Severity cap must be applied BEFORE building stop_triggers in verdict logic.
+
+    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
+    """
+    step7_text = skill_text_between("### Step 7", "### Step 8", skill_text)
+    cap_idx = step7_text.find("rt_cap")
+    stop_idx = step7_text.find('f.dimension == "red_team"')
+    assert cap_idx != -1, (
+        "Step 7 verdict logic must reference rt_cap for red-team severity capping."
+    )
+    assert stop_idx != -1, (
+        "Step 7 verdict logic must reference red_team dimension in stop_triggers."
+    )
+    assert cap_idx < stop_idx, (
+        "rt_cap must be applied BEFORE the red_team stop_triggers line — "
+        "otherwise the cap has no effect on STOP eligibility."
+    )
+
+
+def test_benchmark_red_team_cannot_stop(skill_text: str) -> None:
+    """Benchmark experiment type must cap red-team severity at warning (no STOP)."""
+    rubric = _parse_rt_rubric(skill_text)
+    assert "benchmark" in rubric, "Benchmark column not found in red-team calibration rubric"
+    assert rubric["benchmark"] == "warning", (
+        "Benchmark red-team severity must be capped at 'warning' — "
+        "STOP-eligible red-team findings are unreasonable for benchmarks."
+    )
+
+
+def test_causal_inference_red_team_can_stop(skill_text: str) -> None:
+    """causal_inference must retain critical as max red-team severity (STOP eligible)."""
+    rubric = _parse_rt_rubric(skill_text)
+    assert "causal_inference" in rubric, (
+        "causal_inference column not found in red-team calibration rubric"
+    )
+    assert rubric["causal_inference"] == "critical", (
+        "causal_inference must retain critical as max red-team severity."
+    )
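
Reviewer note: the Step 7 severity-cap and verdict pseudocode in this patch can be exercised end-to-end outside the orchestrator. Below is a minimal, hypothetical sketch: `Finding` and `verdict_for` are illustrative stand-ins (not names from the codebase), while the two dict constants are copied from the SKILL.md pseudocode.

```python
# Minimal sketch of the proposed Step 7 logic. Assumptions: findings are
# (dimension, severity) pairs; `Finding` and `verdict_for` are invented here.
from dataclasses import dataclass

RT_MAX_SEVERITY = {
    "causal_inference": "critical",
    "benchmark": "warning",
    "configuration_study": "warning",
    "robustness_audit": "warning",
    "exploratory": "info",
}
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Finding:
    dimension: str
    severity: str


def verdict_for(findings: list[Finding], experiment_type: str) -> str:
    # Cap red-team severities to the type ceiling BEFORE any verdict logic runs.
    rt_cap = RT_MAX_SEVERITY[experiment_type]
    for f in findings:
        if f.dimension == "red_team" and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]:
            f.severity = rt_cap

    # Reclassify after the cap so downgraded findings cannot reach stop_triggers.
    critical_findings = [f for f in findings if f.severity == "critical"]
    warning_findings = [f for f in findings if f.severity == "warning"]

    stop_triggers = [
        f for f in critical_findings
        if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}
    ]
    stop_triggers += [f for f in critical_findings if f.dimension == "red_team"]

    if stop_triggers:
        return "STOP"
    if critical_findings or len(warning_findings) >= 3:
        return "REVISE"
    return "GO"
```

Under this sketch, a critical red-team finding STOPs a causal_inference plan, while the same finding on a benchmark is capped to warning and, on its own, does not even force a REVISE (three warnings are needed) — worth confirming that is the intended calibration.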
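
Reviewer note: the Step 1.5 reclassification bookkeeping can likewise be sketched independently of the subagent call. In this hypothetical sketch, `detect` stands in for the per-finding "sonnet" subagent (its return dict mirrors the JSON contract in the SKILL.md), and the dict-shaped finding plus all function names are invented for illustration.

```python
# Sketch of the proposed ADDRESSABLE -> STRUCTURAL reclassification.
# `detect` is an injected stand-in for the per-finding subagent.
def apply_goalposts_detection(addressable, detect):
    """Split ADDRESSABLE findings into (still addressable, reclassified STRUCTURAL)."""
    still_addressable, reclassified = [], []
    for finding in addressable:
        result = detect(finding)
        if result["goalposts_moving"]:
            # Annotate exactly as the SKILL.md specifies before promoting to STRUCTURAL.
            finding["annotation"] = (
                "reclassified: goalposts-moving "
                f"(prior theme: {result['prior_theme_match']})"
            )
            reclassified.append(finding)
        else:
            still_addressable.append(finding)
    return still_addressable, reclassified


# Demo with a stubbed detector that flags the sample-size escalation example
# from the patch ("n=100K doesn't prove n=250K").
def _demo_detect(finding):
    moving = "doesn't prove" in finding["message"]
    return {
        "goalposts_moving": moving,
        "prior_theme_match": "add n=100K" if moving else "",
        "escalation_pattern": "sample-size bar raised again" if moving else "",
    }

remaining, structural = apply_goalposts_detection(
    [{"message": "n=100K doesn't prove n=250K"}, {"message": "missing power analysis"}],
    _demo_detect,
)
```

Keeping the subagent behind an injected callable like this would also make the contract tests able to exercise the reclassification path without a live model.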