TalonT-Org · Trecek · Apr 5, 2026 · Apr 5, 2026 · Apr 5, 2026 · Apr 5, 2026
diff --git a/src/autoskillit/recipes/research.yaml b/src/autoskillit/recipes/research.yaml
@@ -127,10 +127,10 @@ steps:
   resolve_design_review:
     tool: run_skill
     with:
-      skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }}"
+      skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }} ${{ context.revision_guidance }}"
       cwd: "${{ inputs.source_dir }}"
       step_name: resolve_design_review
-    optional_context_refs: [evaluation_dashboard, experiment_plan]
+    optional_context_refs: [evaluation_dashboard, experiment_plan, revision_guidance]
     capture:
       revision_guidance: "${{ result.revision_guidance }}"
     retries: 1

diff --git a/src/autoskillit/skills_extended/resolve-design-review/SKILL.md b/src/autoskillit/skills_extended/resolve-design-review/SKILL.md
@@ -25,7 +25,7 @@ loop or halt.
 
 ## Arguments
 
-`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path>`
+`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path> [prior_revision_guidance_path]`
 
 ## When to Use
 
@@ -52,7 +52,10 @@ MCP-only — not user-invocable directly.
 2. Parse two positional path arguments: `evaluation_dashboard_path`, `experiment_plan_path`
    - If missing: print `"Error: missing required argument(s) — expected <evaluation_dashboard_path> <experiment_plan_path>"`, then emit `resolution=failed`, exit 0
    - If file not found: print `"Error: file not found — {missing_path}"`, then emit `resolution=failed`, exit 0
-3. Parse stop-trigger findings from the evaluation dashboard:
+3. Parse optional third argument: `prior_revision_guidance_path`
+   - If present and file exists: read prior revision guidance for theme comparison
+   - If absent or file not found: skip diminishing-return detection (first-round behavior)
+4. Parse stop-trigger findings from the evaluation dashboard:
    - Locate machine-readable YAML block (`# --- review-design machine summary ---`)
    - Extract critical findings from L1 dimensions (estimand_clarity, hypothesis_falsifiability)
    - Extract red_team critical findings
@@ -90,6 +93,35 @@ Triage complete (BEFORE any guidance written)
 ADDRESSABLE: N | STRUCTURAL: N | DISCUSS: N
 ```
 
+### Step 1.5: Diminishing-Return Detection (only when prior_revision_guidance_path provided)
+
+When prior revision guidance is available, compare the current ADDRESSABLE findings
+against the themes in the prior round's revision guidance to detect goalposts-moving.
+
+A finding is **goalposts-moving** when:
+- The prior guidance addressed a specific concern at scope X (e.g., "add n=100K")
+- The current finding raises the same concern at scope X+1 (e.g., "n=100K doesn't prove n=250K")
+- The pattern is: the plan improved to satisfy the prior concern, but the reviewer
+  raised the bar on the same theme
+
+Detection heuristic — launch one subagent (model: "sonnet") per ADDRESSABLE finding.
+Each subagent receives: current finding + all prior guidance entries. It returns:
+
+```json
+{
+  "goalposts_moving": true|false,
+  "prior_theme_match": "the specific prior guidance entry this finding escalates",
+  "escalation_pattern": "brief description of how the bar was raised"
+}
+```
+
+When `goalposts_moving: true`, reclassify the finding from ADDRESSABLE to STRUCTURAL
+with annotation: `"reclassified: goalposts-moving (prior theme: {prior_theme_match})"`.
+This ensures the fix-and-review cycle terminates for concerns that are not converging.
+
+Fallback: if no prior_revision_guidance_path is provided, skip this step entirely
+(preserves current first-round behavior unchanged).
+
 ### Step 2: Apply Resolution Logic
 
 ```

diff --git a/src/autoskillit/skills_extended/review-design/SKILL.md b/src/autoskillit/skills_extended/review-design/SKILL.md
@@ -215,6 +215,22 @@ Receives: full plan text and `experiment_type` (from Step 1 triage output)
   - exploratory → HARKing vulnerability
 - ALL red-team findings must set `"requires_decision": true` and `"dimension": "red_team"`
 
+**Red-team severity calibration rubric:**
+
+| Dimension | causal_inference | benchmark | configuration_study | robustness_audit | exploratory |
+|-----------|-----------------|-----------|---------------------|------------------|-------------|
+| red_team  | critical        | warning   | warning             | warning          | info        |
+
+The red-team agent assigns severity based on the intrinsic quality of each finding.
+After the red-team agent returns, **cap each finding's severity** to the maximum
+allowed by the experiment type using this rubric — identical to how L1 severity
+calibration works. For `causal_inference`: critical red-team findings remain critical
+(STOP-eligible). For `benchmark`/`configuration_study`/`robustness_audit`: critical
+findings are downgraded to `warning` (REVISE-eligible but never STOP). For
+`exploratory`: all red-team findings are capped at `info` (informational only).
+
+This cap is applied in Step 7 before the verdict logic evaluates `stop_triggers`.
+
 ### Step 4: Level 3 (parallel)
 
 Run after Level 2 completes. Do not wait for the red-team agent before starting Level 3.
@@ -277,17 +293,36 @@ One synthesis pass (no subagent — orchestrator synthesizes directly):
 1. **Merge all findings** from L1, L2, L3, L4, and red-team into a single list.
 2. **Deduplicate** by `(dimension, section, message)` — identical findings from parallel
    agents are collapsed into one entry.
-3. **Apply verdict logic**:
+3. **Apply red-team severity cap, then verdict logic**:
    ```python
+   # Red-team severity cap: downgrade findings above the type ceiling
+   RT_MAX_SEVERITY = {
+       "causal_inference": "critical",
+       "benchmark": "warning",
+       "configuration_study": "warning",
+       "robustness_audit": "warning",
+       "exploratory": "info",
+   }
+   SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
+   rt_cap = RT_MAX_SEVERITY[experiment_type]
+
+   for f in findings:
+       if f.dimension == "red_team" and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]:
+           f.severity = rt_cap  # downgrade before verdict
+
+   # Reclassify after cap
+   critical = [f for f in findings if f.severity == "critical"]
+   warning_findings = [f for f in findings if f.severity == "warning"]
+
    # L1 fail-fast path: structural defects that block all further analysis
    stop_triggers = [f for f in critical if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
-   # Red-team STOP path: adversarial critical findings after full analysis (L2–L4)
-   # These fire only when the L1 gate passed; any critical red_team finding is a STOP.
+   # Red-team STOP path: adversarial critical findings after full analysis (L2-L4)
+   # These fire only when the L1 gate passed AND the severity cap still allows critical.
    stop_triggers += [f for f in critical if f.dimension == "red_team"]
 
    if stop_triggers:
        verdict = "STOP"
-   elif critical_findings or len(warning_findings) >= 3:
+   elif critical or len(warning_findings) >= 3:
        verdict = "REVISE"
    else:
        verdict = "GO"

diff --git a/tests/skills/test_resolve_design_review_contracts.py b/tests/skills/test_resolve_design_review_contracts.py
@@ -82,3 +82,39 @@ def test_structured_output_tokens_present():
 def test_temp_directory_is_resolve_design_review():
     """SKILL.md must use .autoskillit/temp/resolve-design-review/ for output."""
     assert ".autoskillit/temp/resolve-design-review/" in SKILL_TEXT
+
+
+# ── Diminishing-return detection ──────────────────────────────────────────────
+
+
+def test_diminishing_return_detection_present():
+    """SKILL.md must describe diminishing-return detection for finding themes."""
+    lower = SKILL_TEXT.lower()
+    assert "diminishing" in lower or "goalposts" in lower or "theme comparison" in lower, (
+        "resolve-design-review must detect diminishing returns — repeated findings "
+        "that are higher-abstraction restatements of previously addressed concerns."
+    )
+
+
+def test_goalposts_reclassified_as_structural():
+    """Goalposts-moving findings must be reclassified as STRUCTURAL."""
+    lower = SKILL_TEXT.lower()
+    has_goalposts_structural = "goalposts" in lower and "structural" in lower
+    has_diminishing_structural = "diminishing" in lower and "structural" in lower
+    has_reclassify = "reclassif" in lower
+    assert has_goalposts_structural or has_diminishing_structural or has_reclassify, (
+        "Goalposts-moving findings must be reclassified as STRUCTURAL — "
+        "the fix-and-review cycle is not converging on that concern."
+    )
+
+
+def test_revision_guidance_context_input():
+    """SKILL.md must accept prior revision_guidance as context for theme comparison."""
+    lower = SKILL_TEXT.lower()
+    assert "prior" in lower or "previous" in lower, (
+        "resolve-design-review must reference prior revision context "
+        "for diminishing-return detection."
+    )
+    assert "[prior_revision_guidance_path]" in SKILL_TEXT or "optional" in lower, (
+        "Prior revision_guidance must be an optional argument for backward compatibility."
+    )
diff --git a/tests/skills/test_review_design_contracts.py b/tests/skills/test_review_design_contracts.py
@@ -294,3 +294,78 @@ def test_l3_l4_subagents_receive_experiment_type(
         f"{step_heading}: L3/L4 subagents must receive experiment_type. "
         "Type-agnostic severity calibration is a structural gap."
     )
+
+
+# ── Red-team severity calibration ─────────────────────────────────────────────
+
+
+def test_red_team_severity_calibration_rubric_present(skill_text: str) -> None:
+    """Red-team dimension must have a severity calibration rubric by experiment type.
+
+    Without this rubric, any critical red-team finding triggers STOP regardless
+    of experiment type, creating an unresolvable loop for benchmarks.
+    """
+    rt_cal_idx = skill_text.lower().find("red-team severity calibration")
+    assert rt_cal_idx != -1, (
+        "Red-team severity calibration rubric not found in SKILL.md. "
+        "Without it, any critical red-team finding triggers STOP regardless "
+        "of experiment type."
+    )
+    rt_section = skill_text[rt_cal_idx : rt_cal_idx + 1000]
+    for exp_type in ["causal_inference", "benchmark", "exploratory"]:
+        assert exp_type in rt_section, (
+            f"Red-team calibration rubric must specify {exp_type} severity cap."
+        )
+
+
+def test_red_team_severity_cap_applied_before_verdict(skill_text: str) -> None:
+    """Severity cap must be applied BEFORE building stop_triggers in verdict logic.
+
+    Without this ordering, red-team criticals bypass the cap and still trigger STOP.
+    """
+    step7_text = skill_text_between("### Step 7", "### Step 8", skill_text)
+    cap_idx = step7_text.find("rt_cap")
+    stop_idx = step7_text.find('f.dimension == "red_team"')
+    assert cap_idx != -1, (
+        "Step 7 verdict logic must reference rt_cap for red-team severity capping."
+    )
+    assert stop_idx != -1, (
+        "Step 7 verdict logic must reference red_team dimension in stop_triggers."
+    )
+    assert cap_idx < stop_idx, (
+        "rt_cap must be applied BEFORE the red_team stop_triggers line — "
+        "otherwise the cap has no effect on STOP eligibility."
+    )
+
+
+def _parse_rt_rubric(skill_text: str) -> dict[str, str]:
+    """Parse the red-team severity calibration rubric into {experiment_type: severity}."""
+    rt_cal_idx = skill_text.lower().find("red-team severity calibration")
+    assert rt_cal_idx != -1, "Red-team severity calibration rubric not found"
+    rt_section = skill_text[rt_cal_idx : rt_cal_idx + 1000]
+    table_lines = [ln for ln in rt_section.splitlines() if "|" in ln and "---" not in ln]
+    assert len(table_lines) >= 2, "Rubric must have header + data row"
+    headers = [c.strip().lower() for c in table_lines[0].split("|") if c.strip()]
+    values = [c.strip().lower() for c in table_lines[1].split("|") if c.strip()]
+    return dict(zip(headers[1:], values[1:]))
+
+
+def test_benchmark_red_team_cannot_stop(skill_text: str) -> None:
+    """Benchmark experiment type must cap red-team severity at warning (no STOP)."""
+    rubric = _parse_rt_rubric(skill_text)
+    assert "benchmark" in rubric, "Benchmark column not found in red-team calibration rubric"
+    assert rubric["benchmark"] == "warning", (
+        "Benchmark red-team severity must be capped at 'warning' — "
+        "STOP-eligible red-team findings are unreasonable for benchmarks."
+    )
+
+
+def test_causal_inference_red_team_can_stop(skill_text: str) -> None:
+    """causal_inference must retain critical as max red-team severity (STOP eligible)."""
+    rubric = _parse_rt_rubric(skill_text)
+    assert "causal_inference" in rubric, (
+        "causal_inference column not found in red-team calibration rubric"
+    )
+    assert rubric["causal_inference"] == "critical", (
+        "causal_inference must retain critical as max red-team severity."
+    )