Merged
4 changes: 2 additions & 2 deletions src/autoskillit/recipes/research.yaml
@@ -127,10 +127,10 @@ steps:
resolve_design_review:
tool: run_skill
with:
skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }}"
skill_command: "/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} ${{ context.experiment_plan }} ${{ context.revision_guidance }}"
cwd: "${{ inputs.source_dir }}"
step_name: resolve_design_review
optional_context_refs: [evaluation_dashboard, experiment_plan]
optional_context_refs: [evaluation_dashboard, experiment_plan, revision_guidance]
capture:
revision_guidance: "${{ result.revision_guidance }}"
retries: 1
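Backward compatibility here rests on the template engine: presumably an unresolved `${{ context.revision_guidance }}` collapses to an empty string on round 1, so the skill still receives its two required arguments. A minimal sketch of that resolution, under that assumption (`resolve_command` is a hypothetical stand-in, not the recipe engine's actual API):

```python
import re

def resolve_command(template: str, context: dict) -> str:
    """Substitute ${{ context.NAME }} refs; missing refs become empty strings."""
    out = re.sub(r"\$\{\{\s*context\.(\w+)\s*\}\}",
                 lambda m: context.get(m.group(1), ""), template)
    return " ".join(out.split())  # collapse the gap left by a missing ref

TEMPLATE = ("/autoskillit:resolve-design-review ${{ context.evaluation_dashboard }} "
            "${{ context.experiment_plan }} ${{ context.revision_guidance }}")

# Round 1: revision_guidance not yet captured, so the skill sees two arguments.
round1 = resolve_command(TEMPLATE,
                         {"evaluation_dashboard": "dash.md",
                          "experiment_plan": "plan.md"})
# Round 2: the captured guidance path is threaded back in as a third argument.
round2 = resolve_command(TEMPLATE,
                         {"evaluation_dashboard": "dash.md",
                          "experiment_plan": "plan.md",
                          "revision_guidance": "guidance.md"})
```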
36 changes: 34 additions & 2 deletions src/autoskillit/skills_extended/resolve-design-review/SKILL.md
@@ -25,7 +25,7 @@ loop or halt.

## Arguments

`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path>`
`/autoskillit:resolve-design-review <evaluation_dashboard_path> <experiment_plan_path> [prior_revision_guidance_path]`

## When to Use

@@ -52,7 +52,10 @@ MCP-only — not user-invocable directly.
2. Parse two positional path arguments: `evaluation_dashboard_path`, `experiment_plan_path`
- If missing: print `"Error: missing required argument(s) — expected <evaluation_dashboard_path> <experiment_plan_path>"`, then emit `resolution=failed`, exit 0
- If file not found: print `"Error: file not found — {missing_path}"`, then emit `resolution=failed`, exit 0
3. Parse stop-trigger findings from the evaluation dashboard:
3. Parse optional third argument: `prior_revision_guidance_path`
- If present and file exists: read prior revision guidance for theme comparison
- If absent or file not found: skip diminishing-return detection (first-round behavior)
4. Parse stop-trigger findings from the evaluation dashboard:
- Locate machine-readable YAML block (`# --- review-design machine summary ---`)
- Extract critical findings from L1 dimensions (estimand_clarity, hypothesis_falsifiability)
- Extract red_team critical findings
@@ -90,6 +93,35 @@ Triage complete (BEFORE any guidance written)
ADDRESSABLE: N | STRUCTURAL: N | DISCUSS: N
```

### Step 1.5: Diminishing-Return Detection (only when prior_revision_guidance_path provided)

When prior revision guidance is available, compare the current ADDRESSABLE findings
against the themes in the prior round's revision guidance to detect goalposts-moving.

A finding is **goalposts-moving** when:
- The prior guidance addressed a specific concern at scope X (e.g., "add n=100K")
- The current finding raises the same concern at scope X+1 (e.g., "n=100K doesn't prove n=250K")
- The pattern is: the plan improved to satisfy the prior concern, but the reviewer
raised the bar on the same theme

Detection heuristic — launch one subagent (model: "sonnet") per ADDRESSABLE finding.
Each subagent receives: current finding + all prior guidance entries. It returns:

```json
{
"goalposts_moving": true|false,
"prior_theme_match": "the specific prior guidance entry this finding escalates",
"escalation_pattern": "brief description of how the bar was raised"
}
```

When `goalposts_moving: true`, reclassify the finding from ADDRESSABLE to STRUCTURAL
with annotation: `"reclassified: goalposts-moving (prior theme: {prior_theme_match})"`.
This ensures the fix-and-review cycle terminates for concerns that are not converging.

Fallback: if no prior_revision_guidance_path is provided, skip this step entirely
(preserves current first-round behavior unchanged).

### Step 2: Apply Resolution Logic

```
43 changes: 39 additions & 4 deletions src/autoskillit/skills_extended/review-design/SKILL.md
@@ -215,6 +215,22 @@ Receives: full plan text and `experiment_type` (from Step 1 triage output)
- exploratory → HARKing vulnerability
- ALL red-team findings must set `"requires_decision": true` and `"dimension": "red_team"`

**Red-team severity calibration rubric:**

| Dimension | causal_inference | benchmark | configuration_study | robustness_audit | exploratory |
|-----------|-----------------|-----------|---------------------|------------------|-------------|
| red_team | critical | warning | warning | warning | info |

The red-team agent assigns severity based on the intrinsic quality of each finding.
After the red-team agent returns, **cap each finding's severity** to the maximum
allowed by the experiment type using this rubric — identical to how L1 severity
calibration works. For `causal_inference`: critical red-team findings remain critical
(STOP-eligible). For `benchmark`/`configuration_study`/`robustness_audit`: critical
findings are downgraded to `warning` (REVISE-eligible but never STOP). For
`exploratory`: all red-team findings are capped at `info` (informational only).

This cap is applied in Step 7 before the verdict logic evaluates `stop_triggers`.
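A runnable sketch of that cap-then-verdict ordering, consolidating the Step 7 pseudocode (the `Finding` record and `verdict` helper are illustrative stand-ins; the dimension names in the usage below are hypothetical):

```python
from dataclasses import dataclass

RT_MAX_SEVERITY = {
    "causal_inference": "critical",
    "benchmark": "warning",
    "configuration_study": "warning",
    "robustness_audit": "warning",
    "exploratory": "info",
}
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

@dataclass
class Finding:
    dimension: str
    severity: str

def verdict(findings: list[Finding], experiment_type: str) -> str:
    # Cap red-team severity to the type ceiling BEFORE verdict evaluation.
    rt_cap = RT_MAX_SEVERITY[experiment_type]
    for f in findings:
        if (f.dimension == "red_team"
                and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]):
            f.severity = rt_cap  # downgrade before verdict
    critical = [f for f in findings if f.severity == "critical"]
    warnings_ = [f for f in findings if f.severity == "warning"]
    stop = [f for f in critical
            if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
    stop += [f for f in critical if f.dimension == "red_team"]
    if stop:
        return "STOP"
    if critical or len(warnings_) >= 3:
        return "REVISE"
    return "GO"
```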

### Step 4: Level 3 (parallel)

Run after Level 2 completes. Do not wait for the red-team agent before starting Level 3.
@@ -277,17 +293,36 @@ One synthesis pass (no subagent — orchestrator synthesizes directly):
1. **Merge all findings** from L1, L2, L3, L4, and red-team into a single list.
2. **Deduplicate** by `(dimension, section, message)` — identical findings from parallel
agents are collapsed into one entry.
3. **Apply verdict logic**:
3. **Apply red-team severity cap, then verdict logic**:
```python
# Red-team severity cap: downgrade findings above the type ceiling
RT_MAX_SEVERITY = {
"causal_inference": "critical",
"benchmark": "warning",
"configuration_study": "warning",
"robustness_audit": "warning",
"exploratory": "info",
}
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
rt_cap = RT_MAX_SEVERITY[experiment_type]

for f in findings:
if f.dimension == "red_team" and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]:
f.severity = rt_cap # downgrade before verdict

# Reclassify after cap
critical = [f for f in findings if f.severity == "critical"]
warning_findings = [f for f in findings if f.severity == "warning"]

# L1 fail-fast path: structural defects that block all further analysis
stop_triggers = [f for f in critical if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]
# Red-team STOP path: adversarial critical findings after full analysis (L2–L4)
# These fire only when the L1 gate passed; any critical red_team finding is a STOP.
# Red-team STOP path: adversarial critical findings after full analysis (L2-L4)
# These fire only when the L1 gate passed AND the severity cap still allows critical.
stop_triggers += [f for f in critical if f.dimension == "red_team"]

if stop_triggers:
verdict = "STOP"
elif critical_findings or len(warning_findings) >= 3:
elif critical or len(warning_findings) >= 3:
verdict = "REVISE"
else:
verdict = "GO"
36 changes: 36 additions & 0 deletions tests/skills/test_resolve_design_review_contracts.py
@@ -82,3 +82,39 @@ def test_structured_output_tokens_present():
def test_temp_directory_is_resolve_design_review():
"""SKILL.md must use .autoskillit/temp/resolve-design-review/ for output."""
assert ".autoskillit/temp/resolve-design-review/" in SKILL_TEXT


# ── Diminishing-return detection ──────────────────────────────────────────────


def test_diminishing_return_detection_present():
"""SKILL.md must describe diminishing-return detection for finding themes."""
lower = SKILL_TEXT.lower()
assert "diminishing" in lower or "goalposts" in lower or "theme comparison" in lower, (
"resolve-design-review must detect diminishing returns — repeated findings "
"that are higher-abstraction restatements of previously addressed concerns."
)


def test_goalposts_reclassified_as_structural():
"""Goalposts-moving findings must be reclassified as STRUCTURAL."""
lower = SKILL_TEXT.lower()
has_goalposts_structural = "goalposts" in lower and "structural" in lower
has_diminishing_structural = "diminishing" in lower and "structural" in lower
has_reclassify = "reclassif" in lower
assert has_goalposts_structural or has_diminishing_structural or has_reclassify, (
"Goalposts-moving findings must be reclassified as STRUCTURAL — "
"the fix-and-review cycle is not converging on that concern."
)


def test_revision_guidance_context_input():
"""SKILL.md must accept prior revision_guidance as context for theme comparison."""
lower = SKILL_TEXT.lower()
assert "prior" in lower or "previous" in lower, (
"resolve-design-review must reference prior revision context "
"for diminishing-return detection."
)
assert "[prior_revision_guidance_path]" in SKILL_TEXT or "optional" in lower, (
"Prior revision_guidance must be an optional argument for backward compatibility."
)
75 changes: 75 additions & 0 deletions tests/skills/test_review_design_contracts.py
@@ -294,3 +294,78 @@ def test_l3_l4_subagents_receive_experiment_type(
f"{step_heading}: L3/L4 subagents must receive experiment_type. "
"Type-agnostic severity calibration is a structural gap."
)


# ── Red-team severity calibration ─────────────────────────────────────────────


def test_red_team_severity_calibration_rubric_present(skill_text: str) -> None:
"""Red-team dimension must have a severity calibration rubric by experiment type.

Without this rubric, any critical red-team finding triggers STOP regardless
of experiment type, creating an unresolvable loop for benchmarks.
"""
rt_cal_idx = skill_text.lower().find("red-team severity calibration")
assert rt_cal_idx != -1, (
"Red-team severity calibration rubric not found in SKILL.md. "
"Without it, any critical red-team finding triggers STOP regardless "
"of experiment type."
)
rt_section = skill_text[rt_cal_idx : rt_cal_idx + 1000]
for exp_type in ["causal_inference", "benchmark", "exploratory"]:
assert exp_type in rt_section, (
f"Red-team calibration rubric must specify {exp_type} severity cap."
)


def test_red_team_severity_cap_applied_before_verdict(skill_text: str) -> None:
"""Severity cap must be applied BEFORE building stop_triggers in verdict logic.

Without this ordering, red-team criticals bypass the cap and still trigger STOP.

[critical] tests: Syntax error: the docstring for test_red_team_severity_cap_applied_before_verdict is missing its opening """. Line 322 reads """Severity cap must be applied BEFORE building stop_triggers in verdict logic.""" (closing on same line), then L324 is a bare string Without this ordering, red-team criticals bypass the cap... followed by """ on L325. This makes L324 a bare expression that is NOT inside a triple-quoted string, causing a SyntaxError at import time — the entire test module fails to collect.


Investigated — this is intentional. The docstring is correctly formed: """Severity cap must be applied BEFORE building stop_triggers in verdict logic.\n\n Without this ordering...\n """ (opening triple-quote on L322, closing on L325). ast.parse() confirms no SyntaxError. The diff hunk visible to the reviewer was truncated before the closing line. No change needed.

"""
step7_text = skill_text_between("### Step 7", "### Step 8", skill_text)

[warning] tests: skill_text_between("### Step 7", "### Step 8", skill_text) is called but skill_text_between is not imported or defined anywhere in the visible diff. If this helper is absent from the existing test file, the test will raise a NameError at runtime.


Investigated — this is intentional. skill_text_between is defined at line 190 of the same file (def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:). The function predates this PR. The reviewer's diff hunk started at line 294 and did not include the pre-existing helper definition above it.

cap_idx = step7_text.find("rt_cap")
stop_idx = step7_text.find('f.dimension == "red_team"')
assert cap_idx != -1, (
"Step 7 verdict logic must reference rt_cap for red-team severity capping."
)
assert stop_idx != -1, (
"Step 7 verdict logic must reference red_team dimension in stop_triggers."
)
assert cap_idx < stop_idx, (
"rt_cap must be applied BEFORE the red_team stop_triggers line — "
"otherwise the cap has no effect on STOP eligibility."
)


def _parse_rt_rubric(skill_text: str) -> dict[str, str]:
"""Parse the red-team severity calibration rubric into {experiment_type: severity}."""
rt_cal_idx = skill_text.lower().find("red-team severity calibration")
assert rt_cal_idx != -1, "Red-team severity calibration rubric not found"
rt_section = skill_text[rt_cal_idx : rt_cal_idx + 1000]
table_lines = [ln for ln in rt_section.splitlines() if "|" in ln and "---" not in ln]
assert len(table_lines) >= 2, "Rubric must have header + data row"
headers = [c.strip().lower() for c in table_lines[0].split("|") if c.strip()]
values = [c.strip().lower() for c in table_lines[1].split("|") if c.strip()]
return dict(zip(headers[1:], values[1:]))


def test_benchmark_red_team_cannot_stop(skill_text: str) -> None:
"""Benchmark experiment type must cap red-team severity at warning (no STOP)."""
rubric = _parse_rt_rubric(skill_text)
assert "benchmark" in rubric, "Benchmark column not found in red-team calibration rubric"
assert rubric["benchmark"] == "warning", (
"Benchmark red-team severity must be capped at 'warning' — "
"STOP-eligible red-team findings are unreasonable for benchmarks."
)


def test_causal_inference_red_team_can_stop(skill_text: str) -> None:
"""causal_inference must retain critical as max red-team severity (STOP eligible)."""
rubric = _parse_rt_rubric(skill_text)
assert "causal_inference" in rubric, (
"causal_inference column not found in red-team calibration rubric"
)
assert rubric["causal_inference"] == "critical", (
"causal_inference must retain critical as max red-team severity."
)