What
The CI A/B agent ran a 5-way feature ablation on the same 14-challenge stubborn slice (XBOW). The result was unexpected and important enough to track in public.
| Configuration | Mode | Flags | Cost |
| --- | --- | --- | --- |
| `features=none` (no v0.6.0 moat) | white-box | 3/13 | $7.99 |
| `features=experimental` | white-box | 3/13 | $11.99 |
| `features=all` (v0.6.0 headline) | white-box | 2/13 | $12.41 |
| `features=all` | black-box | 0/13 | $8.42 |
Runs: 24022992439 24022991529 24022989979 24022990816
What this says
On the stubborn-14 slice, the v0.6.0 11-layer FP reduction moat:
- did not improve solve rate vs the no-features baseline
- arguably regressed (`features=all` lost 1 challenge to `features=none`)
- cost 55% more in dollars and took 15% more turns
The most plausible mechanism: the FP filter is suppressing real signal before it gets reported. Ironic for a "false positive reduction" pipeline.
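For reference, the dollar deltas implied by the white-box rows of the table can be checked with a quick stdlib sketch (no new data, just the arithmetic behind the "+55%" claim):

```python
# Dollar costs copied from the white-box rows of the table above.
costs = {
    "features=none": 7.99,           # baseline, no v0.6.0 moat
    "features=experimental": 11.99,
    "features=all": 12.41,           # v0.6.0 headline config
}
base = costs["features=none"]
# Percentage delta of each config over the no-features baseline.
deltas = {name: round((c / base - 1) * 100, 1) for name, c in costs.items()}
print(deltas)  # features=all comes out at +55.3% over the baseline
```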
Caveats
- N=13. One challenge swing = 7.7 percentage points. The flag delta could be statistical noise.
- The 14 challenges are by definition the stubborn ones — the slice where the baseline already failed and the FP filter has the least signal to work with.
- The cost and turn deltas are orders of magnitude bigger than the noise floor, so even if the flag delta is noise, the spend regression is real.
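The noise caveat can be quantified. A two-sided Fisher exact test on 3/13 vs 2/13, sketched here with the stdlib purely as an illustration (this is not part of the CI tooling), shows the solve-rate difference carries essentially no statistical signal:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for a 2x2 table [[a, b], [c, d]].

    Returns the probability, under independence, of a table at least as
    extreme (i.e. at most as likely) as the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of x "successes" in row 1.
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# features=none solved 3/13 (failed 10), features=all solved 2/13 (failed 11).
p = fisher_exact_two_sided(3, 10, 2, 11)
print(round(p, 3))  # 1.0 -- the one-challenge swing is indistinguishable from noise
```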
What's running now
Three limit=50 white-box ablation runs were dispatched at 10:50 UTC on 2026-04-06.
ETA ~6h. The 50-challenge slice has enough statistical power to make a real call on whether the moat helps or hurts.
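The power claim can be made concrete with a back-of-envelope stdlib sketch, assuming the baseline solve rate stays near the observed 3/13: the per-challenge swing shrinks from 7.7 pp to 2 pp at N=50, though the binomial standard error is still sizable.

```python
from math import sqrt

def noise_floor(n, p=3 / 13):
    """Per-challenge swing and binomial standard error, in percentage points.

    p defaults to the observed 3/13 baseline solve rate from the table above.
    """
    swing = 100 / n                    # pp moved by a single challenge
    se = 100 * sqrt(p * (1 - p) / n)   # binomial SE of the solve rate
    return round(swing, 1), round(se, 1)

for n in (13, 50):
    swing, se = noise_floor(n)
    print(f"n={n}: one-challenge swing = {swing} pp, SE ~ {se} pp")
```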
Decision tree after the limit=50 result lands
- If `features=all` ≥ `features=none` at limit=50 → the stubborn-14 result was noise, the moat is fine, document it carefully.
- If `features=all` < `features=none` at limit=50 → the moat is a regression. Two paths:
  - (a) identify which layer is the problem via single-feature ablation; remove or re-tune it
  - (b) publish the negative result openly. Reposition the moat as "structurally equivalent to commercial leaders, doesn't move solve rate at the studied scale, costs 55% more — here's the data, here's the methodology, this is what rigorous open-source security research looks like"
- If results are split → publish both honestly, propose a per-layer ablation as future work
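If path (a) is taken, the single-feature ablation is mechanical to enumerate. A minimal sketch, with placeholder layer names since this note doesn't list the real identifiers of the 11 FP-reduction layers:

```python
# Placeholder names -- stand-ins for the real v0.6.0 layer identifiers.
layers = [f"layer_{i:02d}" for i in range(1, 12)]

def leave_one_out(layers):
    """Yield (excluded_layer, remaining_layers) for each ablation config."""
    for excluded in layers:
        yield excluded, [l for l in layers if l != excluded]

configs = list(leave_one_out(layers))
print(len(configs))  # 11 runs, one per excluded layer
```

Each pair would map to one CI dispatch with the excluded layer disabled; comparing those 11 runs against `features=all` would localize which layer suppresses real signal.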
Why this matters for the joint paper
This is exactly the kind of empirical result the joint paper with @guanniqu would center on. "We A/B tested every layer of our own 11-layer FP reduction pipeline against a no-features baseline at N=50, and here's what worked vs what didn't" is more credible than any vendor's "95% accuracy" marketing claim.