What
The CI A/B agent ran a 5-way feature ablation on the same 14-challenge stubborn slice (XBOW). The result was unexpected and important enough to track in public.
| Configuration | Mode | Flags | Cost |
| --- | --- | --- | --- |
| `features=none` (no v0.6.0 moat) | white-box | 3/13 | $7.99 |
| `features=experimental` | white-box | 3/13 | $11.99 |
| `features=all` (v0.6.0 headline) | white-box | 2/13 | $12.41 |
| `features=all` | black-box | 0/13 | $8.42 |
Runs: 24022992439 24022991529 24022989979 24022990816
What this says
On the stubborn-14 slice, the v0.6.0 11-layer FP reduction moat:
- did not improve solve rate vs the no-features baseline
- arguably regressed (`features=all` lost 1 challenge to `features=none`)
- cost 55% more in dollars and took 15% more turns
The most plausible mechanism: the FP filter is suppressing real signal before it gets reported. Ironic for a "false positive reduction" pipeline.
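For reference, the dollar deltas implied by the white-box rows of the table can be checked with a quick stdlib sketch (no new data, just the arithmetic behind the "+55%" claim):

```python
# Dollar costs copied from the white-box rows of the table above.
costs = {
    "features=none": 7.99,           # baseline, no v0.6.0 moat
    "features=experimental": 11.99,
    "features=all": 12.41,           # v0.6.0 headline config
}
base = costs["features=none"]
# Percentage delta of each config over the no-features baseline.
deltas = {name: round((c / base - 1) * 100, 1) for name, c in costs.items()}
print(deltas)  # features=all comes out at +55.3% over the baseline
```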
Caveats
- N=13. One challenge swing = 7.7 percentage points. The flag delta could be statistical noise.
- The 14 challenges are by definition the stubborn ones — the slice where the baseline already failed and the FP filter has the least signal to work with.
- The cost and turn deltas are orders of magnitude bigger than the noise floor, so even if the flag delta is noise, the spend regression is real.
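The noise caveat can be quantified. A two-sided Fisher exact test on 3/13 vs 2/13, sketched here with the stdlib purely as an illustration (this is not part of the CI tooling), shows the solve-rate difference carries essentially no statistical signal:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for a 2x2 table [[a, b], [c, d]].

    Returns the probability, under independence, of a table at least as
    extreme (i.e. at most as likely) as the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of x "successes" in row 1.
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# features=none solved 3/13 (failed 10), features=all solved 2/13 (failed 11).
p = fisher_exact_two_sided(3, 10, 2, 11)
print(round(p, 3))  # 1.0 -- the one-challenge swing is indistinguishable from noise
```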
What's running now
Three limit=50 white-box ablation runs were dispatched at 10:50 UTC on 2026-04-06.
ETA ~6h. The 50-challenge slice has enough statistical power to make a real call on whether the moat helps or hurts.
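The power claim can be made concrete with a back-of-envelope stdlib sketch, assuming the baseline solve rate stays near the observed 3/13: the per-challenge swing shrinks from 7.7 pp to 2 pp at N=50, though the binomial standard error is still sizable.

```python
from math import sqrt

def noise_floor(n, p=3 / 13):
    """Per-challenge swing and binomial standard error, in percentage points.

    p defaults to the observed 3/13 baseline solve rate from the table above.
    """
    swing = 100 / n                    # pp moved by a single challenge
    se = 100 * sqrt(p * (1 - p) / n)   # binomial SE of the solve rate
    return round(swing, 1), round(se, 1)

for n in (13, 50):
    swing, se = noise_floor(n)
    print(f"n={n}: one-challenge swing = {swing} pp, SE ~ {se} pp")
```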
Decision tree after the limit=50 result lands
- If `features=all` ≥ `features=none` at limit=50 → the stubborn-14 result was noise, the moat is fine, document it carefully.
- If `features=all` < `features=none` at limit=50 → the moat is a regression. Two paths:
  - (a) identify which layer is the problem via single-feature ablation; remove or re-tune it
  - (b) publish the negative result openly. Reposition the moat as "structurally equivalent to commercial leaders, doesn't move solve rate at the studied scale, costs 55% more — here's the data, here's the methodology, this is what rigorous open-source security research looks like"
- If results are split → publish both honestly, propose a per-layer ablation as future work
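If path (a) is taken, the single-feature ablation is mechanical to enumerate. A minimal sketch, with placeholder layer names since this note doesn't list the real identifiers of the 11 FP-reduction layers:

```python
# Placeholder names -- stand-ins for the real v0.6.0 layer identifiers.
layers = [f"layer_{i:02d}" for i in range(1, 12)]

def leave_one_out(layers):
    """Yield (excluded_layer, remaining_layers) for each ablation config."""
    for excluded in layers:
        yield excluded, [l for l in layers if l != excluded]

configs = list(leave_one_out(layers))
print(len(configs))  # 11 runs, one per excluded layer
```

Each pair would map to one CI dispatch with the excluded layer disabled; comparing those 11 runs against `features=all` would localize which layer suppresses real signal.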
Why this matters for the joint paper
This is exactly the kind of empirical result the joint paper with @guanniqu would center on. "We A/B tested every layer of our own 11-layer FP reduction pipeline against a no-features baseline at N=50, and here's what worked vs what didn't" is more credible than any vendor's "95% accuracy" marketing claim.