
v0.6.0 FP moat ablation: 3 vs 3 vs 2 on stubborn 14 — limit=50 re-run plan #72

@peaktwilight

What

The CI A/B agent ran a 5-way feature ablation on the same 14-challenge stubborn slice (XBOW). The result was unexpected and important enough to track in public.

| Configuration | Mode | Flags | Cost |
| --- | --- | --- | --- |
| `features=none` (no v0.6.0 moat) | white-box | 3/13 | $7.99 |
| `features=experimental` | white-box | 3/13 | $11.99 |
| `features=all` (v0.6.0 headline) | white-box | 2/13 | $12.41 |
| `features=all` | black-box | 0/13 | $8.42 |

Runs: 24022992439 24022991529 24022989979 24022990816

What this says

On the stubborn-14 slice, the v0.6.0 11-layer FP reduction moat:

  • did not improve solve rate vs the no-features baseline
  • arguably regressed (features=all lost 1 challenge to features=none)
  • cost 55% more in dollars and used 15% more turns
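The dollar delta follows directly from the table above (the turn counts are not in the table, so only the cost figure is checked here):

```python
# Cost delta between features=all and features=none (white-box), from the table.
cost_none = 7.99   # features=none, white-box
cost_all = 12.41   # features=all, white-box

delta_pct = (cost_all / cost_none - 1) * 100
print(f"features=all costs {delta_pct:.0f}% more")  # → 55% more
```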

The most plausible mechanism: the FP filter is suppressing real signal before it gets reported. Ironic for a "false positive reduction" pipeline.

Caveats

  • N=13. One challenge swing = 7.7 percentage points. The flag delta could be statistical noise.
  • The 14 challenges are by definition the stubborn ones — the slice where the baseline already failed and the FP filter has the least signal to work with.
  • The cost and turn deltas are orders of magnitude bigger than the noise floor, so even if the flag delta is noise, the spend regression is real.
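To make the noise caveat concrete, here is a minimal sketch (stdlib only; the choice of Wilson score intervals is mine, not from the issue) showing that the 95% intervals for 3/13 and 2/13 overlap almost entirely:

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# features=none solved 3/13, features=all solved 2/13 on the stubborn slice
none_lo, none_hi = wilson_ci(3, 13)   # ≈ (0.08, 0.50)
all_lo, all_hi = wilson_ci(2, 13)     # ≈ (0.04, 0.42)
print(f"intervals overlap: {all_lo < none_hi and none_lo < all_hi}")
```

The heavy overlap is the quantitative version of "one challenge swing could be noise."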

What's running now

Three limit=50 white-box ablation runs dispatched at 10:50 UTC 2026-04-06:

ETA ~6h. The 50-challenge slice has enough statistical power to make a real call on whether the moat helps or hurts.
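For scale (simple arithmetic, not stated in the issue): one challenge is worth ~7.7 pp of solve rate at N=13 but only 2 pp at N=50, so a same-size flag delta at limit=50 is far harder to dismiss as noise:

```python
# Smallest non-zero solve-rate swing (one challenge) at each slice size.
for n in (13, 50):
    print(f"N={n}: one challenge = {100 / n:.1f} pp")
```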

Decision tree after the limit=50 result lands

  • If features=all ≥ features=none at limit=50 → the stubborn-14 result was noise, the moat is fine, document it carefully.
  • If features=all < features=none at limit=50 → the moat is a regression. Two paths:
    • (a) Identify which layer is the problem via single-feature ablation; remove or re-tune it
    • (b) Publish the negative result openly. Reposition the moat as "structurally equivalent to commercial leaders, doesn't move solve rate at the studied scale, costs 55% more — here's the data, here's the methodology, this is what rigorous open-source security research looks like"
  • If results are split → publish both honestly, propose a per-layer ablation as future work

Why this matters for the joint paper

This is exactly the kind of empirical result the joint paper with @guanniqu would center on. "We A/B tested every layer of our own 11-layer FP reduction pipeline against a no-features baseline at N=50, and here's what worked vs what didn't" is more credible than any vendor's "95% accuracy" marketing claim.
