[benchmark] npm-bench repeat=3: get statistically meaningful FPR #121

@peaktwilight

The 2026-04-11 ablation showed that npm-bench FPR shifts by 0.08 between same-profile runs (batch 1 default = 0.19, batch 2 default = 0.11). With 27 safe packages, that is a 2-package flip, which is within LLM variance. The 'stable features cause FPR' finding from batch 1 did NOT replicate in batch 2.

We need repeat=3 per profile to separate signal from noise. The xbow-bench workflow already supports --repeat N with Wilson CI. npm-bench should get the same treatment.

Concrete ask:

  1. Add repeat input to npm-bench.yml (mirrors xbow-bench)
  2. Dispatch repeat=3 for all 5 profiles (none, no-triage, moat-only, moat, default)
  3. Compute per-profile FPR with 95% Wilson CI
  4. Report whether any profile is statistically significantly different from none

Refs #72 (the ablation that surfaced the noise problem)
