The 2026-04-11 ablation showed that npm-bench FPR swings ±0.08 between same-profile runs (batch 1 default = 0.19, batch 2 default = 0.11). At 27 safe packages, that's a 2-package flip — within LLM variance. The 'stable features cause FPR' finding from batch 1 did NOT replicate in batch 2.
We need `repeat=3` per profile to separate signal from noise. The xbow-bench workflow already supports `--repeat N` with a Wilson CI; npm-bench should get the same treatment.
Concrete ask:
- Add a `repeat` input to npm-bench.yml (mirrors xbow-bench)
- Dispatch `repeat=3` for all 5 profiles (none, no-triage, moat-only, moat, default)
- Compute per-profile FPR with 95% Wilson CI
- Report whether any profile is statistically significantly different from `none`
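
For reference, the per-profile CI computation is straightforward. A minimal sketch (function names are illustrative, not the actual xbow-bench implementation): pool the false positives across the 3 repeats (27 safe packages × 3 = 81 trials per profile), compute the Wilson score interval, and treat non-overlapping intervals as a conservative significance check against `none`:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

def differs(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """Conservative check: non-overlapping CIs imply a significant difference."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]

# Hypothetical numbers: 12 FPs over 81 trials for `default`, 3/81 for `none`
ci_default = wilson_ci(12, 81)
ci_none = wilson_ci(3, 81)
```

Pooling across repeats is what makes the ±0.08 single-run swing tractable: at n=81 the Wilson interval is roughly ±0.07 wide around a 0.15 FPR, versus ±0.13 at n=27.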
Refs #72 (the ablation that surfaced the noise problem)