The 2026-04-11 ablation showed that npm-bench FPR swings ±0.08 between same-profile runs (batch 1 default = 0.19, batch 2 default = 0.11). At 27 safe packages, that's a 2-package flip — within LLM variance. The 'stable features cause FPR' finding from batch 1 did NOT replicate in batch 2.
We need `repeat=3` per profile to separate signal from noise. The xbow-bench workflow already supports `--repeat N` with a Wilson CI; npm-bench should get the same treatment.
Concrete ask:
- Add a `repeat` input to npm-bench.yml (mirrors xbow-bench)
- Dispatch `repeat=3` for all 5 profiles (none, no-triage, moat-only, moat, default)
- Compute per-profile FPR with 95% Wilson CI
- Report whether any profile is statistically significantly different from `none`
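
For reference, the per-profile CI computation is straightforward. A minimal sketch (function names are illustrative, not the actual xbow-bench implementation): pool the false positives across the 3 repeats (27 safe packages × 3 = 81 trials per profile), compute the Wilson score interval, and treat non-overlapping intervals as a conservative significance check against `none`:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

def differs(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """Conservative check: non-overlapping CIs imply a significant difference."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]

# Hypothetical numbers: 12 FPs over 81 trials for `default`, 3/81 for `none`
ci_default = wilson_ci(12, 81)
ci_none = wilson_ci(3, 81)
```

Pooling across repeats is what makes the ±0.08 single-run swing tractable: at n=81 the Wilson interval is roughly ±0.07 wide around a 0.15 FPR, versus ±0.13 at n=27.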
Refs #72 (the ablation that surfaced the noise problem)