Update HumanEval p-value from 0.029 to 0.004 (full n=164 dataset)

SStas · claude · SStas · commit 579419eb8fa6 · 2026-03-08T07:04:40.000Z
The n=50 pilot showed p=0.029. The full-scale n=164 run confirmed
stronger significance at p=0.004 via McNemar's exact test.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;
diff --git a/README.md b/README.md
@@ -38,7 +38,7 @@ answer = connector.generate(prompt, context=context)
 | **DebugBench** (Qwen 7B, n=100) | 50.0% | **51.0%** | 49.0% |
 | **GSM8K** (Llama 3B, n=200) | 75.0% | **78.0%** | 75.5% |
 
-+14.1pp on code generation vs text (p=0.029). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
++14.1pp on code generation vs text (p=0.004). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
 
 **Cross-model (zero training, 6 KB on the wire):**
 
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
@@ -1,6 +1,6 @@
 # AVP Benchmarks
 
-> **+8.6pp on code generation (p=0.029) · 46-78% fewer tokens · 2-4x faster** — 8 benchmarks, 5 models, 2 families.
+> **+8.6pp on code generation (p=0.004) · 46-78% fewer tokens · 2-4x faster** — 8 benchmarks, 5 models, 2 families.
 
 ---
 
@@ -14,7 +14,7 @@ Same-model latent transfer matches or improves accuracy on structured tasks. Tes
 |---|--------|--------------|------|
 | **HumanEval** (Qwen 7B, n=164) | 58.5% | **67.1%** | 53.0% |
 
-Latent vs text: p=0.029. Text chains introduce formatting noise that disrupts code structure.
+Latent vs text: p=0.004. Text chains introduce formatting noise that disrupts code structure.
 
 ### Math Reasoning