Commit 579419e

SStas and claude committed
Update HumanEval p-value from 0.029 to 0.004 (full n=164 dataset)
The n=50 pilot showed p=0.029. The full-scale n=164 run confirmed stronger significance at p=0.004 via McNemar's exact test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d12ca8f commit 579419e
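
The commit message names McNemar's exact test but does not show the computation. The following is a minimal sketch of that test, assuming paired per-problem pass/fail results for the two modes being compared. The function names and the toy data are hypothetical; they are not taken from the repository, and the benchmark's actual discordant counts are not given in this commit.

```python
from math import comb


def discordant_counts(mode_a, mode_b):
    """Count pairs where exactly one mode passed (hypothetical helper).

    mode_a, mode_b: equal-length sequences of booleans, one per problem.
    Returns (b, c): b = A passed and B failed, c = B passed and A failed.
    """
    b = sum(1 for a, t in zip(mode_a, mode_b) if a and not t)
    c = sum(1 for a, t in zip(mode_a, mode_b) if t and not a)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either mode, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Toy data only, NOT the HumanEval results.
toy_a = [True, True, True, False, True, True]
toy_b = [False, True, False, False, False, True]
b, c = discordant_counts(toy_a, toy_b)
print(b, c, mcnemar_exact(b, c))  # prints: 3 0 0.25
```

Only the discordant pairs enter the test; problems that both modes pass or both fail do not affect the p-value. That may be part of why the p-value moved between the n=50 pilot and the n=164 run.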

2 files changed, +3 -3 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ answer = connector.generate(prompt, context=context)
 | **DebugBench** (Qwen 7B, n=100) | 50.0% | **51.0%** | 49.0% |
 | **GSM8K** (Llama 3B, n=200) | 75.0% | **78.0%** | 75.5% |
 
-+14.1pp on code generation vs text (p=0.029). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
++14.1pp on code generation vs text (p=0.004). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
 
 **Cross-model (zero training, 6 KB on the wire):**
 

docs/BENCHMARKS.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # AVP Benchmarks
 
-> **+8.6pp on code generation (p=0.029) · 46-78% fewer tokens · 2-4x faster** — 8 benchmarks, 5 models, 2 families.
+> **+8.6pp on code generation (p=0.004) · 46-78% fewer tokens · 2-4x faster** — 8 benchmarks, 5 models, 2 families.
 
 ---
 
@@ -14,7 +14,7 @@ Same-model latent transfer matches or improves accuracy on structured tasks. Tes
 |---|--------|--------------|------|
 | **HumanEval** (Qwen 7B, n=164) | 58.5% | **67.1%** | 53.0% |
 
-Latent vs text: p=0.029. Text chains introduce formatting noise that disrupts code structure.
+Latent vs text: p=0.004. Text chains introduce formatting noise that disrupts code structure.
 
 ### Math Reasoning
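
The two headline deltas in the changed lines (+14.1pp in README.md, +8.6pp in docs/BENCHMARKS.md) can be cross-checked against the HumanEval row shown in the hunk above. The sketch below does that arithmetic. The table header is not visible in this diff, so the code refers to the columns only by position, and the helper names are hypothetical.

```python
def pp_delta(a_pct, b_pct):
    """Difference between two accuracies in percentage points, to 1 decimal."""
    return round(a_pct - b_pct, 1)


def implied_passes(pct, n):
    """Nearest whole number of passing problems for an accuracy on n items."""
    return round(pct * n / 100)


# HumanEval row from the diff: three accuracy columns, n=164.
first, middle, last = 58.5, 67.1, 53.0
n = 164

print(pp_delta(middle, last))   # 14.1, matches the README figure
print(pp_delta(middle, first))  # 8.6, matches the BENCHMARKS.md headline

# Whole-number pass counts consistent with the reported percentages.
print([implied_passes(p, n) for p in (first, middle, last)])  # [96, 110, 87]
```

Under that reading, the README delta compares the middle column with the last one, and the BENCHMARKS.md headline compares it with the first, while both lines quote the same p-value.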