Update HumanEval p-value from 0.029 to 0.004 (full n=164 dataset)
The n=50 pilot showed p=0.029. The full-scale n=164 run confirmed
stronger significance at p=0.004 via McNemar's exact test.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
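For readers unfamiliar with the test named in the commit message, here is a minimal sketch of a two-sided exact McNemar test on paired pass/fail outcomes. It is not the project's evaluation code. The function name `mcnemar_exact` and the discordant counts in the usage lines are hypothetical, chosen only for illustration, and are not the HumanEval results.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: problems solved only under condition A (hypothetical label)
    c: problems solved only under condition B (hypothetical label)
    Pairs where both conditions agree do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip, so the smaller
    # count follows Binomial(n, 0.5). Sum the lower tail and double it.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Illustrative counts only, not taken from the benchmark.
p_value = mcnemar_exact(10, 2)
print(f"p = {p_value:.4f}")
```

Only the discordant pairs matter here, which is why a larger sample can sharpen the p-value even when the headline accuracy gap stays the same.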
-+14.1pp on code generation vs text (p=0.029). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
++14.1pp on code generation vs text (p=0.004). DebugBench is neutral across all modes, but you still save 47% of tokens and run 3x faster. All runs on NVIDIA A100.
**Cross-model (zero training, 6 KB on the wire):**