This directory contains the public benchmark material for a simple but important question:
How much better is NEXO than a static CLAUDE.md-only setup on real recall-heavy workflows?
This is intentionally a small, reproducible operator benchmark, not a grand universal intelligence claim.
Baselines:
- `nexo_full_stack`
- `static_claude_md`
- `no_memory`
Primary outcomes:
- decision recall
- preference recall
- repeat-error avoidance
- interrupted-task resume
- related-context stitching
- contradiction handling
- temporal reasoning
- structured domain recall
- cross-client continuity
- outcome-loop usage
- prioritization quality

Protocol:
- Use one fixed model per run.
- Keep prompts identical across baselines.
- Use the same synthetic project history and scenario rubric.
- Score each scenario as `pass`, `partial`, or `fail`.
- Record the number of tool calls needed before the correct answer/action.
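The scoring protocol above can be sketched as a small grading record. This is a minimal illustration, not the project's actual harness: the record shape, helper names, and the choice to weight `partial` as 0.5 are all assumptions; the real rubric lives in the scenario definitions.

```python
from dataclasses import dataclass

# Allowed grades under the rubric: each scenario gets exactly one.
GRADES = {"pass", "partial", "fail"}


@dataclass
class ScenarioResult:
    """One graded scenario for one baseline (hypothetical record shape)."""
    scenario_id: str
    baseline: str   # e.g. "nexo_full_stack", "static_claude_md", "no_memory"
    grade: str      # "pass" | "partial" | "fail"
    tool_calls: int  # tool calls needed before the correct answer/action

    def __post_init__(self) -> None:
        if self.grade not in GRADES:
            raise ValueError(f"grade must be one of {GRADES}, got {self.grade!r}")


def pass_rate(results: list[ScenarioResult]) -> float:
    """Weighted pass rate; counting 'partial' as 0.5 is this sketch's assumption."""
    weights = {"pass": 1.0, "partial": 0.5, "fail": 0.0}
    return sum(weights[r.grade] for r in results) / len(results)


results = [
    ScenarioResult("decision-recall-01", "nexo_full_stack", "pass", 2),
    ScenarioResult("decision-recall-01", "no_memory", "fail", 5),
]
print(pass_rate(results))  # 0.5
```

Keeping `tool_calls` on the same record as the grade makes it easy to report both primary outcomes from a single pass over the run.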
- `scenarios/` contains the scenario definitions and expected outputs
- `results/` contains checked-in benchmark runs
- `runtime_pack/` contains the structured operator benchmark catalog, run files, and generated summary artifacts
- `locomo/` contains the larger checked-in LoCoMo benchmark harness
The first micro-benchmark is here:
This initial run is deliberately modest: five workflow scenarios, manual grading rubric, and a baseline comparison that answers a product question users actually ask.
The broader v5 foundation matrix is here:
That second run keeps the same three baselines, but widens the matrix into contradiction/freshness, temporal reasoning, structured recall, multi-session continuity, cross-client continuity, and the first outcome-loop / prioritization checks.
The structured runtime-pack artifacts generated from that run live here:
- `runtime_pack/scenario_catalog.json`
- `runtime_pack/results/2026-04-08-memory-recall-vs-static.json`
- `runtime_pack/results/latest_summary.json`
- `runtime_pack/README.md`
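As an illustration, checked-in run files like these can be summarized with a few lines of stdlib Python. The field names below (`results`, `baseline`, `grade`) are assumptions made for this sketch; consult `runtime_pack/README.md` for the actual schema.

```python
import json
from collections import Counter


def summarize(path: str) -> dict[str, Counter]:
    """Tally grades per baseline from a results JSON file.

    Assumes (hypothetically) a top-level "results" list whose entries
    carry "baseline" and "grade" fields; the real schema is documented
    in runtime_pack/README.md.
    """
    with open(path) as f:
        run = json.load(f)
    tally: dict[str, Counter] = {}
    for entry in run["results"]:
        tally.setdefault(entry["baseline"], Counter())[entry["grade"]] += 1
    return tally
```

Because the summary artifacts are plain JSON, this kind of ad-hoc tallying works without any project-specific tooling.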