Benchmark for measuring semantic code search quality of octocode's full pipeline: chunking, embedding, vector search, and reranking.
code.csv contains 127 queries with 1-3 annotated code locations each (100 standard + 27 hard).
Based on commit: b1771ba48214ce7404ad7158e277eb60680912b4
Config used: benchmark/config.toml — contextual retrieval enabled, Voyage reranker 2.5, RaBitQ quantization
- AI-generated candidates — queries and expected code/doc locations were generated by an LLM with full codebase context, covering all major modules and documentation files
- Source verification — every referenced file and line range was read and validated against the actual source to confirm that the code/docs at those lines answer the query
- Multi-agent validation — parallel validation agents independently checked line ranges, file paths, and relevance scores across all 254 queries
- Search-informed corrections — queries that the search missed were analyzed: if the search found a valid alternative location (same logic in a different file), it was added as a secondary result rather than removed
- Hard query design — 27 code queries and ~14 doc queries use natural language that deliberately avoids mirroring function names or section titles, testing semantic understanding over keyword matching
Format:
query,result1,result2,result3
Each result: src/path/file.rs:start_line-end_line:relevance
Relevance grades:
- 2 = primary (directly answers the query)
- 1 = secondary (related, useful context)
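For illustration, a row with two annotated locations might look like this (hypothetical query, paths, and line ranges, not actual benchmark entries):

```csv
how are chunks deduplicated,src/index/chunker.rs:120-160:2,src/store/hash.rs:40-55:1
```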
Matching uses line range overlap: a search result at lines 40-90 matches ground truth 45-92 if ranges intersect.
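The overlap test can be sketched in a few lines (illustrative, not the scorer's actual implementation):

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    # Two inclusive line ranges intersect iff neither ends before the other starts.
    return a_start <= b_end and b_start <= a_end

ranges_overlap(40, 90, 45, 92)  # the example above: True
```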
Hit@k is binary per query: did any correct result appear in the top k? Averaged across all queries.
A Hit@5 of 0.85 means 85% of queries had at least one relevant result in the top 5.
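A sketch of the computation, assuming each query's results have been reduced to a list of match/no-match booleans in rank order (hypothetical helper names):

```python
def hit_at_k(ranked_matches: list[bool], k: int) -> float:
    # 1.0 if any of the top-k results matched ground truth, else 0.0.
    return 1.0 if any(ranked_matches[:k]) else 0.0

def mean_hit_at_k(per_query_matches: list[list[bool]], k: int) -> float:
    # Average the per-query binary outcomes.
    return sum(hit_at_k(m, k) for m in per_query_matches) / len(per_query_matches)
```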
MRR (mean reciprocal rank) is the reciprocal of the rank of the first correct result, averaged across queries.
| First hit at rank | Score |
|---|---|
| 1 | 1.0 |
| 2 | 0.5 |
| 3 | 0.33 |
| 5 | 0.2 |
| Not found | 0.0 |
Measures how high the first relevant result appears.
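A sketch, again assuming each query's results are a list of match booleans in rank order (illustrative helper, not score.py's implementation):

```python
def reciprocal_rank(ranked_matches: list[bool]) -> float:
    # 1/rank of the first correct result; 0.0 if nothing matched.
    for rank, matched in enumerate(ranked_matches, start=1):
        if matched:
            return 1.0 / rank
    return 0.0

reciprocal_rank([False, True, True])  # first hit at rank 2: 0.5
```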
NDCG accounts for both relevance grades (2 vs 1) and position in the ranking:
DCG@k = sum( rel_i / log2(i + 1) ) for i = 1..k
IDCG@k = DCG of the ideal ranking (ground truth sorted by relevance desc)
NDCG = DCG / IDCG
A relevance=2 result at position 1 contributes more than a relevance=1 result at position 5. Measures whether the most relevant results are ranked highest.
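The two formulas above translate directly to code (a sketch with hypothetical helper names; positions are 1-based as in the formula):

```python
import math

def dcg(relevances: list[int]) -> float:
    # Each rel_i discounted by log2(i + 1), with i starting at 1.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg_at_k(result_rels: list[int], truth_rels: list[int], k: int) -> float:
    # Ideal ranking: ground-truth relevances sorted descending.
    ideal = dcg(sorted(truth_rels, reverse=True)[:k])
    return dcg(result_rels[:k]) / ideal if ideal > 0 else 0.0
```

With results scored [0, 2, 1] against ground truth [2, 1], `ndcg_at_k` returns about 0.67, matching the worked example below.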
Recall@k is the fraction of ground truth entries found in the top-k results.
If a query has 3 ground truth blocks and search finds 2 of them in the top 10:
Recall@10 = 2/3 = 0.67
Measures completeness: how many relevant blocks did we find?
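A sketch, counting how many ground-truth blocks any top-k result overlapped (hypothetical helper; ids stand in for ground-truth entries):

```python
def recall_at_k(matched_truth_ids: set[int], all_truth_ids: set[int]) -> float:
    # Fraction of ground-truth blocks that any top-k result overlapped.
    if not all_truth_ids:
        return 0.0
    return len(matched_truth_ids & all_truth_ids) / len(all_truth_ids)

recall_at_k({0, 2}, {0, 1, 2})  # the example above: 2/3
```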
Ground truth: fileA:10-50:2, fileB:20-30:1
Search returns: [fileC:1-10, fileA:30-60, fileB:25-35, ...]
- fileA overlaps (30-60 intersects 10-50) at rank 2, relevance=2
- fileB overlaps (25-35 intersects 20-30) at rank 3, relevance=1
| Metric | Calculation | Score |
|---|---|---|
| Hit@5 | found a match | 1 |
| MRR | first hit at rank 2 = 1/2 | 0.50 |
| DCG | 2/log2(3) + 1/log2(4) = 1.26 + 0.50 | 1.76 |
| IDCG | 2/log2(2) + 1/log2(3) = 2.00 + 0.63 | 2.63 |
| NDCG@10 | 1.76 / 2.63 | 0.67 |
| Recall@10 | 2 of 2 ground truth found | 1.00 |
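The table above can be reproduced numerically; a self-contained sketch (positions and relevances taken from this example, not from score.py):

```python
import math

# Per-rank relevance of the returned results against the ground truth:
# rank 1 (fileC) no match, rank 2 (fileA) rel=2, rank 3 (fileB) rel=1.
rels = [0, 2, 1]
truth = [2, 1]

hit_at_5 = 1.0 if any(r > 0 for r in rels[:5]) else 0.0
mrr = next((1.0 / rank for rank, r in enumerate(rels, 1) if r > 0), 0.0)
dcg = sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, 1))
idcg = sum(r / math.log2(rank + 1)
           for rank, r in enumerate(sorted(truth, reverse=True), 1))
ndcg = dcg / idcg
recall = 2 / 2  # both ground-truth blocks appear in the top 10

print(hit_at_5, round(mrr, 2), round(dcg, 2), round(idcg, 2), round(ndcg, 2), recall)
# → 1.0 0.5 1.76 2.63 0.67 1.0
```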
| File | Mode | Queries | What it tests |
|---|---|---|---|
| code.csv | code | 127 | Code search: functions, structs, logic blocks |
| docs.csv | docs | 127 | Doc search: markdown sections, config guides, architecture |
# Benchmark code search (default)
python3 benchmark/score.py --verbose
# Benchmark documentation search
python3 benchmark/score.py --mode docs --csv benchmark/docs.csv --verbose
# With custom settings
python3 benchmark/score.py --threshold 0.5 --max-results 10
# Quiet mode (summary only)
python3 benchmark/score.py

Exit code is 1 if Hit@5 drops below 0.70.
The 100 queries cover all major modules:
| Area | Queries |
|---|---|
| Code chunking/extraction | 12 |
| Markdown processing | 12 |
| Contextual enrichment | 5 |
| Differential indexing | 3 |
| Embedding generation | 5 |
| Store data structures | 10 |
| Store operations | 10 |
| Vector optimizer/table ops | 4 |
| Batch converter | 1 |
| GraphRAG types | 5 |
| GraphRAG relationships | 6 |
| GraphRAG database/utils | 7 |
| GraphRAG AI/builder | 6 |
| MCP server | 5 |
| LSP integration | 4 |
| LLM client | 2 |
| Search/rendering | 2 |
| File watcher | 1 |
| Config/storage/state | 7 |
| Subtotal (standard) | 100 |
| Hard queries | 27 |
| Total | 127 |
The last 27 queries (101-127) use natural language that doesn't mirror code comments or function names. They test semantic understanding rather than keyword matching:
| # | Query intent | Why it's hard |
|---|---|---|
| 101 | Preventing infinite loops in overlapping chunks | 5-line guard clause, no code keywords in query |
| 102 | Skipping indexing when repo unchanged | Buried 100+ lines into a 900-line function |
| 103 | Avoiding duplicate embeddings for unchanged code | Hash-based dedup flow across functions |
| 104 | Vector quantization compression ratios | Answer is in doc comments, not executable code |
| 105 | Consistent database paths across developers | Intent-based query about identity hashing |
| 106 | Minimum declarations before grouping | Specific constant + its usage site (2 lines) |
| 107 | Why some symbols are hidden in results | UX behavior question, small helper function |
| 108 | AI vs rule-based decision for code analysis | Decision logic, no direct keyword overlap |
| 109 | Filtering dissimilar search results | 7-line block within a 60-line method |
| 110 | Knowing which files to reindex after git commit | Multi-step git diff logic |
| 111 | Cleaning markdown fences from LLM JSON responses | Natural language, function name not in query |
| 112 | Behavior when embedding API keeps failing | Failure/retry path, not happy path |
| 113 | Skipping files unchanged on disk (mtime) | Performance optimization buried in main loop |
| 114 | Preventing duplicate graph nodes during rebuild | Dedup check within batch processing (25 lines) |
| 115 | Switching embedding model with different dimensions | Schema migration logic, not obvious location |
| 116 | Metadata not saved if flush fails | Atomicity pattern, answer in CRITICAL comment |
| 117 | Preventing concurrent reindexing in MCP server | AtomicBool + compare_exchange (concurrency) |
| 118 | Deduplicating results from multiple queries | Single function call, concept-level query |
| 119 | Non-code files handled as chunked text | Edge case handling, 6 lines in 900-line function |
| 120 | Forced flush after removing changed file blocks | Crash safety, explained only in code comment |
| 121 | Avoiding redundant table opens | Cache with double-check locking pattern (20 lines) |
| 122 | Reranker score to distance conversion | 2-line conversion in a map closure |
| 123 | Rough token estimation without tokenizer | Single line: s.len() / 4 |
| 124 | Additional delay before background reindex | 3 lines with timing rationale |
| 125 | Cleaning up files recently added to gitignore | 8 lines within cleanup loop |
| 126 | Similarity-to-distance threshold conversion | 6 lines, two separate locations |
| 127 | GraphRAG build from existing DB when no new files | Decision tree within indexing pipeline |
If standard queries score near 1.0 while hard queries score significantly lower, the benchmark is working as intended: the hard set exposes real retrieval weaknesses.