Benchmark for measuring semantic code search quality of octocode's full pipeline: chunking, embedding, vector search, and reranking.
code.csv contains 127 queries with 1-3 annotated code locations each (100 standard + 27 hard).
Based on commit: b1771ba48214ce7404ad7158e277eb60680912b4
Config used: benchmark/config.toml — contextual retrieval enabled, Voyage reranker 2.5, RaBitQ quantization
- AI-generated candidates — queries and expected code/doc locations were generated by an LLM with full codebase context, covering all major modules and documentation files
- Source verification — every referenced file and line range was read and validated against the actual source to confirm that the code/docs at those lines answer the query
- Multi-agent validation — parallel validation agents independently checked line ranges, file paths, and relevance scores across all 254 queries
- Search-informed corrections — queries that the search missed were analyzed: if the search found a valid alternative location (same logic in a different file), it was added as a secondary result rather than removed
- Hard query design — 27 code queries and ~14 doc queries use natural language that deliberately avoids mirroring function names or section titles, testing semantic understanding over keyword matching
Format:
query,result1,result2,result3
Each result: src/path/file.rs:start_line-end_line:relevance
Relevance grades:
- 2 = primary (directly answers the query)
- 1 = secondary (related, useful context)
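For illustration, a row with two annotated locations might look like this (hypothetical query, paths, and line ranges, not actual benchmark entries):

```csv
how are chunks deduplicated,src/index/chunker.rs:120-160:2,src/store/hash.rs:40-55:1
```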
Matching uses line range overlap: a search result at lines 40-90 matches ground truth 45-92 if ranges intersect.
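The overlap test can be sketched in a few lines (illustrative, not the scorer's actual implementation):

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    # Two inclusive line ranges intersect iff neither ends before the other starts.
    return a_start <= b_end and b_start <= a_end

ranges_overlap(40, 90, 45, 92)  # the example above: True
```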
Hit@k is binary per query: did any correct result appear in the top k? Averaged across all queries.
A Hit@5 of 0.85 means 85% of queries had at least one relevant result in the top 5.
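A sketch of the computation, assuming each query's results have been reduced to a list of match/no-match booleans in rank order (hypothetical helper names):

```python
def hit_at_k(ranked_matches: list[bool], k: int) -> float:
    # 1.0 if any of the top-k results matched ground truth, else 0.0.
    return 1.0 if any(ranked_matches[:k]) else 0.0

def mean_hit_at_k(per_query_matches: list[list[bool]], k: int) -> float:
    # Average the per-query binary outcomes.
    return sum(hit_at_k(m, k) for m in per_query_matches) / len(per_query_matches)
```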
MRR (mean reciprocal rank) is the reciprocal of the rank of the first correct result, averaged across queries.
| First hit at rank | Score |
|---|---|
| 1 | 1.0 |
| 2 | 0.5 |
| 3 | 0.33 |
| 5 | 0.2 |
| Not found | 0.0 |
Measures how high the first relevant result appears.
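A sketch, again assuming each query's results are a list of match booleans in rank order (illustrative helper, not score.py's implementation):

```python
def reciprocal_rank(ranked_matches: list[bool]) -> float:
    # 1/rank of the first correct result; 0.0 if nothing matched.
    for rank, matched in enumerate(ranked_matches, start=1):
        if matched:
            return 1.0 / rank
    return 0.0

reciprocal_rank([False, True, True])  # first hit at rank 2: 0.5
```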
NDCG accounts for both relevance grades (2 vs 1) and position in the ranking:
DCG@k = sum( rel_i / log2(i + 1) ) for i = 1..k
IDCG@k = DCG of the ideal ranking (ground truth sorted by relevance desc)
NDCG = DCG / IDCG
A relevance=2 result at position 1 contributes more than a relevance=1 result at position 5. Measures whether the most relevant results are ranked highest.
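The two formulas above translate directly to code (a sketch with hypothetical helper names; positions are 1-based as in the formula):

```python
import math

def dcg(relevances: list[int]) -> float:
    # Each rel_i discounted by log2(i + 1), with i starting at 1.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg_at_k(result_rels: list[int], truth_rels: list[int], k: int) -> float:
    # Ideal ranking: ground-truth relevances sorted descending.
    ideal = dcg(sorted(truth_rels, reverse=True)[:k])
    return dcg(result_rels[:k]) / ideal if ideal > 0 else 0.0
```

With results scored [0, 2, 1] against ground truth [2, 1], `ndcg_at_k` returns about 0.67, matching the worked example below.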
Recall@k is the fraction of ground truth entries found in the top-k results.
If a query has 3 ground truth blocks and search finds 2 of them in the top 10:
Recall@10 = 2/3 = 0.67
Measures completeness: how many relevant blocks did we find?
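A sketch, counting how many ground-truth blocks any top-k result overlapped (hypothetical helper; ids stand in for ground-truth entries):

```python
def recall_at_k(matched_truth_ids: set[int], all_truth_ids: set[int]) -> float:
    # Fraction of ground-truth blocks that any top-k result overlapped.
    if not all_truth_ids:
        return 0.0
    return len(matched_truth_ids & all_truth_ids) / len(all_truth_ids)

recall_at_k({0, 2}, {0, 1, 2})  # the example above: 2/3
```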
Ground truth: fileA:10-50:2, fileB:20-30:1
Search returns: [fileC:1-10, fileA:30-60, fileB:25-35, ...]
- fileA overlaps (30-60 intersects 10-50) at rank 2, relevance=2
- fileB overlaps (25-35 intersects 20-30) at rank 3, relevance=1
| Metric | Calculation | Score |
|---|---|---|
| Hit@5 | found a match | 1 |
| MRR | first hit at rank 2 = 1/2 | 0.50 |
| DCG | 2/log2(3) + 1/log2(4) = 1.26 + 0.50 | 1.76 |
| IDCG | 2/log2(2) + 1/log2(3) = 2.00 + 0.63 | 2.63 |
| NDCG@10 | 1.76 / 2.63 | 0.67 |
| Recall@10 | 2 of 2 ground truth found | 1.00 |
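The table above can be reproduced numerically; a self-contained sketch (positions and relevances taken from this example, not from score.py):

```python
import math

# Per-rank relevance of the returned results against the ground truth:
# rank 1 (fileC) no match, rank 2 (fileA) rel=2, rank 3 (fileB) rel=1.
rels = [0, 2, 1]
truth = [2, 1]

hit_at_5 = 1.0 if any(r > 0 for r in rels[:5]) else 0.0
mrr = next((1.0 / rank for rank, r in enumerate(rels, 1) if r > 0), 0.0)
dcg = sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, 1))
idcg = sum(r / math.log2(rank + 1)
           for rank, r in enumerate(sorted(truth, reverse=True), 1))
ndcg = dcg / idcg
recall = 2 / 2  # both ground-truth blocks appear in the top 10

print(hit_at_5, round(mrr, 2), round(dcg, 2), round(idcg, 2), round(ndcg, 2), recall)
# → 1.0 0.5 1.76 2.63 0.67 1.0
```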
| File | Mode | Queries | What it tests |
|---|---|---|---|
| code.csv | code | 127 | Code search: functions, structs, logic blocks |
| docs.csv | docs | 127 | Doc search: markdown sections, config guides, architecture |
# Benchmark code search (default)
python3 benchmark/score.py --verbose
# Benchmark documentation search
python3 benchmark/score.py --mode docs --csv benchmark/docs.csv --verbose
# With custom settings
python3 benchmark/score.py --threshold 0.5 --max-results 10
# Quiet mode (summary only)
python3 benchmark/score.py

Exit code is 1 if Hit@5 drops below 0.70.
The 100 queries cover all major modules:
| Area | Queries |
|---|---|
| Code chunking/extraction | 12 |
| Markdown processing | 12 |
| Contextual enrichment | 5 |
| Differential indexing | 3 |
| Embedding generation | 5 |
| Store data structures | 10 |
| Store operations | 10 |
| Vector optimizer/table ops | 4 |
| Batch converter | 1 |
| GraphRAG types | 5 |
| GraphRAG relationships | 6 |
| GraphRAG database/utils | 7 |
| GraphRAG AI/builder | 6 |
| MCP server | 5 |
| LSP integration | 4 |
| LLM client | 2 |
| Search/rendering | 2 |
| File watcher | 1 |
| Config/storage/state | 7 |
| Subtotal (standard) | 100 |
| Hard queries | 27 |
| Total | 127 |
The last 27 queries (101-127) use natural language that doesn't mirror code comments or function names. They test semantic understanding rather than keyword matching:
| # | Query intent | Why it's hard |
|---|---|---|
| 101 | Preventing infinite loops in overlapping chunks | 5-line guard clause, no code keywords in query |
| 102 | Skipping indexing when repo unchanged | Buried 100+ lines into a 900-line function |
| 103 | Avoiding duplicate embeddings for unchanged code | Hash-based dedup flow across functions |
| 104 | Vector quantization compression ratios | Answer is in doc comments, not executable code |
| 105 | Consistent database paths across developers | Intent-based query about identity hashing |
| 106 | Minimum declarations before grouping | Specific constant + its usage site (2 lines) |
| 107 | Why some symbols are hidden in results | UX behavior question, small helper function |
| 108 | AI vs rule-based decision for code analysis | Decision logic, no direct keyword overlap |
| 109 | Filtering dissimilar search results | 7-line block within a 60-line method |
| 110 | Knowing which files to reindex after git commit | Multi-step git diff logic |
| 111 | Cleaning markdown fences from LLM JSON responses | Natural language, function name not in query |
| 112 | Behavior when embedding API keeps failing | Failure/retry path, not happy path |
| 113 | Skipping files unchanged on disk (mtime) | Performance optimization buried in main loop |
| 114 | Preventing duplicate graph nodes during rebuild | Dedup check within batch processing (25 lines) |
| 115 | Switching embedding model with different dimensions | Schema migration logic, not obvious location |
| 116 | Metadata not saved if flush fails | Atomicity pattern, answer in CRITICAL comment |
| 117 | Preventing concurrent reindexing in MCP server | AtomicBool + compare_exchange (concurrency) |
| 118 | Deduplicating results from multiple queries | Single function call, concept-level query |
| 119 | Non-code files handled as chunked text | Edge case handling, 6 lines in 900-line function |
| 120 | Forced flush after removing changed file blocks | Crash safety, explained only in code comment |
| 121 | Avoiding redundant table opens | Cache with double-check locking pattern (20 lines) |
| 122 | Reranker score to distance conversion | 2-line conversion in a map closure |
| 123 | Rough token estimation without tokenizer | Single line: s.len() / 4 |
| 124 | Additional delay before background reindex | 3 lines with timing rationale |
| 125 | Cleaning up files recently added to gitignore | 8 lines within cleanup loop |
| 126 | Similarity-to-distance threshold conversion | 6 lines, two separate locations |
| 127 | GraphRAG build from existing DB when no new files | Decision tree within indexing pipeline |
If standard queries score near 1.0 while hard queries score significantly lower, the benchmark is working as intended: the hard set exposes real retrieval weaknesses.