Retrieval Quality Benchmark

Benchmark for measuring semantic code search quality of octocode's full pipeline: chunking, embedding, vector search, and reranking.

Ground Truth

code.csv contains 127 queries with 1-3 annotated code locations each (100 standard + 27 hard).

Based on commit: b1771ba48214ce7404ad7158e277eb60680912b4
Config used: benchmark/config.toml (contextual retrieval enabled, Voyage reranker 2.5, RaBitQ quantization)

How ground truth was created

  1. AI-generated candidates — queries and expected code/doc locations were generated by an LLM with full codebase context, covering all major modules and documentation files
  2. Source verification — every referenced file and line range was read and validated against the actual source to confirm the code/docs at those lines answer the query
  3. Multi-agent validation — parallel validation agents independently checked line ranges, file paths, and relevance scores across all 254 queries
  4. Search-informed corrections — queries that the search missed were analyzed: if the search found a valid alternative location (same logic in a different file), it was added as a secondary result rather than removed
  5. Hard query design — 27 code queries and ~14 doc queries use natural language that deliberately avoids mirroring function names or section titles, testing semantic understanding over keyword matching

Format:

query,result1,result2,result3

Each result: src/path/file.rs:start_line-end_line:relevance

  • Relevance 2 = primary (directly answers the query)
  • Relevance 1 = secondary (related, useful context)
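The result format above can be split into its fields with a few lines of Python. This is a sketch based solely on the format described here; `parse_entry` is an illustrative name, not a function from score.py:

```python
# Sketch: split a ground-truth entry "path:start-end:relevance" into fields.
# rsplit from the right takes the last two ":"-separated fields, so the
# path portion is never split even if it were to contain a colon.
def parse_entry(entry: str) -> tuple[str, int, int, int]:
    path, line_range, relevance = entry.rsplit(":", 2)
    start, end = line_range.split("-")
    return path, int(start), int(end), int(relevance)

print(parse_entry("src/store/mod.rs:45-92:2"))
# → ('src/store/mod.rs', 45, 92, 2)
```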

Matching uses line range overlap: a search result at lines 40-90 matches ground truth 45-92 if ranges intersect.
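In code, the overlap rule reduces to a two-comparison predicate (a sketch of the rule as stated, not the score.py implementation):

```python
# Two inclusive line ranges intersect iff each one starts before the other ends.
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start <= b_end and b_start <= a_end

# The example above: a result at 40-90 matches ground truth 45-92.
print(ranges_overlap(40, 90, 45, 92))  # → True
print(ranges_overlap(1, 10, 45, 92))   # → False
```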

Metrics

Hit@k

Binary per query: did ANY correct result appear in the top-k? Averaged across all queries.

A Hit@5 of 0.85 means 85% of queries had at least one relevant result in the top 5.
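As a sketch (assuming each query's results are reduced to a ranked list of correct/incorrect flags):

```python
def hit_at_k(ranked_correct: list[bool], k: int) -> int:
    # 1 if any of the top-k ranked results is correct, else 0
    return int(any(ranked_correct[:k]))

# Averaged across queries: here 2 of 3 queries have a hit in the top 5.
per_query = [[False, True, False], [False, False], [True]]
print(round(sum(hit_at_k(q, 5) for q in per_query) / len(per_query), 2))  # → 0.67
```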

MRR (Mean Reciprocal Rank)

Reciprocal of the rank of the first correct result, averaged across queries.

| First hit at rank | Score |
|---|---|
| 1 | 1.0 |
| 2 | 0.5 |
| 3 | 0.33 |
| 5 | 0.2 |
| Not found | 0.0 |

Measures how high the first relevant result appears.
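A minimal sketch of the per-query reciprocal rank, matching the table above:

```python
def reciprocal_rank(ranked_correct: list[bool]) -> float:
    # 1/rank of the first correct result; 0.0 if nothing correct was returned
    for rank, correct in enumerate(ranked_correct, start=1):
        if correct:
            return 1.0 / rank
    return 0.0

print(reciprocal_rank([False, True]))           # first hit at rank 2 → 0.5
print(reciprocal_rank([False, False, False]))   # not found → 0.0
```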

NDCG@10 (Normalized Discounted Cumulative Gain)

Accounts for both relevance grades (2 vs 1) and position in the ranking:

DCG@k  = sum( rel_i / log2(i + 1) )   for i = 1..k
IDCG@k = DCG of the ideal ranking (ground truth sorted by relevance desc)
NDCG   = DCG / IDCG

A relevance=2 result at position 1 contributes more than a relevance=1 result at position 5. Measures whether the most relevant results are ranked highest.
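The formulas translate directly (a sketch, with the ideal ranking built by sorting the ground-truth grades in descending order):

```python
import math

def dcg(relevances: list[int]) -> float:
    # sum of rel_i / log2(i + 1), positions i starting at 1
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))

def ndcg_at_k(ranked_rels: list[int], truth_rels: list[int], k: int) -> float:
    ideal = dcg(sorted(truth_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

# Ground truth grades {2, 1}: the ideal order scores 1.0, while pushing
# the relevance=2 result down to rank 5 costs roughly a third of the score.
print(round(ndcg_at_k([2, 1], [2, 1], 10), 2))              # → 1.0
print(round(ndcg_at_k([1, 0, 0, 0, 2], [2, 1], 10), 2))     # → 0.67
```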

Recall@k

Fraction of ground truth entries found in the top-k results.

If a query has 3 ground truth blocks and search finds 2 of them in the top 10:

Recall@10 = 2/3 = 0.67

Measures completeness: how many relevant blocks did we find?

Worked Example

Ground truth: fileA:10-50:2, fileB:20-30:1
Search returns: [fileC:1-10, fileA:30-60, fileB:25-35, ...]

  • fileA overlaps (30-60 intersects 10-50) at rank 2, relevance=2
  • fileB overlaps (25-35 intersects 20-30) at rank 3, relevance=1
| Metric | Calculation | Score |
|---|---|---|
| Hit@5 | found a match | 1 |
| MRR | first hit at rank 2 = 1/2 | 0.50 |
| DCG | 2/log2(3) + 1/log2(4) = 1.26 + 0.50 | 1.76 |
| IDCG | 2/log2(2) + 1/log2(3) = 2.00 + 0.63 | 2.63 |
| NDCG@10 | 1.76 / 2.63 | 0.67 |
| Recall@10 | 2 of 2 ground truth found | 1.00 |
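These numbers can be reproduced directly from the DCG formula. Positions 1-3 hold a miss, the fileA hit (relevance 2), and the fileB hit (relevance 1):

```python
import math

ranked = [0, 2, 1]   # relevance by rank: rank 1 is a miss (fileC)
truth = [2, 1]       # ground-truth relevance grades

dcg = sum(r / math.log2(i + 1) for i, r in enumerate(ranked, start=1))
idcg = sum(r / math.log2(i + 1)
           for i, r in enumerate(sorted(truth, reverse=True), start=1))
mrr = 1 / 2          # first correct result at rank 2

print(round(dcg, 2), round(idcg, 2), round(dcg / idcg, 2), mrr)
# → 1.76 2.63 0.67 0.5
```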

Ground Truth Files

| File | Mode | Queries | What it tests |
|---|---|---|---|
| code.csv | code | 127 | Code search: functions, structs, logic blocks |
| docs.csv | docs | 127 | Doc search: markdown sections, config guides, architecture |

Usage

# Benchmark code search (default)
python3 benchmark/score.py --verbose

# Benchmark documentation search
python3 benchmark/score.py --mode docs --csv benchmark/docs.csv --verbose

# With custom settings
python3 benchmark/score.py --threshold 0.5 --max-results 10

# Quiet mode (summary only)
python3 benchmark/score.py

Exit code is 1 if Hit@5 drops below 0.70.

Coverage

The 100 standard queries cover all major modules:

| Area | Queries |
|---|---|
| Code chunking/extraction | 12 |
| Markdown processing | 12 |
| Contextual enrichment | 5 |
| Differential indexing | 3 |
| Embedding generation | 5 |
| Store data structures | 10 |
| Store operations | 10 |
| Vector optimizer/table ops | 4 |
| Batch converter | 1 |
| GraphRAG types | 5 |
| GraphRAG relationships | 6 |
| GraphRAG database/utils | 7 |
| GraphRAG AI/builder | 6 |
| MCP server | 5 |
| LSP integration | 4 |
| LLM client | 2 |
| Search/rendering | 2 |
| File watcher | 1 |
| Config/storage/state | 7 |
| Subtotal (standard) | 100 |
| Hard queries | 27 |
| Total | 127 |

Hard Queries

The last 27 queries (101-127) use natural language that doesn't mirror code comments or function names. They test semantic understanding rather than keyword matching:

| # | Query intent | Why it's hard |
|---|---|---|
| 101 | Preventing infinite loops in overlapping chunks | 5-line guard clause, no code keywords in query |
| 102 | Skipping indexing when repo unchanged | Buried 100+ lines into a 900-line function |
| 103 | Avoiding duplicate embeddings for unchanged code | Hash-based dedup flow across functions |
| 104 | Vector quantization compression ratios | Answer is in doc comments, not executable code |
| 105 | Consistent database paths across developers | Intent-based query about identity hashing |
| 106 | Minimum declarations before grouping | Specific constant + its usage site (2 lines) |
| 107 | Why some symbols are hidden in results | UX behavior question, small helper function |
| 108 | AI vs rule-based decision for code analysis | Decision logic, no direct keyword overlap |
| 109 | Filtering dissimilar search results | 7-line block within a 60-line method |
| 110 | Knowing which files to reindex after git commit | Multi-step git diff logic |
| 111 | Cleaning markdown fences from LLM JSON responses | Natural language, function name not in query |
| 112 | Behavior when embedding API keeps failing | Failure/retry path, not happy path |
| 113 | Skipping files unchanged on disk (mtime) | Performance optimization buried in main loop |
| 114 | Preventing duplicate graph nodes during rebuild | Dedup check within batch processing (25 lines) |
| 115 | Switching embedding model with different dimensions | Schema migration logic, not obvious location |
| 116 | Metadata not saved if flush fails | Atomicity pattern, answer in CRITICAL comment |
| 117 | Preventing concurrent reindexing in MCP server | AtomicBool + compare_exchange (concurrency) |
| 118 | Deduplicating results from multiple queries | Single function call, concept-level query |
| 119 | Non-code files handled as chunked text | Edge case handling, 6 lines in 900-line function |
| 120 | Forced flush after removing changed file blocks | Crash safety, explained only in code comment |
| 121 | Avoiding redundant table opens | Cache with double-check locking pattern (20 lines) |
| 122 | Reranker score to distance conversion | 2-line conversion in a map closure |
| 123 | Rough token estimation without tokenizer | Single line: s.len() / 4 |
| 124 | Additional delay before background reindex | 3 lines with timing rationale |
| 125 | Cleaning up files recently added to gitignore | 8 lines within cleanup loop |
| 126 | Similarity-to-distance threshold conversion | 6 lines, two separate locations |
| 127 | GraphRAG build from existing DB when no new files | Decision tree within indexing pipeline |

If the standard queries score ~1.0 but hard queries score significantly lower, the benchmark is working correctly and exposes real retrieval weaknesses.