
Commit 7e8a7c7

Add retrieval quality check to pipeline benchmark: test queries with known answers validate full RAG pipeline
1 parent 3e19d26 commit 7e8a7c7

1 file changed (+42, -0)
benchmarks/COMPARISON_PLAN.md

@@ -238,6 +238,48 @@ python benchmarks/run_pipeline.py --extract --mock-upload
python benchmarks/run_pipeline.py --extract --upload
```

### Stage 5: Retrieval quality check (the real test)

The pipeline benchmark ends with a quality check that answers one question: "If I ask a question about the content I just crawled, do I get the right answer back?"

**How it works:**

1. After embedding, store all chunks and their vectors in memory.
2. Embed the 5 test queries with the same embedding model used for the chunks.
3. Compute the cosine similarity between each query vector and every chunk vector.
4. Check whether the top-3 most similar chunks include one from the correct source page.
5. Report the hit rate: "X/5 queries returned the correct page in the top 3."

**Test queries for FastAPI docs (example):**

| Query | Expected source page | What it tests |
|---|---|---|
| "How do I add authentication to a FastAPI endpoint?" | Security/OAuth2 tutorial page | Can it find conceptual content? |
| "What is the default response status code?" | Response model docs | Can it find specific technical details? |
| "How do I define query parameters?" | Query parameters tutorial | Can it find tutorial content? |
| "What Python types does FastAPI support for request bodies?" | Request body docs | Can it find reference content? |
| "How do I handle file uploads?" | File upload tutorial | Can it find procedural content? |
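In code, the table above can be carried as simple (query, expected-URL-fragment) pairs that the benchmark script iterates over. The URL fragments below are illustrative guesses at the relevant FastAPI docs paths, not verified crawl output; the real benchmark should substitute the paths actually discovered during the crawl.

```python
# Test queries paired with a fragment of the docs URL expected to appear
# among the top-3 retrieved chunks. Fragments are illustrative placeholders.
TEST_QUERIES = [
    ("How do I add authentication to a FastAPI endpoint?", "tutorial/security"),
    ("What is the default response status code?", "tutorial/response-model"),
    ("How do I define query parameters?", "tutorial/query-params"),
    ("What Python types does FastAPI support for request bodies?", "tutorial/body"),
    ("How do I handle file uploads?", "tutorial/request-files"),
]
```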

**Why this is the most important metric:**

Pages/second measures how fast the pipeline runs; retrieval accuracy measures whether it produces useful output. A crawler that is 10x faster but produces chunks that cannot answer questions is worthless for RAG. This single metric ("does retrieval work?") validates the entire pipeline: crawl quality, cleaning quality, chunk coherence, and embedding usefulness.

**No Supabase needed:** The similarity search runs in memory using NumPy, so the test is self-contained and reproducible.

```python
# Sketch of the retrieval test. embed() stands in for the real embedding
# call; all_chunks holds the url/vector records stored after embedding.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

hits = 0
for query, expected_url in test_queries:
    query_vec = embed(query)
    scores = [(cosine_similarity(query_vec, chunk.vec), chunk) for chunk in all_chunks]
    # Sort on the score alone; comparing the chunk objects on tied scores would raise
    top_3 = sorted(scores, key=lambda pair: pair[0], reverse=True)[:3]
    if any(expected_url in chunk.url for _, chunk in top_3):
        hits += 1

print(f"{hits}/{len(test_queries)} queries returned the correct page in the top 3")
```

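For larger crawls, the per-chunk Python loop can be collapsed into one matrix multiplication over row-normalized vectors. This is a sketch of the same in-memory check in vectorized NumPy form; the function and argument names are illustrative, not taken from the benchmark code.

```python
import numpy as np

def top_k_hits(query_vecs, chunk_matrix, chunk_urls, expected_urls, k=3):
    """Vectorized retrieval check.

    query_vecs:   (Q, D) array of query embeddings.
    chunk_matrix: (N, D) array of chunk embeddings.
    chunk_urls:   list of N source URLs, aligned with chunk_matrix rows.
    expected_urls: list of Q URL fragments, one per query.
    Returns (number of queries whose expected page appears in the top k, Q).
    """
    # Normalize rows so a plain dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    sims = q @ c.T                                     # (Q, N) similarity matrix
    top_k = np.argsort(sims, axis=1)[:, ::-1][:, :k]   # indices of the k best chunks
    hits = [
        any(expected in chunk_urls[i] for i in row)
        for row, expected in zip(top_k, expected_urls)
    ]
    return sum(hits), len(hits)
```

A hit rate of `hits/total` from this function is the "X/5" number the stage reports; ties between equal scores fall back to NumPy's sort order, which is fine for a coarse top-3 check.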
### Why this matters for positioning
No other single tool in the comparison offers this pipeline. The message isn't "we're faster at crawling"; it's "we're the only tool where `pip install markcrawl` gets you from URL to searchable vector database in 3 commands." The pipeline benchmark quantifies that value with real numbers.
