### Stage 5: Retrieval quality check (the real test)
The pipeline benchmark ends with a quality check that answers one question: "If I ask a question about the content I just crawled, do I get the right answer back?"
**How it works:**
1. After embedding, store all chunks + vectors in memory
2. Embed 5 test queries using the same embedding model
3. Compute cosine similarity between each query and all chunks
4. Check if the top-3 most similar chunks contain the correct source page
5. Report hit rate: "X/5 queries returned the correct page in top 3"
**Test queries for FastAPI docs (example):**
| Query | Expected source page | What it tests |
|---|---|---|
| "How do I add authentication to a FastAPI endpoint?" | Security/OAuth2 tutorial page | Can it find conceptual content? |
| "What is the default response status code?" | Response model docs | Can it find specific technical details? |
| "How do I define query parameters?" | Query parameters tutorial | Can it find tutorial content? |
| "What Python types does FastAPI support for request bodies?" | Request body docs | Can it find reference content? |
| "How do I handle file uploads?" | File upload tutorial | Can it find procedural content? |
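In code, the table above could be encoded as a list of (query, expected-URL-substring) pairs. A minimal sketch — the URL paths here are illustrative guesses at the FastAPI docs layout, not canonical values from the benchmark:

```python
# Hypothetical test set: (query, substring expected in a top-3 chunk's source URL).
# The paths mirror the FastAPI tutorial structure and are assumptions, not fixed values.
test_queries = [
    ("How do I add authentication to a FastAPI endpoint?", "/tutorial/security/"),
    ("What is the default response status code?", "/tutorial/response-status-code/"),
    ("How do I define query parameters?", "/tutorial/query-params/"),
    ("What Python types does FastAPI support for request bodies?", "/tutorial/body/"),
    ("How do I handle file uploads?", "/tutorial/request-files/"),
]
```

Matching on a URL substring rather than an exact URL keeps the check robust to trailing slashes and localized doc mirrors.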
**Why this is the most important metric:**
Pages/second measures how fast the pipe runs. Retrieval accuracy measures whether the pipe produces useful output. A crawler that's 10x faster but produces chunks that can't answer questions is worthless for RAG. This single metric — "does retrieval work?" — validates the entire pipeline: crawl quality, cleaning quality, chunk coherence, and embedding usefulness.
**No Supabase needed:** The similarity search runs in memory using numpy. The test is self-contained and reproducible.
```python
# Pseudocode for retrieval test: embed(), test_queries, and all_chunks
# come from the earlier pipeline stages
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

hits = 0
for query, expected_url in test_queries:
    query_vec = embed(query)
    scores = [(cosine_similarity(query_vec, chunk.vec), chunk) for chunk in all_chunks]
    # Sort by score only; sorting the raw tuples would compare chunk objects on ties
    top_3 = sorted(scores, key=lambda pair: pair[0], reverse=True)[:3]
    hits += any(expected_url in chunk.url for _, chunk in top_3)

print(f"{hits}/{len(test_queries)} queries returned the correct page in top 3")
```
### Why this matters for positioning
No other single tool in the comparison offers this pipeline. The message isn't "we're faster at crawling" — it's "we're the only tool where `pip install markcrawl` gets you from URL to searchable vector database in 3 commands." The pipeline benchmark quantifies that value with real numbers.