Evaluating Retrieval

If you can't measure retrieval quality, you can't improve it. Build the eval harness before the product.

The two metrics that matter:

Recall@K: of all the chunks that should be returned for a query, what fraction were in the top K results?
Precision@K: of the top K results, what fraction are actually relevant?
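
As a minimal sketch, both metrics can be computed from a ranked list of retrieved chunk IDs and a set of hand-labeled relevant IDs. The function names and chunk IDs below are illustrative assumptions, not part of any particular library.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0  # assumption: a query with no labeled chunks scores 0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0  # assumption: an empty result list scores 0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by the number of results actually returned, which may be < k.
    return hits / len(top_k)


# Hypothetical ranked results and labels for one query.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(recall_at_k(retrieved, relevant, k=3))
print(precision_at_k(retrieved, relevant, k=3))
```

Raising K can only keep recall the same or push it up, while precision usually falls, so it helps to report both at the K your application actually passes downstream.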

Build a golden set: 50–100 representative queries, each hand-labeled with the chunks that should be returned. Run it after every retrieval change. Without it, you have no reliable way to tell whether a chunking or embedding change helped.
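
A harness for this can be quite small. The sketch below assumes a golden set stored as a mapping from query to labeled chunk IDs and a retriever exposed as a function `retrieve(query, k)`. Both shapes, and every name and ID in the example, are hypothetical stand-ins for whatever your system actually uses.

```python
from statistics import mean
from typing import Callable, Dict, List, Set

# Hypothetical golden set: query -> IDs of chunks a human labeled as relevant.
# A real one would hold 50-100 queries, typically loaded from a file.
GOLDEN_SET: Dict[str, Set[str]] = {
    "how do I rotate an API key?": {"doc3#2", "doc3#3"},
    "what is the refund window?": {"doc7#1"},
}


def evaluate(
    retrieve: Callable[[str, int], List[str]],
    golden: Dict[str, Set[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and precision@k over every query in the golden set.

    Assumes each golden query has at least one labeled chunk.
    """
    recalls, precisions = [], []
    for query, relevant in golden.items():
        top_k = retrieve(query, k)[:k]
        hits = len(set(top_k) & relevant)
        recalls.append(hits / len(relevant))
        precisions.append(hits / len(top_k) if top_k else 0.0)
    return {"recall": mean(recalls), "precision": mean(precisions)}


def fake_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever with canned results; swap in your real search call."""
    canned = {
        "how do I rotate an API key?": ["doc3#2", "doc1#1", "doc3#3"],
        "what is the refund window?": ["doc2#4", "doc5#2", "doc9#1"],
    }
    return canned.get(query, [])[:k]


report = evaluate(fake_retrieve, GOLDEN_SET, k=3)
print(report)
```

Saving the report from each run alongside the change that produced it makes it easy to compare a new chunking or embedding setup against the previous baseline.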
