The two metrics that matter:
- Recall@K: of all the chunks that should be returned for a query, what fraction were in the top K results?
- Precision@K: of the top K results, what fraction are actually relevant?
Build a golden set — 50–100 representative queries with hand-labeled correct answers. Run it after every retrieval change. This is the only way to know if a chunking or embedding change helped.