
Evaluating RAG Systems: A Complete Framework

RAGAS, TruLens, and Phoenix all claim to evaluate your RAG. Here's what they actually measure.

By Creative Genius · 8 min read

You can't improve a RAG system you can't measure. The good news: there are excellent open-source eval frameworks. The bad news: each measures different things and the marketing makes them sound interchangeable. They're not.

The four metrics that matter

  1. Faithfulness — does the generated answer match the retrieved context? Catches hallucinations.
  2. Answer relevance — does the answer actually address the question? Catches off-topic generation.
  3. Context precision — are the retrieved chunks actually relevant? Catches retrieval noise.
  4. Context recall — did we retrieve all the chunks that contain the answer? Catches retrieval gaps.

Faithfulness + relevance score your generator. Precision + recall score your retriever. You need both halves to know what to fix.
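
In the frameworks below, all four are typically scored by an LLM judge, but the retrieval pair reduces to two simple ratios. A minimal sketch, assuming you have labeled which chunk IDs are relevant for each question (the function names and ID scheme here are illustrative):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # Precision: what fraction of what we retrieved is actually relevant?
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # Recall: what fraction of the relevant chunks did we retrieve at all?
    if not relevant_ids:
        return 1.0
    retrieved = set(retrieved_ids)
    hits = sum(1 for cid in relevant_ids if cid in retrieved)
    return hits / len(relevant_ids)
```

Faithfulness and answer relevance have no closed form like this, which is exactly why the frameworks lean on LLM-as-judge.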

RAGAS — opinionated, fast to start

Measures all four with LLM-as-judge out of the box. Best for getting to a baseline in an afternoon. Less customizable when you want to test domain-specific qualities.
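
A minimal sketch of a RAGAS run (API as of the ragas 0.1.x releases; metric and function names shift between versions, the sample data is made up, and the judge calls need an LLM key configured, e.g. OPENAI_API_KEY):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# One toy row; a real eval set would have 50-200 of these.
data = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Customers may request a refund within 30 days."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # one score per metric, each in [0, 1]
```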

TruLens — programmatic, more flexible

Lets you define custom feedback functions. Better for production systems where the off-the-shelf metrics don't quite fit your domain. Steeper learning curve.
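
A sketch of what a custom feedback function looks like (using trulens_eval's Feedback wrapper; the domain rule is hypothetical, and selector names vary across versions):

```python
from trulens_eval import Feedback

# Hypothetical domain rule: answers must cite a policy section.
# A custom feedback function is just a callable returning a score in [0, 1].
def cites_policy_section(output: str) -> float:
    return 1.0 if "Section" in output else 0.0

# Wrap it so TruLens records it alongside the built-in metrics;
# .on_output() selects the app's generated answer.
f_policy = Feedback(cites_policy_section).on_output()
```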

Phoenix (Arize) — eval + tracing in one tool

The best choice if you also need distributed tracing of your LLM pipeline. Its evals are slightly less mature than RAGAS's, but they're improving fast.
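
Getting the tracing side running is a single call (assuming the arize-phoenix package; instrumenting your pipeline, e.g. with OpenInference instrumentors, is a separate step):

```python
import phoenix as px

# Start the local Phoenix UI; traces from an instrumented LLM pipeline
# stream in as requests flow through it.
session = px.launch_app()
print(session.url)  # open in a browser to inspect traces and evals
```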

The eval set is more important than the framework

50–200 question-answer pairs covering your real query distribution beats any framework choice. Spend the day collecting those before you spend the week comparing tools. Re-sample your eval set quarterly — drift is real and quietly degrades quality.
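
For concreteness, one record in such an eval set might look like this (field names are illustrative; store them however your framework expects):

```python
record = {
    "question": "What is the refund window?",  # a real query from production logs
    "ideal_answer": "Refunds are accepted within 30 days of purchase.",
    "ideal_chunk_ids": ["policy_refunds_01"],  # expert-labeled relevant chunks
}
```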

What to do this sprint

  • Collect 50 real user queries from production logs.
  • Have a domain expert write the ideal answer + ideal retrieved chunks for each.
  • Stand up RAGAS to measure all four metrics.
  • Track those four numbers over time. Block any pipeline change that drops any metric by more than 2 points (a minimal gate is sketched below).
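
A minimal sketch of that gate, assuming scores on a 0–100 scale (use 0.02 as the threshold if your framework reports 0–1):

```python
def passes_gate(baseline: dict[str, float], candidate: dict[str, float],
                max_drop: float = 2.0) -> bool:
    # Block the change if any metric regresses by more than max_drop points.
    return all(candidate[m] >= baseline[m] - max_drop for m in baseline)
```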

Bottom line

Pick the framework whose UX you'll actually use. The eval set you build matters 10× more than the tool you build it in.
