Evals: The Discipline That Separates AI Hobbyists from Builders
Without evals, you're guessing. Here's how to build a useful eval harness in under a day.
An eval is a test for an LLM call: inputs, expected behavior, automated grading. Without it, every prompt change is folklore — "I think this works better" with no proof. With it, you can ship LLM features the same way you ship any other software.
The three-tier eval pyramid
- Deterministic checks — string match, JSON schema validation, regex, code execution. Fast, free, catches the obvious. Should be ~70% of your evals (a sketch follows this list).
- LLM-as-judge — use a stronger model (e.g. GPT-4o judging GPT-4o-mini output) to grade subjective qualities like tone, helpfulness, factuality. Run on every PR (see the judge sketch after this list).
- Human review queue — sample 1–5% of production traffic into a labeling tool. This is the ground truth that anchors the other two tiers.
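Here is a minimal sketch of a tier-one deterministic grader, assuming the `jsonschema` package and a hypothetical `case` dict whose `output`, `expected_regex`, and `schema` fields are illustrative:

```python
import json
import re

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical case shape:
# {"output": "<model text>", "expected_regex": "...", "schema": {...}}
def grade_deterministic(case: dict) -> bool:
    """Tier 1: cheap checks that need no model call."""
    output = case["output"]

    # 1. Regex / string match on the raw text.
    if "expected_regex" in case and not re.search(case["expected_regex"], output):
        return False

    # 2. JSON schema validation for structured output.
    if "schema" in case:
        try:
            validate(instance=json.loads(output), schema=case["schema"])
        except (json.JSONDecodeError, ValidationError):
            return False

    return True
```

Each check runs in milliseconds, which is what lets this tier carry ~70% of your coverage for free.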
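And a sketch of a tier-two judge, assuming the OpenAI Python SDK with an `OPENAI_API_KEY` in the environment; the rubric and pass threshold here are illustrative, not prescriptive:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an assistant's reply.
Question: {question}
Reply: {reply}
Score the reply 1-5 for helpfulness and factual accuracy.
Answer with a single digit only."""

def judge(question: str, reply: str, threshold: int = 4) -> bool:
    """Tier 2: a stronger model grades a weaker model's output."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge with a stronger model than the one under test
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reply=reply),
        }],
        temperature=0,
    )
    score = int(resp.choices[0].message.content.strip()[0])
    return score >= threshold
```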
The minimum viable harness
You can stand this up in one day:
- A JSON file of 50 input/expected-output pairs.
- A script that runs the prompt against each input, calls the grader, and outputs a pass/fail summary (see the sketch after this list).
- A CI job that blocks merges if pass rate drops more than 2 points from baseline.
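A sketch of the whole harness, assuming a hypothetical `cases.json`, a `run_prompt` function wrapping your LLM call, and a `grade` function like the ones above; the baseline constant and 2-point gate are illustrative:

```python
import json
import sys

from my_app import grade, run_prompt  # hypothetical: your LLM call and grader

BASELINE_PASS_RATE = 0.92  # last known-good rate; could also live in a file
MAX_DROP = 0.02            # block the merge if we fall more than 2 points

def main() -> None:
    cases = json.load(open("cases.json"))  # [{"input": ..., "expected": ...}, ...]
    passed = sum(grade(run_prompt(c["input"]), c["expected"]) for c in cases)
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({rate:.1%})")
    if rate < BASELINE_PASS_RATE - MAX_DROP:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the merge

if __name__ == "__main__":
    main()
```

The non-zero exit code is the entire integration story: any CI system that runs the script will block the merge when it fails.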
That's it. Frameworks like Promptfoo, Phoenix, Langfuse, and OpenAI's Evals add UX and dashboards on top. Pick one when you've outgrown the JSON file.
What to actually measure
For most production LLM features, the metrics that matter are: tool-call correctness, structured-output schema compliance, refusal appropriateness, factual accuracy on a closed set, and latency p95. Skip vibes-based metrics until you've nailed these.
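For example, tool-call correctness and p95 latency can both be computed from a plain list of recorded calls; the record shape below is an assumption for illustration:

```python
import statistics

# Assumed record shape per eval case:
# {"tool": "search_flights", "args": {...},
#  "expected_tool": "search_flights", "expected_args": {...}, "latency_ms": 840}

def tool_call_correct(rec: dict) -> bool:
    """Right tool, and every expected argument present with the right value."""
    return rec["tool"] == rec["expected_tool"] and all(
        rec["args"].get(k) == v for k, v in rec["expected_args"].items()
    )

def latency_p95(records: list[dict]) -> float:
    """95th-percentile latency across all recorded calls."""
    return statistics.quantiles([r["latency_ms"] for r in records], n=100)[94]
```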
The hidden ROI
Evals are the single best onboarding tool for new engineers. A new hire can ship prompt changes confidently in week one because the harness will catch regressions. Without evals, every prompt change requires the original author's intuition.
Bottom line
If you have an LLM in production without evals, you don't have a product — you have a science experiment. Fix it this sprint.