LLM-as-judge: ask GPT-4o "rate this response 1–5 for clarity." It works, but only if you do it carefully.
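The basic pattern is a single API call. A minimal sketch using the official `openai` Python client (the function name and prompt wording here are illustrative, not a fixed recipe):

```python
# Minimal LLM-as-judge call: ask a model to grade a response for clarity.
# Sketch only; assumes the `openai` package is installed and an API key is
# set in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def judge_clarity(question: str, response: str) -> int:
    """Ask GPT-4o to rate `response` 1-5 for clarity; returns the score."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with a single digit 1-5."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nResponse:\n{response}\n\n"
                        "Rate this response 1-5 for clarity."},
        ],
    )
    # Take the first character so stray trailing text doesn't break parsing.
    return int(completion.choices[0].message.content.strip()[0])
```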
Three rules
- Use a different (and stronger) model than the one being graded. Otherwise you're asking a model to grade itself, and models tend to rate their own style of output more favorably.
- Give a rubric, not a bare scale. "1 = misses the question, 3 = answers but ignores edge cases, 5 = nails the edge cases" is far more reliable than an unanchored "1–5" (see the prompt sketch after this list).
- Sanity-check the judge on a human-labeled sample. If it agrees with human labels less than ~70% of the time, your rubric is broken (see the agreement check sketched below).
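For the rubric rule, here is what an anchored prompt might look like. The rubric wording is illustrative; the point is that every score is tied to a concrete, checkable behavior. Swap this in for the bare prompt in `judge_clarity` above:

```python
# Rubric-anchored judge prompt (illustrative wording; adapt to your task).
RUBRIC_PROMPT = """Grade the response against this rubric. Reply with a single digit.
1 = misses the question entirely
2 = addresses the question but is confusing or wrong in places
3 = answers the question but ignores edge cases
4 = answers correctly and handles most edge cases
5 = answers correctly and nails the edge cases

Question:
{question}

Response:
{response}

Score:"""
```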
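And for the sanity check, a sketch of measuring judge/human agreement. The data format is an assumption (a list of `(question, response, human_score)` tuples); exact-match agreement is the strictest option, so relax it if your rubric has adjacent-score ambiguity:

```python
# Compare judge scores against human labels on a held-out sample.
def judge_agreement(labeled, judge) -> float:
    """Fraction of examples where the judge's score exactly matches the human label.

    `labeled` is a list of (question, response, human_score) tuples;
    `judge` is a callable like judge_clarity above.
    """
    hits = sum(
        1
        for question, response, human_score in labeled
        if judge(question, response) == human_score
    )
    return hits / len(labeled)

# Usage: if agreement falls below 0.70, fix the rubric before trusting the judge.
# rate = judge_agreement(sample, judge_clarity)
# assert rate >= 0.70, f"Judge agrees with humans only {rate:.0%}; rubric is broken."
```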