LLM-as-judge: ask GPT-4o "rate this response 1–5 for clarity." It works, but only if you do it carefully.
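The basic pattern is a single API call. A minimal sketch using the official `openai` Python client (the function name and prompt wording here are illustrative, not a fixed recipe):

```python
# Minimal LLM-as-judge call: ask a model to grade a response for clarity.
# Sketch only; assumes the `openai` package is installed and an API key is
# set in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def judge_clarity(question: str, response: str) -> int:
    """Ask GPT-4o to rate `response` 1-5 for clarity; returns the score."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with a single digit 1-5."},
            {"role": "user",
             "content": f"Question:\n{question}\n\nResponse:\n{response}\n\n"
                        "Rate this response 1-5 for clarity."},
        ],
    )
    # Take the first character so stray trailing text doesn't break parsing.
    return int(completion.choices[0].message.content.strip()[0])
```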
Three rules
- Use a different (and stronger) model than the one being graded. Otherwise you're asking a model to grade itself, and models tend to rate their own style of output more favorably.
- Give a rubric, not a bare scale. "1 = misses the question, 3 = answers but ignores edge cases, 5 = nails the edge cases" is far more reliable than an unanchored "1–5" (see the prompt sketch after this list).
- Sanity-check the judge on a human-labeled sample. If it agrees with human labels less than ~70% of the time, your rubric is broken (see the agreement check sketched below).
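For the rubric rule, here is what an anchored prompt might look like. The rubric wording is illustrative; the point is that every score is tied to a concrete, checkable behavior. Swap this in for the bare prompt in `judge_clarity` above:

```python
# Rubric-anchored judge prompt (illustrative wording; adapt to your task).
RUBRIC_PROMPT = """Grade the response against this rubric. Reply with a single digit.
1 = misses the question entirely
2 = addresses the question but is confusing or wrong in places
3 = answers the question but ignores edge cases
4 = answers correctly and handles most edge cases
5 = answers correctly and nails the edge cases

Question:
{question}

Response:
{response}

Score:"""
```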
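And for the sanity check, a sketch of measuring judge/human agreement. The data format is an assumption (a list of `(question, response, human_score)` tuples); exact-match agreement is the strictest option, so relax it if your rubric has adjacent-score ambiguity:

```python
# Compare judge scores against human labels on a held-out sample.
def judge_agreement(labeled, judge) -> float:
    """Fraction of examples where the judge's score exactly matches the human label.

    `labeled` is a list of (question, response, human_score) tuples;
    `judge` is a callable like judge_clarity above.
    """
    hits = sum(
        1
        for question, response, human_score in labeled
        if judge(question, response) == human_score
    )
    return hits / len(labeled)

# Usage: if agreement falls below 0.70, fix the rubric before trusting the judge.
# rate = judge_agreement(sample, judge_clarity)
# assert rate >= 0.70, f"Judge agrees with humans only {rate:.0%}; rubric is broken."
```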