Articles tagged “llm-as-judge”
2 articles

Testing & Evaluation · 11 min read
GPT-5, Claude 4.5, Gemini Score the Same Calls. Their Kappa Is 0.52
Run the same calls through GPT-5, Claude 4.5, and Gemini, and Cohen's kappa lands at 0.52. Here is how to measure judge agreement on your own corpus.
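For readers who want to try the measurement before opening the article, here is a minimal sketch of pairwise judge agreement. Cohen's kappa is defined for two raters, so with three judges this computes one kappa per pair; how the article arrives at a single 0.52 figure is not stated in this listing. The function names, judge names, and labels below are illustrative assumptions, not the article's code or data.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(labels_by_judge):
    """Kappa for every pair of judges, keyed by (judge_1, judge_2)."""
    return {
        (j1, j2): cohens_kappa(labels_by_judge[j1], labels_by_judge[j2])
        for j1, j2 in combinations(sorted(labels_by_judge), 2)
    }


# Hypothetical verdicts on four calls, for illustration only.
verdicts = {
    "judge_a": ["pass", "pass", "fail", "fail"],
    "judge_b": ["pass", "fail", "fail", "fail"],
    "judge_c": ["pass", "pass", "fail", "fail"],
}
scores = pairwise_kappa(verdicts)
```

Reporting the per-pair values alongside any average keeps a single disagreeing judge from being hidden by the other two.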

Testing & Evaluation · 14 min read
How to Eval Agents When There's No Right Answer
Most eval methods assume you know the correct response. CX agents rarely have one. Here's how to score agent quality with criteria-based rubrics and LLM-as-judge, no labeled ground truth required.