Articles tagged “llm-as-judge”
2 articles

Testing & Evaluation·11 min read
GPT-5, Claude 4.5, Gemini Score the Same Calls. Their Kappa Is 0.52
Run the same calls through GPT-5, Claude 4.5, and Gemini and Cohen's kappa lands at 0.52. Here is how to measure judge agreement on your own corpus.

Testing & Evaluation·14 min read
How to Eval Agents When There's No Right Answer
Most eval methods assume you know the correct response. CX agents rarely have one. Here's how to score agent quality with criteria-based rubrics and LLM-as-judge, no labeled ground truth required.