Articles tagged “quality”
6 articles

Trajectory Eval: Catch Agent Bugs Output Scoring Misses
Final-output scoring misses 20-40% of agent regressions. Trajectory evaluation scores every step an agent takes (tool calls, reasoning decisions, order of operations) and catches the bugs that output-only evals can't see.

Is AI Better Than Your Humans? Score Both on One Rubric
Most teams can't say whether AI beats humans because they score them differently. One rubric, run on both, sliced by segment, gives you an honest answer.

How Much Testing Is Enough for Your AI Agent?
Code coverage doesn't apply to AI agents. Here's a framework for thinking about evaluation coverage: how many scenarios you need, what distribution to target, and how to know when you've tested enough.

Your Call Center Handles 10,000 Calls a Day. Who's Grading Them?
AI agents handle 40% of your calls. Your QA team samples 2%. The monitoring gap between deployment and quality is where enterprise reputations break.

Your RAG Returns Wrong Answers. Upgrading the Model Won't Help
Most RAG quality problems are retrieval problems, not model problems. Bad chunking, wrong embeddings, and missing re-ranking cause more hallucinations than model capability gaps.

The Voice AI Quality Crisis: Why Most Deployments Fail in Production
Most voice AI deployments fail in production despite passing lab tests. Real data on why the gap exists, what it costs, and how to close it.