Chanl

evaluations

Browse 8 articles tagged with “evaluations”.


A person standing before multiple transparent evaluation panels in a semicircle, each showing a different lens on the same conversation
Testing & Evaluation·16 min read

Your LLM-as-judge may be highly biased

LLM-as-Judge has 12 documented biases. Here are 6 evaluation methods production teams actually use instead, with code examples and patterns.

Read More
Illustration of a quality monitoring dashboard showing score trends and alert thresholds across production AI agent conversations
Learning AI·20 min read

Production Agent Evals: Catch Score Drift, Ship Confidently

Your evals pass in staging but miss production failures. Build three eval pipelines with the Chanl SDK: automated scorecards, scenario regression, and drift detection that catches quality degradation before customers do.

Read More
Abstract visualization of a signal gradually losing coherence as it passes through layered processing stages, with early stages showing clean waveforms and later stages showing scattered, fragmented patterns
Testing & Evaluation·14 min read

Agent Drift: Why Your AI Gets Worse the Longer It Runs

AI agents silently degrade over long conversations. Research quantifies three types of drift and shows why point-in-time evals miss them entirely.

Read More
Illustration of a balance scale tilted by invisible weights, representing hidden biases in AI evaluation systems
Learning AI·18 min read

12 Ways Your LLM Judge Is Lying to You

Research identifies 12 systematic biases in LLM-as-a-judge systems. Learn to detect and mitigate each one before they corrupt your eval pipeline.

Read More
Visualization of the widening gap between AI agent capability scores and reliability metrics across model generations
Learning AI·15 min read

Your Agent Is Getting Smarter. It's Not Getting More Reliable.

Reliability improves at half the rate of accuracy, and three tools that each score 85%+ combine to just 74%. Here's the math, the research, and the testing protocols that close the gap.

Read More
Data visualization showing the gap between AI agent benchmark scores and production performance metrics
Testing & Evaluation·13 min read

Your Agent Aced the Benchmark. Production Disagreed.

We scored 92% on GAIA. Production CSAT: 64%. Here's which AI agent benchmarks actually predict deployed performance, why most don't, and what to measure instead.

Read More
Illustration of a team evaluating AI agent quality through structured testing scenarios
Testing & Evaluation·24 min read

AI Agent Testing: How to Evaluate Agents Before They Talk to Customers

A practical guide to testing AI agents before production — scenario-based testing with AI personas, scorecard evaluation, regression suites, edge case generation, and CI/CD integration.

Read More
Illustration of two people reviewing an improvement chart together at a standing desk
Learning AI·20 min read

How to Evaluate AI Agents: Build an Evaluation Framework from Scratch

Build a working AI agent evaluation framework in TypeScript and Python. Covers LLM-as-judge, rubric scoring, regression testing, and CI integration.

Read More

Learn Agentic AI

One lesson per week: practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed