
How to Run the Agent Development Lifecycle (ADLC) in Production

Shipping an AI agent is easy. Keeping it reliable after launch is hard. The ADLC walks you through Intent, Build, Evaluate, Deploy, Observe, then back around.

Dean Grover, Co-founder
May 13, 2026

The demo worked perfectly. Smooth responses, correct tool calls, happy reviewers. You deployed to production feeling confident.

Six weeks later, a customer calls to complain. Then two more. You check the conversation logs and find patterns: the agent is mishandling a scenario it used to handle fine. The prompt hasn't changed. The tools haven't changed. The model hasn't changed. But something has shifted, and you don't have a clear picture of when it started or why.

This is the default trajectory for most agent deployments. Not a catastrophic failure. A quiet drift. And the reason it happens isn't a technical problem. It's a process problem.

The Agent Development Lifecycle (ADLC) is the framework that closes this gap. It treats agent deployment not as a finish line but as the start of a continuous improvement loop. Here's how it works, why it's different from what most teams do, and what the implementation actually looks like.

What Is the Agent Development Lifecycle?

The ADLC is a way of building and operating AI agents in production. It defines five phases (Intent, Build, Evaluate, Deploy, Observe) arranged in a continuous loop where the output of each cycle feeds the input of the next.

The critical difference from traditional software development: ADLC assumes agents don't reach a stable final state. User behavior changes. Edge cases accumulate. The real world diverges from the controlled environment you designed in. An agent that was excellent at launch degrades to mediocre over months, not because you changed anything, but because the world moved and the agent didn't.

Observation isn't a phase that starts after deployment and ends before the next build. It runs permanently, alongside every other phase, generating the signal that determines what needs to change in the next cycle. That continuous observation layer is what most teams skip. It's also what separates agents that improve over time from agents that quietly decay.

  • Intent: Define the bets
  • Build: Configure tools, memory, prompts
  • Evaluate: Offline evals and CI gates
  • Deploy: Shadow mode and canary
  • Observe: Production monitoring
  • Loop back: New test cases from failures
The five phases of the ADLC, forming a continuous improvement loop

Phase 1: Intent. Define the Problem With Precision

Before building anything, you need a precise definition of what success looks like. Not "the agent should handle customer support." That's a direction, not a target. The Intent phase produces a specific, testable statement of what the agent should do and how you'll know it's doing it.

A useful Intent definition includes: the tasks the agent must complete, the tasks it must hand off, the quality level required for each, and the failure modes that are acceptable versus unacceptable. For a billing support agent: handle balance inquiries, payment processing, and plan changes autonomously; escalate disputes over $500 and all fraud reports; maintain a resolution rate above 80%; never misquote a balance.
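
One way to keep an Intent definition testable is to write it down as data. The sketch below encodes the billing example from this section. Every name in it (`IntentDefinition`, `mustEscalate`, the task labels) is hypothetical and not part of any SDK, and it assumes disputes at or below the limit can stay with the agent.

```typescript
// Hypothetical shape for an Intent definition. None of these names come
// from a real SDK; the values mirror the billing example above.
interface IntentDefinition {
  autonomousTasks: string[];      // tasks the agent must complete itself
  alwaysEscalate: string[];       // tasks it must always hand off
  disputeEscalationUsd: number;   // disputes above this amount go to a human
  minResolutionRate: number;      // required quality level, from 0 to 1
  unacceptableFailures: string[]; // failure modes that are never tolerated
}

const billingIntent: IntentDefinition = {
  autonomousTasks: ["balance_inquiry", "payment_processing", "plan_change"],
  alwaysEscalate: ["fraud_report"],
  disputeEscalationUsd: 500,
  minResolutionRate: 0.8,
  unacceptableFailures: ["misquoted_balance"],
};

// Decide whether a request must be handed off under this definition.
// Assumption: disputes at or below the limit may stay with the agent.
function mustEscalate(
  intent: IntentDefinition,
  task: string,
  disputeAmountUsd: number = 0,
): boolean {
  if (intent.alwaysEscalate.indexOf(task) !== -1) return true;
  if (task === "dispute") return disputeAmountUsd > intent.disputeEscalationUsd;
  // Anything outside the autonomous list is handed off by default.
  return intent.autonomousTasks.indexOf(task) === -1;
}
```

Because the definition is plain data, the same object can later drive eval cases and escalation audits instead of living only in a document.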

The "Bet Register" concept from Arthur AI formalizes this: each improvement idea or new capability is a bet, with a hypothesis, a success metric, and a tracking mechanism. This matters because later in the cycle, when you're reviewing observation data and deciding what to change, you need something to measure against. Teams that skip the Intent phase end up with "the agent feels better" as their eval criterion, which isn't actionable.
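
A bet can be as small as a typed record plus a rule for judging it. This is a minimal sketch of that idea, not Arthur AI's format: the `Bet` fields, the `judgeBet` helper, and the example values are all made up for illustration, and it assumes a metric where higher is better.

```typescript
// Hypothetical bet register entry. Field names are illustrative only.
interface Bet {
  id: string;
  hypothesis: string; // what you expect the change to do
  metric: string;     // the metric the bet is judged on
  baseline: number;   // metric value before the change
  target: number;     // metric value that counts as success
}

type BetOutcome = "won" | "lost" | "inconclusive";

// Judge a bet against the observed metric. Assumes higher is better.
function judgeBet(bet: Bet, observed: number): BetOutcome {
  if (observed >= bet.target) return "won";
  if (observed <= bet.baseline) return "lost";
  return "inconclusive"; // moved in the right direction, but short of target
}

const clarifyPlanChanges: Bet = {
  id: "bet-001",
  hypothesis: "Asking one clarifying question lifts plan-change resolution",
  metric: "plan_change_resolution_rate",
  baseline: 0.8,
  target: 0.85,
};
```

With this in place, "the agent feels better" becomes `judgeBet(clarifyPlanChanges, observedRate)`, which returns a verdict you can act on.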

You don't need a complex document. A one-page definition of tasks, quality thresholds, and unacceptable failure modes is enough. The goal is specificity you can test against.

Phase 2: Build. Configuration, Not Just Code

Building an agent well means instrumenting it from the start, designing your evaluation dataset before your prompts, and treating memory and tools as first-class configuration rather than afterthoughts. The basics (writing a system prompt, connecting tools, testing a few example flows) get done by everyone. What gets skipped is the scaffolding that makes the Observe phase functional later.

Instrument from day one. The most common reason observation fails in production is that the agent wasn't built to emit useful signals. Tool invocations, memory lookups, decision branches, confidence levels: all of these need to be logged before you deploy, not after you discover you need them. Adding instrumentation retroactively is painful and often incomplete. If you register your tools and memory through the platform layer (rather than calling raw HTTP from your agent), this comes for free. The code below shows how to read it back during review:

Reading a call's tool history for the Observe phase·typescript
import Chanl from '@chanl/sdk'
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY! })
 
// During production review, you need the agent's full tool history per call.
// The SDK exposes this through calls.get and getTranscript: every tool
// invocation, latency, and result is already captured at the platform level
// because you registered the tool in the agent config at build time.
async function reviewCall(callId: string) {
  const { data: call } = await chanl.calls.get(callId)
  const { data: transcript } = await chanl.calls.getTranscript(callId)
 
  // Tool invocations come back with timing and outcome data attached.
  // This is the raw signal the Observe phase feeds back into evals.
  return {
    duration: call.durationMs,
    toolCalls: transcript.toolCalls,
    escalated: call.escalated,
  }
}
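
If your tools are not registered through a platform layer, you would need to emit those signals yourself. The sketch below shows one way to do that: a wrapper that records one event per tool invocation. `ToolEvent` and `instrumentTool` are hypothetical names, and the wrapper is synchronous for brevity even though real tools are usually async.

```typescript
// Hypothetical event shape; a platform SDK may capture this for you.
interface ToolEvent {
  tool: string;
  ok: boolean;
  latencyMs: number;
  error?: string;
}

// Wrap a tool function so every invocation emits exactly one event to `sink`.
// Synchronous for brevity; real tools are usually async.
function instrumentTool<A extends unknown[], R>(
  tool: string,
  fn: (...args: A) => R,
  sink: (event: ToolEvent) => void,
): (...args: A) => R {
  return (...args: A): R => {
    const start = Date.now();
    try {
      const result = fn(...args);
      sink({ tool, ok: true, latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      sink({ tool, ok: false, latencyMs: Date.now() - start, error: message });
      throw err; // record the failure, never swallow it
    }
  };
}

// Usage: events accumulate in memory here; in production they would go
// to your logging or tracing pipeline.
const toolEvents: ToolEvent[] = [];
const getBalance = instrumentTool(
  "get_balance",
  (customerId: string) => (customerId === "c_1" ? 42.5 : 0),
  (event) => toolEvents.push(event),
);
```

The design choice that matters is that the wrapper rethrows: instrumentation should observe failures without changing how the agent handles them.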

Design your evaluation dataset before writing prompts. A common pattern: teams write prompts, then scramble to build evals. The better order is inverted. Sketch out 50-100 example conversations (inputs and expected outputs) before finalizing prompt design. Writing the eval first clarifies what you're actually trying to build and surfaces edge cases you'd otherwise discover in production.
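
Writing the eval first can start as a typed list of cases and a coverage check. The sketch below assumes a hypothetical `EvalCase` shape and made-up tool names; the point is only that thin coverage becomes visible before any prompt work begins.

```typescript
// Hypothetical eval case shape; adapt the fields to your own scenarios.
interface EvalCase {
  id: string;
  category: string;        // e.g. "billing", "fraud"
  input: string;           // what the customer says
  expectedTools: string[]; // tools the agent should call, in order
  expectedOutcome: "resolved" | "escalated";
}

const evalCases: EvalCase[] = [
  { id: "bill-001", category: "billing", input: "What's my current balance?",
    expectedTools: ["get_balance"], expectedOutcome: "resolved" },
  { id: "bill-002", category: "billing", input: "Move me to the annual plan.",
    expectedTools: ["get_plan", "change_plan"], expectedOutcome: "resolved" },
  { id: "fraud-001", category: "fraud", input: "I don't recognize this charge.",
    expectedTools: [], expectedOutcome: "escalated" },
];

// Count cases per category.
function coverageByCategory(cases: EvalCase[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    counts[c.category] = (counts[c.category] || 0) + 1;
  }
  return counts;
}

// List the required categories that have fewer than `minPerCategory` cases.
function underCovered(
  cases: EvalCase[],
  required: string[],
  minPerCategory: number,
): string[] {
  const counts = coverageByCategory(cases);
  return required.filter((category) => (counts[category] || 0) < minPerCategory);
}
```

Running `underCovered(evalCases, ["billing", "fraud", "shipping"], 2)` on this tiny set flags `fraud` and `shipping`, which is the kind of gap you want to find before launch rather than after.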

Memory and tools are configuration, not afterthoughts. An agent's persistent memory and tool access determine more of its behavior than the system prompt does. Which customer history does the agent retrieve? Which tools can it call without escalating? These decisions shape the agent's capabilities and constraints. Treat them with the same care as prompt design.

Phase 3: Evaluate Before You Ship

Quality measurement before deployment (what the ADLC calls offline evaluation) is the gate between Build and Deploy. It tells you whether a change actually produced an improvement or introduced a regression before it reaches customers. What you feed the agent matters too: according to recent research, richer task descriptions improve task success rates by 5.85 percentage points.

An offline eval suite for a CX agent should cover:

  • Task completion: for a representative sample of conversation types, does the agent resolve the task?
  • Tool selection accuracy: does the agent invoke the right tools in the right order?
  • Boundary case handling: does it escalate appropriately for the cases it shouldn't handle autonomously?
  • Quality consistency: across 100 variations of the same request, does quality variance stay below a threshold?

The last one is often skipped. A single "does it work?" test doesn't tell you whether the agent is reliably good or occasionally good. Quality variance matters as much as average quality for production deployments.
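
The quality-consistency check comes down to looking at spread as well as average. Here is a minimal sketch using the population standard deviation; the function names and the thresholds in the example are assumptions, so pick values that match your own Intent definition.

```typescript
// Arithmetic mean of a non-empty list of scores.
function meanScore(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores provided");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Population standard deviation: how far scores spread around the mean.
function scoreStdDev(scores: number[]): number {
  const mu = meanScore(scores);
  const variance =
    scores.reduce((sum, s) => sum + (s - mu) * (s - mu), 0) / scores.length;
  return Math.sqrt(variance);
}

// Pass only if the average is high enough AND the spread is small enough.
function isConsistent(
  scores: number[],
  minMean: number,
  maxStdDev: number,
): boolean {
  return meanScore(scores) >= minMean && scoreStdDev(scores) <= maxStdDev;
}

// Two runs with the same average but very different reliability.
const steadyRun = [0.9, 0.9, 0.9, 0.9];
const spikyRun = [1.0, 1.0, 1.0, 0.6];
```

Both runs average about 0.9, but with `minMean = 0.8` and `maxStdDev = 0.1` only `steadyRun` passes. The spiky run hides one badly handled variation behind a good average.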

Deploy Gate, pre-deploy quality checks: score 92% (threshold > 80%), latency 234ms (threshold < 500ms), error rate 3.1% (threshold < 2%). Result: Deploy Blocked.

Running evals as a CI gate (blocking deploys when quality regresses below a threshold) is the implementation detail that makes this real rather than aspirational. Without a CI gate, eval becomes optional. With one, quality checks are part of every deployment, not an afterthought.
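
The gate decision itself is simple enough to sketch. The example below uses the three thresholds from the Deploy Gate panel (score above 80%, latency under 500ms, error rate under 2%). The type and function names are hypothetical, and how you collect the metrics is left out.

```typescript
interface GateMetrics {
  score: number;     // 0..1
  latencyMs: number;
  errorRate: number; // 0..1
}

interface GateThresholds {
  minScore: number;
  maxLatencyMs: number;
  maxErrorRate: number;
}

interface GateResult {
  passed: boolean;
  failures: string[]; // one human-readable line per failed check
}

// Every check must pass; any single failure blocks the deploy.
function evaluateGate(m: GateMetrics, t: GateThresholds): GateResult {
  const failures: string[] = [];
  if (!(m.score > t.minScore)) {
    failures.push(`score ${m.score} is not above ${t.minScore}`);
  }
  if (!(m.latencyMs < t.maxLatencyMs)) {
    failures.push(`latency ${m.latencyMs}ms is not below ${t.maxLatencyMs}ms`);
  }
  if (!(m.errorRate < t.maxErrorRate)) {
    failures.push(`error rate ${m.errorRate} is not below ${t.maxErrorRate}`);
  }
  return { passed: failures.length === 0, failures };
}

const gateThresholds: GateThresholds = {
  minScore: 0.8,
  maxLatencyMs: 500,
  maxErrorRate: 0.02,
};

// The numbers from the panel: good score, good latency, too many errors.
const gateResult = evaluateGate(
  { score: 0.92, latencyMs: 234, errorRate: 0.031 },
  gateThresholds,
);
// In CI you would turn a failed gate into a non-zero exit code, for example:
// if (!gateResult.passed) process.exit(1);
```

Returning the list of failures rather than a bare boolean makes the blocked deploy explain itself in the CI log.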

Example CI gate in a deployment pipeline·yaml
# .github/workflows/agent-eval.yml
- name: Run agent scenarios as quality gate
  run: |
    npx chanl scenarios run $REGRESSION_SUITE_ID \
      --agent $AGENT_ID \
      --watch
    # The CLI exits non-zero if the scenario fails its minScore,
    # which fails the workflow and blocks the deploy.

An eval suite that's passing doesn't mean your agent is production-ready. It means your agent is as good as your test cases require. The gap between your test suite and the real world is what the Observe phase closes over time.

Phase 4: Deploy With a Safety Net

Deployment for agents isn't a binary switch. The safest pattern is a progression: shadow mode, then canary, then full traffic.

Shadow mode runs the new agent version alongside the production version on the same inputs, without showing its responses to users. You compare outputs, tool selections, and quality scores to validate the new version before it touches any real conversations. This is the only way to test a major agent change without risking production conversations, and it's especially valuable for changes to prompts, tools, or models where the behavioral change is hard to predict.
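
Comparing shadow output against production can be sketched as a small report over paired results. The `ShadowPair` shape and the two summary numbers below are assumptions about what you would log, not a prescribed format.

```typescript
// One conversation, answered by both versions on the same input.
interface ShadowPair {
  conversationId: string;
  prodTools: string[];   // tools the production version called, in order
  shadowTools: string[]; // tools the shadow version called, in order
  prodScore: number;     // quality score, 0..1
  shadowScore: number;
}

interface ShadowReport {
  toolAgreementRate: number; // share of pairs with identical tool sequences
  meanScoreDelta: number;    // shadow minus production, averaged
  disagreements: string[];   // conversation IDs worth a manual look
}

function sameToolSequence(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((tool, i) => tool === b[i]);
}

function compareShadow(pairs: ShadowPair[]): ShadowReport {
  if (pairs.length === 0) throw new Error("no shadow pairs to compare");
  let agreeing = 0;
  let deltaSum = 0;
  const disagreements: string[] = [];
  for (const p of pairs) {
    if (sameToolSequence(p.prodTools, p.shadowTools)) {
      agreeing += 1;
    } else {
      disagreements.push(p.conversationId);
    }
    deltaSum += p.shadowScore - p.prodScore;
  }
  return {
    toolAgreementRate: agreeing / pairs.length,
    meanScoreDelta: deltaSum / pairs.length,
    disagreements,
  };
}
```

A disagreement is not automatically a regression, since the new version may be calling a better tool. The list of IDs exists so a person can review exactly those conversations.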

The shadow deployment pattern is worth reading in full if you're implementing this for the first time. The key is that shadow mode isn't optional for consequential changes. It's the default.

Canary deployment routes a small percentage of production traffic (5 to 10 percent) to the new version. This exposes real user variability (accents, unusual phrasing, edge cases your test suite didn't cover) in a controlled way. If quality metrics hold at canary level, roll to full traffic. If they don't, roll back before the damage scales.
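
One common way to split canary traffic is to hash a stable identifier into a bucket, so a given conversation always lands on the same version. This sketch uses a 32-bit FNV-1a hash over the ID's UTF-16 code units. The function names are hypothetical, and a real router would usually live in your gateway rather than in application code.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: small and deterministic,
// which is all bucketing needs. Not suitable for security purposes.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

type AgentVersion = "stable" | "canary";

// Route a conversation to the canary if its bucket (0..99) falls below
// the configured percentage. The same ID always gets the same answer.
function routeConversation(
  conversationId: string,
  canaryPercent: number,
): AgentVersion {
  if (canaryPercent < 0 || canaryPercent > 100) {
    throw new RangeError("canaryPercent must be between 0 and 100");
  }
  const bucket = fnv1a(conversationId) % 100;
  return bucket < canaryPercent ? "canary" : "stable";
}
```

Because routing depends only on the ID and the percentage, raising the canary from 5 to 10 percent keeps every conversation that was already on the canary there, and rolling back is a one-line change to `canaryPercent`.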

Define your rollback triggers before deploying, not after you notice problems. "If task completion rate drops more than 5 points from baseline in a 4-hour window, roll back automatically" is a trigger. "We'll watch it and decide" is not.
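
The example trigger above translates almost directly into code. In this sketch the sample shape, the names, and the `minSamples` guard are assumptions added for illustration; the 5-point drop and 4-hour window come from the trigger as worded.

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface CompletionSample {
  timestampMs: number;
  completed: boolean; // did the conversation end with the task completed?
}

interface RollbackTrigger {
  baselineRate: number;  // baseline completion rate, in percentage points
  maxDropPoints: number; // roll back if the drop exceeds this many points
  windowMs: number;      // how far back to look
  minSamples: number;    // assumption: don't act on too little data
}

// True when the completion rate inside the window has dropped more than
// `maxDropPoints` below the baseline.
function shouldRollback(
  samples: CompletionSample[],
  trigger: RollbackTrigger,
  nowMs: number,
): boolean {
  const recent = samples.filter(
    (s) => s.timestampMs > nowMs - trigger.windowMs && s.timestampMs <= nowMs,
  );
  if (recent.length < trigger.minSamples) return false;
  const completed = recent.filter((s) => s.completed).length;
  const rate = (100 * completed) / recent.length;
  return trigger.baselineRate - rate > trigger.maxDropPoints;
}

// "More than 5 points below baseline in a 4-hour window."
const completionTrigger: RollbackTrigger = {
  baselineRate: 82,
  maxDropPoints: 5,
  windowMs: 4 * HOUR_MS,
  minSamples: 50,
};
```

The `minSamples` guard is a judgment call: without it, three failed conversations at 3 a.m. could roll back a healthy release.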

Phase 5: Observe Continuously

Observation is the phase most teams treat as optional. It's the most important one.

A well-monitored agent tells you what's actually happening in production: which conversations succeed, which ones fail, which tool call patterns signal trouble before users do. A poorly monitored agent generates alerts when it crashes and silence when it quietly degrades.

Five things to track in production monitoring:

Task completion rate: What percentage of conversations end with the customer's issue resolved? This is the top-level health metric. Track it by conversation type (billing, shipping, account) to catch category-specific regressions.

Tool call patterns: Are tools being called in the expected order? Are any tools being called unusually frequently or infrequently? Unexpected patterns are early signals of behavioral drift, often visible in tool call data days before they show up in customer satisfaction metrics.

Quality scores at scale: Manual review of every conversation isn't possible at production volume. AI-powered scoring against a rubric gives you quality signal across 100% of conversations, not a sampled subset. Run scores on key dimensions: resolution accuracy, response appropriateness, escalation judgment.

Escalation analysis: Are escalations happening for the right reasons? High escalation rates on topics the agent should handle autonomously indicate a capability gap. Low escalation rates on topics where escalation is required indicate the agent is overstepping. Both are problems.

Conversation exit points: Where do customers disengage? If customers are dropping off mid-conversation on a particular intent type, that's a UX or quality problem localized to that flow.
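
Two of these signals, completion rate by conversation type and escalation analysis, can be computed from a flat list of conversation records. The sketch below assumes a hypothetical `ConversationRecord` shape where `escalationRequired` has already been labeled against your Intent definition.

```typescript
interface ConversationRecord {
  id: string;
  intentType: string;          // e.g. "billing", "shipping", "account"
  resolved: boolean;
  escalated: boolean;
  escalationRequired: boolean; // what the Intent definition says should happen
}

// Completion rate per conversation type, as a fraction from 0 to 1.
function completionRateByType(
  records: ConversationRecord[],
): Record<string, number> {
  const totals: Record<string, number> = {};
  const resolved: Record<string, number> = {};
  for (const r of records) {
    totals[r.intentType] = (totals[r.intentType] || 0) + 1;
    resolved[r.intentType] = (resolved[r.intentType] || 0) + (r.resolved ? 1 : 0);
  }
  const rates: Record<string, number> = {};
  for (const kind of Object.keys(totals)) {
    rates[kind] = resolved[kind] / totals[kind];
  }
  return rates;
}

interface EscalationAudit {
  unnecessary: string[]; // escalated but should have been handled: capability gap
  missed: string[];      // handled but should have been escalated: overstepping
}

function auditEscalations(records: ConversationRecord[]): EscalationAudit {
  const audit: EscalationAudit = { unnecessary: [], missed: [] };
  for (const r of records) {
    if (r.escalated && !r.escalationRequired) audit.unnecessary.push(r.id);
    if (!r.escalated && r.escalationRequired) audit.missed.push(r.id);
  }
  return audit;
}
```

Keeping the two escalation errors in separate lists matters because they call for opposite fixes: one expands what the agent can do, the other tightens its boundaries.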

The signal from these five areas isn't just for reporting. It directly feeds Phase 1 (Intent) of the next cycle.

Closing the Feedback Loop: Production Failures as Test Cases

Every production failure is a test case you didn't have before. Converting those failures into eval suite entries is what makes the ADLC compound: each cycle makes the next evaluation tighter, each deployed fix retires a failure mode for good, and the gap between your test suite and the real world narrows over time. The system that turns observations into improvements has three steps:

Step 1: Surface failure modes. From your monitoring data, identify conversations that ended in unexpected escalation, customer dissatisfaction, task failure, or anomalous tool usage. These are your candidates for investigation. You don't need to review all of them. Prioritize by frequency and severity.

Step 2: Convert failures to labeled test cases. For each failure mode you want to prevent, create a scenario from it: the input that triggered the failure, the output that was wrong, and the expected correct output. The annotation work is the part teams underestimate. Without it, your eval suite drifts behind production reality and stops being a reliable gate. Once the scenario is in your suite, fixes must pass it to ship:

Re-running a scenario after the failing call is fixed·typescript
import Chanl from '@chanl/sdk'
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY! })
 
// Flow: a flagged production call gets turned into a scenario (in the dashboard
// or via your scenarios YAML). Then your fix needs to pass that scenario before
// it can be re-deployed. The SDK runs the scenario; CI gates on the result.
async function verifyFixForScenario(scenarioId: string, agentId: string) {
  const { data: execution } = await chanl.scenarios.run({
    scenarioId,
    agentId,
  })
 
  if (!execution.passed) {
    throw new Error(
      `Regression: scenario ${scenarioId} failed with score ${execution.score}`,
    )
  }
 
  return execution
}

Step 3: Fix and verify. Change the agent configuration, prompt, or tools to address the root cause. Re-run the eval suite, which now includes the new test case, and verify the fix passes before deploying. The new test case stays in the suite permanently, so the same failure mode can't recur silently.
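
The first two steps can be sketched as plain data transformations: rank failure modes by frequency and severity, then turn each one into a labeled case. All names here are hypothetical, and the multiplicative priority score is just one reasonable choice. Note that `expectedOutput` is passed in by hand, because that annotation is the human part of the work.

```typescript
interface FailureMode {
  id: string;
  description: string;
  occurrences: number;   // how often it showed up in production
  severity: 1 | 2 | 3;   // 3 = most severe
  exampleInput: string;  // the input that triggered the failure
  wrongOutput: string;   // what the agent actually said or did
}

interface LabeledTestCase {
  id: string;
  input: string;
  rejectedOutput: string; // the known-bad answer
  expectedOutput: string; // the correct answer, written by an annotator
  sourceFailureId: string;
}

// Rank by frequency times severity, highest first. Returns a new array.
function prioritizeFailures(failures: FailureMode[]): FailureMode[] {
  return failures
    .slice()
    .sort((a, b) => b.occurrences * b.severity - a.occurrences * a.severity);
}

// Convert one failure into a test case. The expected output must be
// supplied by a person; it cannot be derived from the failure itself.
function toTestCase(
  failure: FailureMode,
  expectedOutput: string,
): LabeledTestCase {
  return {
    id: `regression-${failure.id}`,
    input: failure.exampleInput,
    rejectedOutput: failure.wrongOutput,
    expectedOutput,
    sourceFailureId: failure.id,
  };
}
```

Keeping `sourceFailureId` on every case gives you a trail from each test back to the production conversation that motivated it.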

This loop compounds over time. An agent with 50 test cases at launch can have 500 within six months of active monitoring, each one encoding a real failure mode that can no longer slip back in unnoticed. The eval suite becomes a record of everything the agent has learned to handle correctly.

The goal, as described in the online vs offline evals framework, is to keep both running simultaneously: offline evals to catch regressions before deploy, online evals (or monitoring-derived evals) to surface what's going wrong right now.

Why Do Most ADLC Implementations Stall?

The most common failure pattern isn't skipping phases. It's treating the ADLC as linear rather than circular.

Teams complete Intent, Build, Evaluate, Deploy, and then move on to the next feature. Observation becomes a checkbox ("we set up dashboards") rather than an active signal generator. The feedback loop never closes. The eval suite from launch stays static for six months. The agent improves on features the team actively builds while quietly degrading on everything else.

The fix is structural: assign someone to own the Observe phase. Not to react to outages, but to actively review conversation data weekly, create test cases from what they find, and feed the results into the next Build cycle. At smaller teams this is part of the engineering role. At larger teams it becomes a dedicated function.

Cisco's February 2026 acquisition of Galileo (an agent observability platform) is a signal of where enterprise investment is going. AI agent observability is maturing from a developer tool into infrastructure, the same trajectory application monitoring followed a decade ago. The teams that build the observation discipline now will have a significant lead when it becomes table stakes.

The Observe Phase Connects to Everything Else

The ADLC isn't a waterfall with observation tacked on at the end. It's a loop where monitoring data actively drives what you build next. A well-run Observe phase:

  • Generates the failure cases that improve your eval suite (closing the Build feedback loop)
  • Surfaces the new intent types that users bring to your agent (feeding the next Intent phase)
  • Validates that the changes you deployed in the last cycle actually worked (completing the Deploy verification)

Chanl's platform is designed around this loop. Analytics and monitoring surface production signal, scorecards apply quality measurement at scale, and scenarios turn that signal into eval cases for the next build cycle. The connection between Observe and Build is what makes each iteration faster and more targeted than the last.

The teams winning with AI agents in 2026 aren't the ones who shipped fastest. They're the ones who built the loop. Shipping is genuinely the easy part.

Close the loop between observation and improvement

Track quality across every conversation, surface failure modes automatically, and turn production data into better agent behavior. The ADLC in practice.

