What is the Agent Development Lifecycle (ADLC)?

The ADLC is a structured methodology for building and operating AI agents in production. Unlike a traditional software development lifecycle, ADLC is explicitly circular. Agents don't reach a stable final state; they keep being improved based on how they behave in production. The five phases are Intent (define what the agent should do), Build (configure tools, memory, prompts), Evaluate (measure quality before shipping), Deploy (release with safety controls), and Observe (monitor production continuously and feed findings back into the next iteration).

How is the ADLC different from a traditional software development lifecycle?

Traditional SDLC ends at deployment: code ships, you monitor for crashes and fix bugs. The ADLC assumes the agent will degrade over time as user behavior and real-world conditions diverge from what you designed for. Observation isn't a phase that follows deployment -- it runs permanently alongside every other phase and generates the signal that drives future Intent cycles. The goal isn't to ship and maintain; it's to ship and continuously improve.

Why do so many AI agents fail in production despite working in demos?

Demo environments are curated: you control the inputs, the personas, the edge cases you test. Production is uncurated: users say unexpected things, combine tools in unexpected ways, and surface edge cases you never imagined. Agents that work in demos often fail in production because there's no systematic way to capture those production failures, convert them into test cases, and improve the agent based on what you learned. The ADLC provides that system.

What is offline evaluation in the context of ADLC?

Offline evaluation is the quality gate you run before deployment -- against a fixed dataset where you know the expected outputs. It tells you whether agent quality has regressed since the last version. Offline evals catch regressions in staging; they can't catch edge cases that don't exist in your test set yet. That's what production observation adds: a pipeline for converting new failure modes into future test cases.

What should I monitor in production for an AI agent?

Five categories matter most: tool call accuracy (did the agent invoke the right tools in the right order?), task completion rate (did the conversation end with the customer's issue resolved?), quality scores (how would a human reviewer rate the response on key dimensions?), escalation patterns (are escalations happening for the right reasons?), and conversation exit points (where do customers disengage?). Together these tell you whether your agent is working, not just running.

How do I close the feedback loop from production observation to agent improvement?

The loop has three steps: (1) surface failure modes through production monitoring -- conversations that ended in escalation, negative sentiment, or task failure; (2) convert those conversations into labeled test cases in your evaluation suite; and (3) fix the root cause in the agent's configuration, prompts, tools, or memory, then verify the fix passes the new test cases before re-deploying. This turns each production failure into a permanent regression test.

What is shadow mode deployment for AI agents?

Shadow mode runs the new agent version alongside the current production version without exposing users to it. Both versions process the same inputs, but only the production version's responses are shown. You compare the two versions' outputs, tool selections, and quality scores to validate the new version before switching traffic. It's the safest way to deploy major agent changes because it decouples deployment from activation.

How often should I run through the full ADLC cycle?

The cadence depends on how fast your agent's environment changes. For most CX agents, a two-week cycle works: one week of observation and test case creation from production failures, one week of build and evaluation fixes. High-velocity teams run weekly. The key is that observation never stops -- the Observe phase is continuous, even when you're not actively in a Build cycle.

The Agent Development Lifecycle: Ship, Observe, Improve

The demo worked perfectly. Smooth responses, correct tool calls, happy reviewers. You deployed to production feeling confident.

Six weeks later, a customer calls to complain. Then two more. You check the conversation logs and find patterns: the agent is mishandling a scenario it used to handle fine. The prompt hasn't changed. The tools haven't changed. The model hasn't changed. But something has shifted, and you don't have a clear picture of when it started or why.

This is the default trajectory for most agent deployments. Not a catastrophic failure -- a quiet drift. And the reason it happens isn't a technical problem. It's a process problem.

The Agent Development Lifecycle (ADLC) is the framework that closes this gap. It treats agent deployment not as a finish line but as the start of a continuous improvement loop. Here's how it works, why it's different from what most teams do, and what the implementation actually looks like.

What the ADLC Is (and Why It's Not Just Another Acronym)

The ADLC is a structured methodology for building and operating AI agents in production. It defines five phases -- Intent, Build, Evaluate, Deploy, Observe -- arranged in a continuous loop where the output of each cycle feeds the input of the next.

The critical difference from traditional software development: ADLC assumes agents don't reach a stable final state. User behavior changes. Edge cases accumulate. The real world diverges from the controlled environment you designed in. An agent that was excellent at launch degrades to mediocre over months not because you changed anything but because the world moved and the agent didn't.

Observation isn't a phase that follows deployment. It runs permanently, alongside every other phase, generating the signal that determines what needs to change in the next cycle. That continuous observation layer is what most teams skip -- and what separates agents that improve over time from agents that quietly decay.

Figure: The five phases of the ADLC, forming a continuous improvement loop.

Phase 1: Intent -- Define the Problem With Precision

Before building anything, you need a precise definition of what success looks like. Not "the agent should handle customer support" -- that's a direction, not a target. The Intent phase produces a specific, testable statement of what the agent should do and how you'll know it's doing it.

A useful Intent definition includes: the tasks the agent must complete, the tasks it must hand off, the quality level required for each, and the failure modes that are acceptable versus unacceptable. For a billing support agent: handle balance inquiries, payment processing, and plan changes autonomously; escalate disputes over $500 and all fraud reports; maintain a resolution rate above 80%; never misquote a balance.

The "Bet Register" concept from Arthur AI formalizes this: each improvement idea or new capability is a bet, with a hypothesis, a success metric, and a tracking mechanism. This matters because later in the cycle -- when you're reviewing observation data and deciding what to change -- you need something to measure against. Teams that skip the Intent phase end up with "the agent feels better" as their eval criteria, which isn't actionable.

You don't need a complex document. A one-page definition of tasks, quality thresholds, and unacceptable failure modes is enough. The goal is specificity you can test against.
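One way to keep that one-page definition testable is to write it down as data. The sketch below encodes the billing example as a typed object and checks an observed cycle against it. The `IntentSpec` shape, its field names, and the `checkAgainstIntent` helper are illustrative assumptions, not a Chanl or Arthur AI format.

```typescript
// Hypothetical shape for a one-page Intent definition; every name here is an assumption.
interface IntentSpec {
  agent: string;
  autonomousTasks: string[];      // tasks the agent must complete on its own
  escalationRules: string[];      // cases it must hand off to a human
  minResolutionRate: number;      // resolution rate must stay above this (0..1)
  unacceptableFailures: string[]; // failure modes that are never acceptable
}

const billingIntent: IntentSpec = {
  agent: "billing-support",
  autonomousTasks: ["balance inquiry", "payment processing", "plan change"],
  escalationRules: ["dispute over $500", "fraud report"],
  minResolutionRate: 0.8,
  unacceptableFailures: ["misquoted balance"],
};

interface ObservedCycle {
  resolutionRate: number;
  failureModes: string[];
}

// Lists every way an observed cycle violates the spec; an empty list means it was met.
function checkAgainstIntent(spec: IntentSpec, observed: ObservedCycle): string[] {
  const violations: string[] = [];
  // "Above 80%" is strict, so a rate exactly at the threshold counts as a miss.
  if (observed.resolutionRate <= spec.minResolutionRate) {
    violations.push(
      "resolution rate " + observed.resolutionRate + " is not above " + spec.minResolutionRate,
    );
  }
  for (const mode of observed.failureModes) {
    if (spec.unacceptableFailures.indexOf(mode) !== -1) {
      violations.push("unacceptable failure: " + mode);
    }
  }
  return violations;
}

// Example: a cycle that misses the resolution target and misquotes a balance.
const intentViolations = checkAgainstIntent(billingIntent, {
  resolutionRate: 0.76,
  failureModes: ["misquoted balance", "slow reply"],
});
```

Here `intentViolations` holds two entries, one for the resolution rate and one for the misquoted balance. "slow reply" is ignored because the spec does not list it as unacceptable.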

Phase 2: Build -- Configuration, Not Just Code

Building an agent well means instrumenting it from the start, designing your evaluation dataset before your prompts, and treating memory and tools as first-class configuration rather than afterthoughts. The basics -- writing a system prompt, connecting tools, testing a few example flows -- get done by everyone. What gets skipped is the scaffolding that makes the Observe phase functional later.

Instrument from day one. The most common reason observation fails in production is that the agent wasn't built to emit useful signals. Tool invocations, memory lookups, decision branches, confidence levels -- all of these need to be logged before you deploy, not after you discover you need them. Adding instrumentation retroactively is painful and often incomplete.

Instrumenting a tool call at build time·typescript

import Chanl from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY! })

// Assumed to exist elsewhere in your codebase: pulls the customer ID out of the raw message.
declare function extractCustomerId(input: string): string

async function handleCustomerRequest(conversationId: string, input: string) {
  // Wrap the tool call with tracking at build time, so every invocation is logged
  const result = await chanl.tools.trackedCall({
    conversationId,
    toolName: 'lookupCustomerAccount',
    input: { customerId: extractCustomerId(input) },
    onComplete: (output, latency) => {
      // This data feeds your Observe phase dashboards
      chanl.analytics.recordToolCall({
        conversationId,
        tool: 'lookupCustomerAccount',
        success: output.found,
        latencyMs: latency,
      })
    },
  })

  return result
}

Design your evaluation dataset before writing prompts. A common pattern: teams write prompts, then scramble to build evals. The better order is inverted. Sketch out 50-100 example conversations -- inputs and expected outputs -- before finalizing prompt design. Writing the eval first clarifies what you're actually trying to build and surfaces edge cases you'd otherwise discover in production.
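A minimal sketch of what an eval-first dataset might look like: each case pairs an input with the expected outcome, and a small check reports which conversation categories are still too thin before any prompt is finalized. The `EvalCase` shape, the category names, and the per-category minimum are assumptions made for illustration.

```typescript
// Hypothetical eval case format: an input conversation plus the outcome you expect.
interface EvalCase {
  id: string;
  category: string; // e.g. "billing", "fraud", "plan-change"
  userTurns: string[];
  expected: { taskCompleted: boolean; toolsUsed: string[]; escalated: boolean };
}

// Returns the required categories that have fewer than minPerCategory cases.
function underCoveredCategories(
  cases: EvalCase[],
  required: string[],
  minPerCategory: number,
): string[] {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    counts[c.category] = (counts[c.category] || 0) + 1;
  }
  return required.filter((category) => (counts[category] || 0) < minPerCategory);
}

// A first draft of the dataset, written before any prompt exists.
const draftCases: EvalCase[] = [
  {
    id: "billing-1",
    category: "billing",
    userTurns: ["What's my current balance?"],
    expected: { taskCompleted: true, toolsUsed: ["lookupCustomerAccount"], escalated: false },
  },
  {
    id: "billing-2",
    category: "billing",
    userTurns: ["I want to pay my bill."],
    expected: { taskCompleted: true, toolsUsed: ["lookupCustomerAccount"], escalated: false },
  },
  {
    id: "fraud-1",
    category: "fraud",
    userTurns: ["There's a charge I didn't make."],
    expected: { taskCompleted: false, toolsUsed: [], escalated: true },
  },
];

const coverageGaps = underCoveredCategories(draftCases, ["billing", "fraud", "plan-change"], 2);
```

With this draft, `coverageGaps` names "fraud" and "plan-change", which is the kind of gap that is cheaper to find here than in production.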

Memory and tools are configuration, not afterthoughts. An agent's persistent memory and tool access determine more of its behavior than the system prompt does. Which customer history does the agent retrieve? Which tools can it call without escalating? These decisions shape the agent's capabilities and constraints. Treat them with the same care as prompt design.

Phase 3: Evaluate Before You Ship

Quality measurement before deployment -- what the ADLC calls offline evaluation -- is the gate between Build and Deploy. Small configuration changes can move outcomes measurably: according to research, richer descriptions improve task success rates by 5.85 percentage points. Evaluation is how you find out whether your own changes produced an improvement or a regression before they reach customers.

An offline eval suite for a CX agent should cover:

Task completion: for a representative sample of conversation types, does the agent resolve the task?
Tool selection accuracy: does the agent invoke the right tools in the right order?
Boundary case handling: does it escalate appropriately for the cases it shouldn't handle autonomously?
Quality consistency: across 100 variations of the same request, does quality variance stay below a threshold?

The last one is often skipped. A single "does it work?" test doesn't tell you whether the agent is reliably good or occasionally good. Quality variance matters as much as average quality for production deployments.
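To make the consistency check concrete, here is a small sketch that scores the same request many times and requires both a minimum average and a maximum spread. The function names and the two thresholds are assumptions; the scores would come from whatever rubric your eval suite already uses.

```typescript
function mean(values: number[]): number {
  let total = 0;
  for (const v of values) total += v;
  return total / values.length;
}

// Population standard deviation of the scores.
function standardDeviation(values: number[]): number {
  const m = mean(values);
  let squared = 0;
  for (const v of values) squared += (v - m) * (v - m);
  return Math.sqrt(squared / values.length);
}

// An agent is "consistently good" only if the average is high AND the spread is low.
function isConsistent(scores: number[], minMean: number, maxStdDev: number): boolean {
  if (scores.length === 0) return false; // no evidence, no pass
  return mean(scores) >= minMean && standardDeviation(scores) <= maxStdDev;
}

// Two runs with the same average (0.9) but very different reliability.
const steadyScores = [0.9, 0.9, 0.9, 0.9];
const spikyScores = [1.0, 1.0, 1.0, 0.6];

const steadyPasses = isConsistent(steadyScores, 0.85, 0.1);
const spikyPasses = isConsistent(spikyScores, 0.85, 0.1);
```

Both runs average 0.9, but only the steady one passes: the spiky run has a standard deviation of about 0.17, which a check on the average alone would never reveal.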

Deploy Gate: pre-deploy quality checks

Score: 92% (threshold: > 80%), pass
Latency: 234ms (threshold: < 500ms), pass
Error rate: 3.1% (threshold: < 2%), fail

Result: deploy blocked

Running evals as a CI gate -- blocking deploys when quality regresses below a threshold -- is the implementation detail that makes this real rather than aspirational. Without a CI gate, eval becomes optional. With one, quality checks are part of every deployment, not an afterthought.
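The threshold logic behind such a gate is small. The sketch below uses the same three checks and figures as the deploy gate shown above; the `GateCheck` type and `evaluateGate` function are hypothetical names, not part of any particular CI tool.

```typescript
interface GateCheck {
  name: string;
  actual: number;
  threshold: number;
  mustBe: "above" | "below"; // which side of the threshold counts as passing
}

interface GateResult {
  passed: boolean;
  failures: string[]; // names of the checks that blocked the deploy
}

function evaluateGate(checks: GateCheck[]): GateResult {
  const failures: string[] = [];
  for (const check of checks) {
    const ok =
      check.mustBe === "above"
        ? check.actual > check.threshold
        : check.actual < check.threshold;
    if (!ok) failures.push(check.name);
  }
  return { passed: failures.length === 0, failures };
}

const gateResult = evaluateGate([
  { name: "Score (%)", actual: 92, threshold: 80, mustBe: "above" },
  { name: "Latency (ms)", actual: 234, threshold: 500, mustBe: "below" },
  { name: "Error rate (%)", actual: 3.1, threshold: 2, mustBe: "below" },
]);
```

Two of the three checks pass, but the error rate does not, so `gateResult.passed` is false. In a pipeline, the step running this check would exit with a non-zero status to block the deploy.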

Example CI gate in a deployment pipeline·yaml

# .github/workflows/agent-eval.yml
- name: Run agent quality eval
  run: |
    npx chanl eval run \
      --suite ./evals/billing-agent.json \
      --agent-id $AGENT_ID \
      --min-task-completion 0.85 \
      --min-tool-accuracy 0.90 \
      --fail-on-regression

An eval suite that's passing doesn't mean your agent is production-ready -- it means it's as good as your test cases require. The gap between your test suite and the real world is what the Observe phase closes over time.

Phase 4: Deploy With a Safety Net

Deployment for agents isn't a binary switch. The safest pattern is a progression: shadow mode, then canary, then full traffic.

Shadow mode runs the new agent version alongside the production version on the same inputs, without showing its responses to users. You compare outputs, tool selections, and quality scores to validate the new version before it touches any real conversations. This is the only way to test a major agent change against real production inputs without risking production conversations, and it's especially valuable for changes to prompts, tools, or models where the behavioral change is hard to predict.

The shadow deployment pattern is worth reading in full if you're implementing this for the first time. The key is that shadow mode isn't optional for consequential changes -- it's the default.
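A minimal sketch of the shadow comparison, assuming each agent version can be called as a function that returns its reply, the tools it used, and a quality score. The agents are synchronous here for brevity (real ones would be async), and all type and function names are hypothetical.

```typescript
interface AgentResult {
  reply: string;
  toolsUsed: string[];
  qualityScore: number; // 0..1, from whatever scoring you already run
}

type Agent = (input: string) => AgentResult;

interface ShadowRun {
  reply: string; // always the production reply
  comparison: { sameTools: boolean; scoreDelta: number } | null;
  candidateError: string | null;
}

function runShadow(input: string, production: Agent, candidate: Agent): ShadowRun {
  // The user only ever sees what production returns.
  const served = production(input);
  try {
    const shadow = candidate(input);
    return {
      reply: served.reply,
      comparison: {
        sameTools: served.toolsUsed.join(",") === shadow.toolsUsed.join(","),
        scoreDelta: shadow.qualityScore - served.qualityScore,
      },
      candidateError: null,
    };
  } catch (error) {
    // A crashing candidate is recorded for review but never affects the user.
    return { reply: served.reply, comparison: null, candidateError: String(error) };
  }
}

// Stub agents standing in for the two versions.
const productionAgent: Agent = () => ({
  reply: "Your balance is $42.",
  toolsUsed: ["lookupCustomerAccount"],
  qualityScore: 0.8,
});
const candidateAgent: Agent = () => ({
  reply: "After verifying your identity: your balance is $42.",
  toolsUsed: ["verifyIdentity", "lookupCustomerAccount"],
  qualityScore: 0.9,
});

const shadowRun = runShadow("What's my balance?", productionAgent, candidateAgent);
```

The design point is the `try`/`catch`: the candidate can be slower, wrong, or broken, and the served reply is unaffected. Only the comparison record changes.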

Canary deployment routes a small percentage of production traffic (5-10%) to the new version. This exposes real user variability -- accents, unusual phrasing, edge cases your test suite didn't cover -- in a controlled way. If quality metrics hold at canary level, roll to full traffic. If they don't, roll back before the damage scales.
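One common way to implement the split is to hash the conversation ID into a bucket, so a given conversation always lands on the same version. The sketch below assumes that approach; the hash function and names are illustrative and not tied to any particular router.

```typescript
// Maps a conversation ID to a stable bucket in the range 0..99.
function bucketFor(conversationId: string): number {
  let hash = 0;
  for (let i = 0; i < conversationId.length; i++) {
    // Simple rolling hash, kept in unsigned 32-bit range.
    hash = (hash * 31 + conversationId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// Routes roughly canaryPercent% of conversations to the new version.
function routeVersion(conversationId: string, canaryPercent: number): "canary" | "stable" {
  return bucketFor(conversationId) < canaryPercent ? "canary" : "stable";
}

// The same conversation always gets the same answer at a given percentage.
const firstRoute = routeVersion("conv-1842", 10);
const secondRoute = routeVersion("conv-1842", 10);
```

Routing on a stable key rather than a random draw keeps a conversation from switching versions between turns, and rolling back is just setting the percentage to 0.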

Define your rollback triggers before deploying, not after you notice problems. "If task completion rate drops more than 5 points from baseline in a 4-hour window, roll back automatically" is a trigger. "We'll watch it and decide" is not.
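Written as code, the trigger quoted above might look like the following sketch. The `RollbackPolicy` fields are assumptions, and `minSamples` is an addition not mentioned in the text, included so that a handful of conversations cannot trigger a rollback on their own.

```typescript
interface Outcome {
  timestampMs: number;
  completed: boolean;
}

interface RollbackPolicy {
  baselineRate: number;  // completion rate of the current production version, 0..1
  maxDropPoints: number; // allowed drop in percentage points, e.g. 5
  windowMs: number;      // how far back to look, e.g. 4 hours
  minSamples: number;    // assumption: ignore windows with too little data
}

function shouldRollback(outcomes: Outcome[], nowMs: number, policy: RollbackPolicy): boolean {
  const recent = outcomes.filter(
    (o) => o.timestampMs > nowMs - policy.windowMs && o.timestampMs <= nowMs,
  );
  if (recent.length < policy.minSamples) return false;
  const completed = recent.filter((o) => o.completed).length;
  const dropPoints = (policy.baselineRate - completed / recent.length) * 100;
  return dropPoints > policy.maxDropPoints;
}

// "Drops more than 5 points from baseline in a 4-hour window."
const fourHourPolicy: RollbackPolicy = {
  baselineRate: 0.85,
  maxDropPoints: 5,
  windowMs: 4 * 60 * 60 * 1000,
  minSamples: 20,
};
```

A scheduled job could call `shouldRollback` every few minutes with the canary's recent outcomes and switch traffic back when it returns true.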

Phase 5: Observe Continuously

Observation is the phase most teams treat as optional. It's the most important one.

A well-monitored agent tells you what's actually happening in production: which conversations succeed, which ones fail, which tool call patterns signal trouble before users do. A poorly monitored agent generates alerts when it crashes and silence when it quietly degrades.

Five things to track in production monitoring:

Task completion rate: What percentage of conversations end with the customer's issue resolved? This is the top-level health metric. Track it by conversation type (billing, shipping, account) to catch category-specific regressions.

Tool call patterns: Are tools being called in the expected order? Are any tools being called unusually frequently or infrequently? Unexpected patterns are early signals of behavioral drift -- often visible in tool call data days before they show up in customer satisfaction metrics.

Quality scores at scale: Manual review of every conversation isn't possible at production volume. AI-powered scoring against a rubric gives you quality signal across 100% of conversations, not a sampled subset. Run scores on key dimensions: resolution accuracy, response appropriateness, escalation judgment.

Escalation analysis: Are escalations happening for the right reasons? High escalation rates on topics the agent should handle autonomously indicate a capability gap. Low escalation rates on topics where escalation is required indicate the agent is overstepping. Both are problems.

Conversation exit points: Where do customers disengage? If customers are dropping off mid-conversation on a particular intent type, that's a UX or quality problem localized to that flow.

The signal from these five areas isn't just for reporting -- it directly feeds Phase 1 (Intent) of the next cycle.
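As an illustration of two of these signals, the sketch below computes task completion rate per conversation type and counts escalations that went the wrong way in either direction. The `ConversationRecord` shape is an assumption about what your logging captures, in particular the `escalationRequired` label, which would come from review or a rubric.

```typescript
interface ConversationRecord {
  type: string;                // e.g. "billing", "shipping", "account"
  resolved: boolean;           // did the conversation end with the issue resolved?
  escalated: boolean;          // did the agent hand off?
  escalationRequired: boolean; // should it have handed off? (labelled in review)
}

function completionRateByType(records: ConversationRecord[]): Record<string, number> {
  const totals: Record<string, { resolved: number; all: number }> = {};
  for (const r of records) {
    const t = totals[r.type] || (totals[r.type] = { resolved: 0, all: 0 });
    t.all += 1;
    if (r.resolved) t.resolved += 1;
  }
  const rates: Record<string, number> = {};
  for (const type of Object.keys(totals)) {
    const t = totals[type];
    if (t) rates[type] = t.resolved / t.all;
  }
  return rates;
}

// Counts escalations that should not have happened and ones that were missed.
function escalationMismatches(records: ConversationRecord[]): { unnecessary: number; missed: number } {
  let unnecessary = 0;
  let missed = 0;
  for (const r of records) {
    if (r.escalated && !r.escalationRequired) unnecessary += 1;
    if (!r.escalated && r.escalationRequired) missed += 1;
  }
  return { unnecessary, missed };
}

const sampleRecords: ConversationRecord[] = [
  { type: "billing", resolved: true, escalated: false, escalationRequired: false },
  { type: "billing", resolved: false, escalated: true, escalationRequired: false },
  { type: "billing", resolved: true, escalated: false, escalationRequired: false },
  { type: "billing", resolved: false, escalated: false, escalationRequired: true },
  { type: "shipping", resolved: true, escalated: false, escalationRequired: false },
];

const ratesByType = completionRateByType(sampleRecords);
const mismatches = escalationMismatches(sampleRecords);
```

In this sample the overall completion rate hides that billing is at 50% while shipping is at 100%, and the escalation counts separate a capability gap (one unnecessary handoff) from overstepping (one missed handoff).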

Closing the Feedback Loop: Production Failures as Test Cases

Every production failure is a test case you didn't have before. Converting those failures into eval suite entries is what turns the ADLC into a compounding flywheel: each cycle makes the next evaluation tighter, each deployed fix reduces a failure mode permanently, and the gap between your test suite and the real world narrows over time. The system that turns observations into improvements has three steps:

Step 1: Surface failure modes. From your monitoring data, identify conversations that ended in unexpected escalation, customer dissatisfaction, task failure, or anomalous tool usage. These are your candidates for investigation. You don't need to review all of them -- prioritize by frequency and severity.
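Prioritizing by frequency and severity can be as simple as the sketch below, which groups flagged conversations by failure mode and ranks them by count times worst severity. The scoring formula, the 1-to-3 severity scale, and all names are assumptions; any ranking that reflects both factors would serve.

```typescript
interface FlaggedConversation {
  conversationId: string;
  failureMode: string; // label assigned during review or by automated scoring
  severity: number;    // assumed scale: 1 (minor) to 3 (unacceptable)
}

interface RankedFailureMode {
  failureMode: string;
  count: number;
  maxSeverity: number;
  priority: number; // count * maxSeverity
}

function rankFailureModes(flagged: FlaggedConversation[]): RankedFailureMode[] {
  const byMode: Record<string, RankedFailureMode> = {};
  for (const f of flagged) {
    const entry =
      byMode[f.failureMode] ||
      (byMode[f.failureMode] = { failureMode: f.failureMode, count: 0, maxSeverity: 0, priority: 0 });
    entry.count += 1;
    entry.maxSeverity = Math.max(entry.maxSeverity, f.severity);
    entry.priority = entry.count * entry.maxSeverity;
  }
  const ranked: RankedFailureMode[] = [];
  for (const key of Object.keys(byMode)) {
    const entry = byMode[key];
    if (entry) ranked.push(entry);
  }
  // Highest priority first.
  return ranked.sort((a, b) => b.priority - a.priority);
}

const rankedModes = rankFailureModes([
  { conversationId: "c1", failureMode: "wrong-escalation", severity: 1 },
  { conversationId: "c2", failureMode: "wrong-escalation", severity: 1 },
  { conversationId: "c3", failureMode: "wrong-escalation", severity: 1 },
  { conversationId: "c4", failureMode: "misquoted-balance", severity: 3 },
  { conversationId: "c5", failureMode: "misquoted-balance", severity: 3 },
  { conversationId: "c6", failureMode: "slow-lookup", severity: 2 },
]);
```

Here the misquoted balance ranks first even though it is less frequent than the wrong escalation, because its severity outweighs the difference in count.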

Step 2: Convert failures to labeled test cases. For each failure mode you want to prevent, create a test case in your eval suite: the input that triggered it, the output that was wrong, and the expected correct output. This is the ground truth annotation pipeline -- the ongoing work that keeps your eval suite current as the world changes.

Creating a test case from a production failure·typescript

import Chanl from '@chanl/sdk'
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY! })
 
// When a conversation is flagged in production review
async function addRegressionTest(conversationId: string) {
  const conversation = await chanl.calls.get(conversationId)
  
  // Create a test case from the failure
  await chanl.scenarios.createCase({
    suiteId: process.env.EVAL_SUITE_ID!,
    input: conversation.userTurns,
    expectedOutcome: {
      taskCompleted: true,
      toolsUsed: ['verifyIdentity', 'lookupCustomerContext'],
      escalated: false,
    },
    source: 'production-failure',
    sourceConversationId: conversationId,
    notes: 'Agent incorrectly escalated billing inquiry that should have been handled autonomously',
  })
}

Step 3: Fix and verify. Change the agent configuration, prompt, or tools to address the root cause. Re-run the eval suite -- which now includes the new test case -- and verify the fix passes before deploying. The new test case stays in the suite permanently, so the same failure mode can't recur silently.

This loop compounds over time. An agent with 50 test cases at launch can have 500 within six months of active monitoring -- each one encoding a real failure mode that can never bite you again. The eval suite becomes a record of everything the agent has learned to handle correctly.

The goal, as described in the online vs offline evals framework, is to keep both running simultaneously: offline evals to catch regressions before deploy, online evals (or monitoring-derived evals) to surface what's going wrong right now.

What Most Teams Get Wrong

The most common failure pattern isn't skipping phases -- it's treating the ADLC as linear rather than circular.

Teams complete Intent, Build, Evaluate, Deploy, and then... move on to the next feature. Observation becomes a checkbox ("we set up dashboards") rather than an active signal generator. The feedback loop never closes. The eval suite from launch stays static for six months. The agent improves on features the team actively builds while quietly degrading on everything else.

The fix is structural: assign someone to own the Observe phase. Not to react to outages -- to actively review conversation data weekly, create test cases from what they find, and feed the results into the next Build cycle. At smaller teams this is part of the engineering role. At larger teams it becomes a dedicated function.

Cisco's February 2026 acquisition of Galileo -- an agent observability platform -- is a signal of where enterprise investment is going. AI agent observability is maturing from a developer tool into infrastructure, the same trajectory that application monitoring followed a decade ago. The teams that build the observation discipline now will have a significant lead when it becomes table stakes.

The Observe Phase Connects to Everything Else

The ADLC isn't a waterfall with observation tacked on at the end. It's a loop where monitoring data actively drives what you build next. A well-run Observe phase:

Generates the failure cases that improve your eval suite (closing the Build feedback loop)
Surfaces the new intent types that users bring to your agent (feeding the next Intent phase)
Validates that the changes you deployed in the last cycle actually worked (completing the Deploy verification)

Chanl's platform is designed around this loop -- analytics and monitoring surface production signal, scorecards apply quality measurement at scale, and scenarios turn that signal into eval cases for the next build cycle. The connection between Observe and Build is what makes each iteration faster and more targeted than the last.

The teams winning with AI agents in 2026 aren't the ones who shipped fastest. They're the ones who built the loop. Shipping is genuinely the easy part.

Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
