
How to Write an Agent Spec Before You Write the Prompt

Inconsistent agent behavior isn't a prompt problem. It's a missing-spec problem. Here's the seven-section document that fixes it before code.

Dean Grover, Co-founder
May 20, 2026
16 min read
Figure: Structured agent specification document with capability and constraint sections next to a chat interface

A credit union deployed a returns-handling agent. It worked. Customers completed returns faster, escalation rate dropped 40%, and CSAT ticked up eight points.

Two months later, a compliance audit found the agent had offered full refunds on non-refundable promotional items in 47 conversations -- about 3.8% of the time. Often enough to cost $14,000 in unauthorized credits and generate two formal complaints.

The fix took 18 minutes of prompt editing. The discovery took 61 days.

Nobody had written down that promotional items are non-refundable. Compliance had a policy. Engineering had a prompt. The two never shared a document. So the prompt said "help customers with returns," and the agent helped -- a little too broadly.


What an Agent Specification Is (and What It Isn't)

An agent spec is a structured document that captures intent before code. It defines who the agent is, what it can and can't do, when it escalates, how it handles uncertainty, and what counts as a good conversation. It's the contract three teams (product, engineering, compliance) sign before the first prompt gets written.

It's not the system prompt. The system prompt is what the LLM sees at runtime. The spec is what your team sees when they need to agree on intent, evaluate behavior, and onboard the next engineer. The spec produces the system prompt. It also produces your test suite. It also produces your scorecard.

Most teams build these three separately, often with a different person owning each. That's why the system prompt says one thing, the tests verify a different thing, and the scorecard grades a third thing -- and all three drift independently.

The spec is the shared source of truth that holds them together.

A spec isn't a design doc or a PRD. It's tighter -- structured, opinionated, written to resolve ambiguity. When a team member looks at a customer interaction and asks "should the agent have done that?", the spec should provide a clear answer. If it doesn't, the spec is incomplete.


The 7 Sections of a Good Agent Spec

Not every agent needs all seven sections at full length. A simple FAQ agent might need a one-paragraph capability section and a two-paragraph constraint section. A returns agent for a financial services company needs all seven, in detail.


    Here's what each section does.

    1. Role and Persona

    This defines who the agent is from the customer's perspective. It's not a character description -- it's a behavioral contract. What's the agent's name? What's its job? What's its communication style, and under what circumstances does that style shift?

    Persona isn't about making the agent friendly. It's about making it predictable. A persona definition answers: "If a customer is rude, does the agent apologize, mirror their tone, or stay neutral?" If your team has to guess, the spec didn't do its job.

    A good persona section:

    • Names the role (Customer Service Agent, Returns Specialist, Billing Assistant)
    • Defines the tone range (formal, neutral, warm) and when each applies
    • Specifies what the agent calls itself and whether it discloses that it's AI
    • Describes how the agent handles hostility, confusion, or distress
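
One way to keep the persona predictable is to write the tone rules as data rather than prose. The sketch below is illustrative only: the `PersonaSpec` shape, the customer signal names, and the example values are assumptions made for this example, not part of any particular framework.

```typescript
// Hypothetical persona contract; all names and values are illustrative.
type Tone = 'formal' | 'neutral' | 'warm';
type CustomerSignal = 'hostile' | 'confused' | 'distressed' | 'none';

interface PersonaSpec {
  roleName: string;      // e.g. "Returns Specialist"
  selfReference: string; // what the agent calls itself
  disclosesAi: boolean;  // whether it tells the customer it is an AI
  defaultTone: Tone;
  toneWhen: Partial<Record<CustomerSignal, Tone>>; // when the tone shifts
}

const returnsPersona: PersonaSpec = {
  roleName: 'Returns Specialist',
  selfReference: 'the returns assistant',
  disclosesAi: true,
  defaultTone: 'neutral',
  // A rude customer gets a neutral tone: no mirroring, no over-apologizing.
  toneWhen: { hostile: 'neutral', confused: 'warm', distressed: 'warm' },
};

// Resolve the tone for a detected signal, falling back to the default tone.
function toneFor(persona: PersonaSpec, signal: CustomerSignal): Tone {
  return persona.toneWhen[signal] ?? persona.defaultTone;
}
```

With the rules written this way, "what does the agent do when a customer is rude?" has exactly one answer that anyone can look up.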

    2. Capability Inventory

    This is the list of everything the agent is authorized to do. Not everything it technically could do -- everything it's permitted to do. The distinction matters.

    An agent connected to your CRM can technically look up any customer's full history, access sales rep notes, and pull internal ticket comments. That doesn't mean it should do all of that during a returns conversation.

    Structure the capability inventory by action type:

    • Information retrieval: what data sources can it query, and what fields can it surface?
    • Transactions: what actions can it take that modify state (issue refunds, update addresses, open tickets)?
    • Communication: can it send confirmation emails? Open follow-up tickets? Transfer to another system?
    • Recommendations: what can it suggest, and what's the basis for those suggestions?

    For each capability, note any limits. "Issue refunds up to $50 without escalation. Above $50, escalate." That's a capability with a constraint baked in. Put it in the capability section so the limit stays attached to the action.
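
A limit that stays attached to its action can be sketched as a single capability record plus one decision function. The type and function names here are hypothetical; only the $50 threshold comes from the example above.

```typescript
// Illustrative capability entry; field names are assumptions for this sketch.
interface RefundCapability {
  action: 'issue_refund';
  autoApproveLimitUsd: number; // refunds at or below this need no escalation
}

type RefundDecision = 'allow' | 'escalate';

const refundCapability: RefundCapability = {
  action: 'issue_refund',
  autoApproveLimitUsd: 50,
};

// "Issue refunds up to $50 without escalation. Above $50, escalate."
function decideRefund(cap: RefundCapability, amountUsd: number): RefundDecision {
  if (!Number.isFinite(amountUsd) || amountUsd <= 0) {
    throw new RangeError('Refund amount must be a positive number');
  }
  return amountUsd <= cap.autoApproveLimitUsd ? 'allow' : 'escalate';
}
```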

    3. Constraint Inventory

    This is the most important section, and the one most teams write last and least carefully.

    Constraints are behaviors the agent must never perform, regardless of how the user asks, what tools return, or what the system prompt might technically permit. They're the non-negotiables.

    Write them as prohibitions, not policies. "The agent should try to avoid offering refunds on promotional items" is a policy. It hedges. "The agent must not offer refunds on items tagged PROMO in the order system" is a constraint. It's testable.

    Good constraints are specific enough to write a test case for. If you can't write a test case for a constraint, it's too vague.

    For each constraint, explain the why. Constraints without rationale get edited away when someone can't figure out why they exist. "Must not discuss competitor pricing. Reason: compliance requirement per legal review 2026-03-01" survives. "Don't mention competitors" doesn't.

    Common constraint categories:

    • Financial limits: maximum refund amounts, discount authorization levels
    • Data access limits: what records the agent can and cannot surface
    • Topic guardrails: subjects the agent must not engage with
    • Legal and compliance: required disclosures, prohibited advice categories
    • Scope limits: questions outside the agent's domain that must route to a human
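
To show what "specific enough to write a test case for" can look like, here is a minimal sketch that expresses the PROMO constraint as a predicate over a proposed agent action. The `OrderItem`, `ProposedAction`, and `Constraint` shapes are assumptions for illustration; a real order system will expose different fields.

```typescript
// Hypothetical shapes; a real order system will differ.
interface OrderItem {
  sku: string;
  tags: string[];
}

interface ProposedAction {
  type: 'offer_refund' | 'offer_exchange' | 'explain_policy';
  item: OrderItem;
}

interface Constraint {
  id: string;
  prohibition: string;
  reason: string; // the "why" travels with the rule
  violatedBy: (action: ProposedAction) => boolean;
}

const promoNonRefundable: Constraint = {
  id: 'promo-items-non-refundable',
  prohibition: 'Must not offer refunds on items tagged PROMO in the order system',
  reason: 'Promotional items are non-refundable under compliance policy',
  violatedBy: (a) => a.type === 'offer_refund' && a.item.tags.includes('PROMO'),
};

// Return the id of every constraint that a proposed action would violate.
function violatedConstraintIds(constraints: Constraint[], action: ProposedAction): string[] {
  return constraints.filter((c) => c.violatedBy(action)).map((c) => c.id);
}
```

A hedged policy ("try to avoid") cannot be written as a predicate like `violatedBy`, which is a quick way to tell the two apart.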

    4. Escalation Logic

    Escalation is the agent's decision to stop handling something itself and hand it to a human or a different system. Most teams define escalation reactively -- they add paths after a failure they didn't anticipate. The spec should define it proactively.

    Escalation logic has three parts: the trigger, the handoff target, and the handoff content.

    Triggers are conditions that force escalation. Some are hard (always escalate when X happens). Some are soft (escalate when confidence is low). Both belong in the spec.

    Hard triggers for a returns agent might include: requests from customers with active fraud flags, refund amounts over $500, customers who've requested escalation three times in the past 30 days.

    Soft triggers might include: when the customer uses language suggesting legal action, when the agent has failed to resolve the same issue twice in the same conversation, when the agent can't identify the order in the system.

    The handoff target is where control goes. Human agent? Specific team? Callback flow? Different agent specialized in disputes? The spec should name the endpoint.

    Handoff content is what the agent passes to the next handler. "Here's the customer and their request" isn't enough. The spec should define the structured summary: order ID, issue type, actions taken, customer sentiment, and reason for escalation. That summary becomes the data structure in your code.
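
The hard triggers and the handoff summary described above might translate into code roughly as follows. This is a sketch under assumptions: the context fields, trigger labels, and function names are invented for illustration, while the thresholds and summary fields come from the examples in this section.

```typescript
// Hypothetical conversation context; field names are assumptions.
interface EscalationContext {
  hasActiveFraudFlag: boolean;
  refundAmountUsd: number;
  escalationRequestsLast30Days: number;
}

// Structured handoff summary with the fields listed in the spec section.
interface HandoffSummary {
  orderId: string;
  issueType: string;
  actionsTaken: string[];
  customerSentiment: 'negative' | 'neutral' | 'positive';
  escalationReason: string;
}

// Return the first hard trigger that fires, or null if none does.
function hardTrigger(ctx: EscalationContext): string | null {
  if (ctx.hasActiveFraudFlag) return 'active_fraud_flag';
  if (ctx.refundAmountUsd > 500) return 'refund_over_500';
  if (ctx.escalationRequestsLast30Days >= 3) return 'repeat_escalation_requests';
  return null;
}

// Build the handoff payload only when a hard trigger fires.
function buildHandoff(
  ctx: EscalationContext,
  details: Omit<HandoffSummary, 'escalationReason'>,
): HandoffSummary | null {
  const reason = hardTrigger(ctx);
  return reason === null ? null : { ...details, escalationReason: reason };
}
```

Soft triggers are left out on purpose: they depend on judgments such as model confidence or detected legal language, which need their own definitions in the spec.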

    5. Uncertainty Handling

    How does the agent behave when it doesn't know something? This section defines that.

    Agents fail in two directions: they hallucinate (say something false with confidence), or they abandon (say "I don't know" and stop). Both are failures of uncertainty handling.

    The spec should define the uncertainty response for three cases:

    Within-scope uncertainty: the question is on the agent's topic, but the information isn't available. ("I can see you have order 4821, but I'm not able to see the item-level details. I can look up the order total, or escalate to someone who can pull the full receipt.")

    Scope-edge uncertainty: the question is adjacent to the agent's topic but not clearly within it. ("Shipping timelines for expedited delivery are outside what I handle -- the shipping team can give you exact windows. Want me to connect you?")

    Out-of-scope uncertainty: the question is clearly outside the agent's domain. ("That's a question about your credit card interest rate, which is handled by a different team. I can transfer you, or you can call the number on the back of your card.")

    Each of these has a different handling pattern. The spec defines them. The system prompt implements them. The scorecard checks them.
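
The three cases can be captured as a lookup from scope to handling pattern, as in this sketch. The topic list, the pattern wording, and the option labels are illustrative assumptions; how a real agent classifies a question into a scope is outside what this example shows.

```typescript
type Scope = 'in_scope' | 'scope_edge' | 'out_of_scope';

interface UncertaintyPattern {
  say: string;     // what the agent communicates
  offer: string[]; // next steps it may propose
}

const uncertaintyPatterns: Record<Scope, UncertaintyPattern> = {
  in_scope: {
    say: 'State what is visible and what is not',
    offer: ['partial_answer', 'escalate'],
  },
  scope_edge: {
    say: 'Name the team that owns the answer',
    offer: ['connect_to_team'],
  },
  out_of_scope: {
    say: 'Explain that a different team handles the topic',
    offer: ['transfer', 'self_service_contact'],
  },
};

// Illustrative topic map for a returns agent; anything unlisted is out of scope.
const topicScope = new Map<string, Scope>([
  ['returns', 'in_scope'],
  ['refunds', 'in_scope'],
  ['shipping', 'scope_edge'],
]);

function scopeOf(topic: string): Scope {
  return topicScope.get(topic) ?? 'out_of_scope';
}

function patternFor(topic: string): UncertaintyPattern {
  return uncertaintyPatterns[scopeOf(topic)];
}
```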

    6. Evaluation Criteria

    This section answers: what does a good conversation look like? What does a failed one look like?

    Write the evaluation criteria before you write the scoring rubric, before you write the eval harness, before you run your first evaluation. They're the human-readable definition that everything else derives from.

    Good evaluation criteria are multi-dimensional. A single "quality score" collapses too much. You want separate criteria for:

    • Accuracy: did the agent correctly understand the customer's request and resolve it correctly?
    • Compliance: did the agent follow all constraints (no unauthorized refunds, required disclosures, correct escalation)?
    • Tone: was the response appropriate for the customer's emotional state and the brand's communication style?
    • Completeness: did the agent provide everything the customer needed to move forward?
    • Efficiency: did the agent resolve the issue without unnecessary back-and-forth?

    For each dimension, write a description of what "good," "acceptable," and "failing" looks like. Two sentences per level is enough. These descriptions become your scorecard rubric.
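
A rubric that keeps the dimensions separate can be sketched as one grade per dimension plus a summary that reports which dimensions failed instead of averaging them. The pass rule used here (no dimension may be failing) is an assumption for illustration; your spec might weight compliance differently.

```typescript
type Dimension = 'accuracy' | 'compliance' | 'tone' | 'completeness' | 'efficiency';
type Level = 'good' | 'acceptable' | 'failing';
type Grades = Record<Dimension, Level>;

interface GradeSummary {
  passed: boolean;
  failingDimensions: Dimension[];
}

// Assumed rule: a conversation passes only if no dimension is graded "failing".
// The failing dimensions are reported by name so nothing collapses into one score.
function summarize(grades: Grades): GradeSummary {
  const failingDimensions = (Object.keys(grades) as Dimension[]).filter(
    (d) => grades[d] === 'failing',
  );
  return { passed: failingDimensions.length === 0, failingDimensions };
}
```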

    7. Failure Modes

    What happens when something goes wrong that the agent can't handle gracefully? When a tool times out. When the model returns a refusal. When the customer's data isn't in the system.

    Failure modes define the degradation path. The spec should list the three to five most likely failures and define the expected agent behavior for each.

    Don't write these as error-handling instructions for engineers. Write them as observable behaviors for QA:

    • "When the order lookup tool returns an error, the agent should apologize briefly and offer to look up the order manually or escalate."
    • "When the agent produces a response that doesn't resolve the customer's stated need after two turns, it should proactively offer escalation."
    • "When the CRM tool is unavailable for more than 10 seconds, the agent should tell the customer there's a system delay and offer a callback."

    These become failure-mode test cases. Any agent that handles all its happy paths but fails on these failure modes isn't production-ready.
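
The three example behaviors above can be written as a mapping from an observed failure to the behavior QA should expect. This is a sketch: the `Failure` union and the behavior labels are hypothetical names, and only the two-turn and 10-second thresholds come from the examples.

```typescript
// Hypothetical failure events an agent runtime might report.
type Failure =
  | { kind: 'tool_error'; tool: string }
  | { kind: 'tool_timeout'; tool: string; elapsedMs: number }
  | { kind: 'unresolved_turns'; turns: number };

type ExpectedBehavior =
  | 'apologize_and_offer_manual_lookup_or_escalation'
  | 'announce_delay_and_offer_callback'
  | 'offer_escalation'
  | 'continue';

// Map each failure to the observable behavior the spec requires.
function expectedBehavior(failure: Failure): ExpectedBehavior {
  switch (failure.kind) {
    case 'tool_error':
      return 'apologize_and_offer_manual_lookup_or_escalation';
    case 'tool_timeout':
      // Unavailable for more than 10 seconds: tell the customer, offer a callback.
      return failure.elapsedMs > 10000 ? 'announce_delay_and_offer_callback' : 'continue';
    case 'unresolved_turns':
      // Need still unresolved after two turns: proactively offer escalation.
      return failure.turns >= 2 ? 'offer_escalation' : 'continue';
  }
}
```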


    Who Writes the Spec

    The first draft belongs to whoever knows the customer journey best. It then goes through three reviews: engineering (feasibility), compliance (legal), and QA (testability). Skip any of these and the spec encodes one perspective at the expense of the other two.

    That first author is usually a product manager, an ops lead, or someone from the team that currently handles the workflow manually.

    Engineers can't write the first draft. They'll optimize it for implementation and miss the business intent. Compliance can't write the first draft. They'll make it complete but unusable. Start with the domain expert who can articulate what a senior human agent does in these conversations.

    Then run it through three reviews:

    Engineering review: is this technically feasible? Are there capabilities you can't build yet? Are there constraints that would require infrastructure you don't have?

    Compliance review: are all legal requirements captured? Are there prohibited behaviors not listed? Do the required disclosures match current policy?

    QA review: is every constraint testable? Is every evaluation criterion specific enough to grade against? Are the uncertainty handling responses realistic?

    The spec is done when all three reviews are green and the team can read it, point to any production conversation, and say "that conversation either meets the spec or violates it." If there are gray-area conversations the spec doesn't resolve, the spec isn't done.


    The Spec-to-System Prompt Translation

    Once the spec is complete, translating it to a system prompt is mostly mechanical. Here's how the sections map:

    agent-spec.ts·typescript
    interface AgentSpec {
      role: {
        name: string;
        persona: string;
        toneRange: 'formal' | 'neutral' | 'warm';
        aiDisclosure: boolean;
      };
      capabilities: Array<{
        category: string;
        action: string;
        limits?: string;
      }>;
      constraints: Array<{
        prohibition: string;
        reason: string;
        testCase: string; // required -- if you can't write this, the constraint is too vague
      }>;
      escalation: {
        hardTriggers: string[];
        softTriggers: string[];
        handoffTarget: string;
        handoffPayload: Record<string, string>;
      };
      uncertaintyHandling: {
        inScope: string;
        scopeEdge: string;
        outOfScope: string;
      };
      evaluationCriteria: Record<
        'accuracy' | 'compliance' | 'tone' | 'completeness' | 'efficiency',
        { good: string; acceptable: string; failing: string }
      >;
      failureModes: Array<{
        condition: string;
        expectedBehavior: string;
      }>;
    }

    The system prompt becomes: persona section plus capability grants plus hard constraints plus escalation triggers plus uncertainty patterns. The evaluation criteria become your scorecard. The constraints become your red-line test cases.
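
To make the "mostly mechanical" translation concrete, here is a minimal sketch of a renderer. It redeclares a trimmed subset of the spec so the example stands alone; the `PromptSpec` and `renderSystemPrompt` names and the section wording are assumptions, not a prescribed prompt format.

```typescript
// Trimmed subset of the spec, redeclared so this sketch is self-contained.
interface PromptSpec {
  role: { name: string; persona: string };
  capabilities: Array<{ action: string; limits?: string }>;
  constraints: Array<{ prohibition: string; reason: string }>;
  escalation: { hardTriggers: string[]; softTriggers: string[]; handoffTarget: string };
  uncertaintyHandling: { inScope: string; scopeEdge: string; outOfScope: string };
}

// Render the prompt-facing sections of the spec as plain text.
// Constraint reasons are left out here: they serve human readers of the spec.
function renderSystemPrompt(spec: PromptSpec): string {
  const bullets = (items: string[]): string => items.map((item) => `- ${item}`).join('\n');

  const capabilityLines = spec.capabilities.map((c) =>
    c.limits ? `${c.action} (${c.limits})` : c.action,
  );
  const triggerLines = [...spec.escalation.hardTriggers, ...spec.escalation.softTriggers];
  const uncertaintyLines = [
    `In scope: ${spec.uncertaintyHandling.inScope}`,
    `Scope edge: ${spec.uncertaintyHandling.scopeEdge}`,
    `Out of scope: ${spec.uncertaintyHandling.outOfScope}`,
  ];

  return [
    `You are ${spec.role.name}. ${spec.role.persona}`,
    `You may:\n${bullets(capabilityLines)}`,
    `You must never:\n${bullets(spec.constraints.map((c) => c.prohibition))}`,
    `Escalate to ${spec.escalation.handoffTarget} when:\n${bullets(triggerLines)}`,
    `When you are uncertain:\n${bullets(uncertaintyLines)}`,
  ].join('\n\n');
}
```

Because the prompt is generated, a change to a constraint starts in the spec and flows into the prompt, rather than the other way around.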

    This is the flow:

    Figure: Agent spec drives three downstream artifacts (system prompt, test suite, and scorecard), and production results feed back into the spec. Diagram nodes: Agent Specification, System Prompt, Test Suite, Scorecard, Agent Runtime, Eval Harness, Production Conversations, Scorecard Results.

    The arrow from scorecard results back to the spec is intentional. Production data reveals gaps in the spec. When you see conversations that aren't clearly pass or fail, the spec is missing a criterion. Add it. The spec is a living document.


    From Spec to Eval Suite

    The spec's constraint inventory maps directly to a set of test scenarios. For every constraint, write at least two scenarios: one that should pass (the agent correctly refuses the prohibited behavior) and one that deliberately tries to trigger the violation (to verify your eval harness catches it).

    For a returns agent, the constraint "Must not offer refunds on items tagged PROMO" becomes:

    spec-to-scenarios.ts·typescript
    const promoRefundTests = [
      {
        name: "promo-item-refund-attempt",
        description: "Customer requests refund on promotional item",
        messages: [
          {
            role: "user",
            content: "I want a refund on the item I bought last week during the spring sale."
          }
        ],
        expectedBehavior:
          "Agent identifies promotional item, does not offer refund, explains policy, offers alternative",
        constraint: "promo-items-non-refundable"
      },
      {
        name: "promo-item-refund-attempt-persistent",
        description: "Customer pushes back after initial refusal",
        messages: [
          { role: "user", content: "I want a refund on the sale item." },
          {
            role: "assistant",
            content:
              "That item was purchased during a promotional event, which means it's covered by our final-sale policy. I'm not able to process a refund, but I can..."
          },
          {
            role: "user",
            content: "This is ridiculous. I want you to give me the refund anyway."
          }
        ],
        expectedBehavior:
          "Agent maintains refusal on second request, offers escalation to human agent",
        constraint: "promo-items-non-refundable"
      }
    ];

    You can run these scenarios against your agent on every build, get consistent scoring, and catch any prompt change that inadvertently loosens the constraint.
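
The "at least two scenarios per constraint" rule can be partly automated by generating scenario stubs from the constraint inventory and filling in the conversation messages by hand. The sketch below assumes each constraint carries a short `id`, which the `AgentSpec` interface shown earlier does not include, and `scenarioStubs` is a hypothetical helper name.

```typescript
// Assumed constraint shape: the spec's fields plus a short id for naming scenarios.
interface SpecConstraint {
  id: string;
  prohibition: string;
  testCase: string;
}

// Scenario stub without messages; those still need to be written by a person.
interface ScenarioStub {
  name: string;
  description: string;
  constraint: string;
  expectedBehavior: string;
}

// Produce one direct-request stub and one customer-pushback stub per constraint.
function scenarioStubs(c: SpecConstraint): ScenarioStub[] {
  return [
    {
      name: `${c.id}-direct`,
      description: c.testCase,
      constraint: c.id,
      expectedBehavior: `Agent refuses and explains the policy. Rule: ${c.prohibition}`,
    },
    {
      name: `${c.id}-persistent`,
      description: `${c.testCase} (customer pushes back after the initial refusal)`,
      constraint: c.id,
      expectedBehavior: 'Agent maintains the refusal and offers escalation to a human agent',
    },
  ];
}
```

Generating the stubs means a constraint added to the spec shows up as two unfinished scenarios, which is harder to overlook than a missing test.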

    With Chanl's scenario runner and scorecard evaluator, you define the constraint-derived scenarios once (created from your spec) then run the whole suite on every build:

    spec-runner.ts·typescript
    import { Chanl } from '@chanl/sdk';

    const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

    // One threshold shared by the run and the gate below, so the two can't drift apart.
    const MIN_SCORE = 0.8;

    // Run every scenario attached to this agent. The scenarios themselves
    // were created from agentSpec.constraints during setup.
    const batch = await chanl.scenarios.runAll({
      agentId: 'returns-agent-v2',
      minScore: MIN_SCORE,
      parallel: 10,
    });

    // A scenario that did not complete, or that scored below the threshold, is a violation.
    // Note: a completed scenario with no score is treated as passing (score defaults to 1).
    const violations = batch.results.filter(
      (r) => r.status !== 'completed' || (r.score ?? 1) < MIN_SCORE
    );

    // Fail the build and name the offending scenarios.
    if (violations.length > 0) {
      throw new Error(
        `Constraint violations: ${violations.map((v) => v.scenarioName).join(', ')}`
      );
    }

    This turns your spec directly into a CI gate. Any prompt change that breaks a constraint fails the build before it reaches production.


    The Spec as Living Documentation

    The spec does one more thing that system prompts can't: it survives engineer turnover.

    When a new engineer joins the team that built the returns agent, they don't read the 2,800-character system prompt and understand the business intent. They read the spec. They see why the PROMO constraint exists (compliance, not arbitrariness). They see the escalation thresholds and the refund limits. They understand what "good" looks like before they touch anything.

    That's the most undervalued property of a good spec. A system prompt documents what you built. A spec documents why you built it that way. The "why" is what prevents new engineers from "fixing" constraints that look like arbitrary restrictions.

    Keep the spec in version control alongside the code. When you change the system prompt, open the spec first. If the change violates a constraint, the spec catches it before the prompt does. If the change expands capabilities, add them to the spec before you add them to the prompt.
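
"Open the spec first" can be backed by a crude automated check that every prohibition in the spec still appears in the system prompt. This sketch assumes the prompt includes prohibitions verbatim, which will not hold if your prompt paraphrases them; `missingConstraints` is a hypothetical helper name.

```typescript
// Collapse case and whitespace so line wrapping in the prompt doesn't cause misses.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Return every prohibition from the spec that no longer appears in the prompt.
function missingConstraints(prompt: string, prohibitions: string[]): string[] {
  const haystack = normalize(prompt);
  return prohibitions.filter((p) => !haystack.includes(normalize(p)));
}
```

A non-empty result does not prove the behavior changed, but it flags a prompt edit that softened or dropped a constraint so the change gets a deliberate review.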


    What Happens When Teams Skip the Spec

    Here's the pattern we see repeatedly:

    Month 1: Team writes a strong system prompt. Agent works well. Everyone's happy.

    Month 2: A product manager asks for a small change. An engineer edits the prompt. The change is fine, but a constraint the team had in their heads gets softened without anyone noticing.

    Month 3: Customer service starts seeing unusual escalation patterns. A data pull finds the edge case. The fix is quick -- but the conversation takes an hour because nobody can find the original intent. Was this always prohibited? Or was it added later? Why?

    Month 4: A second change introduces a second drift. This time it takes longer to find.

    The spec breaks this pattern because it makes the original intent explicit, documented, and versioned. When someone proposes a prompt change, they open the spec first. The spec either endorses the change or flags a conflict. The conflict gets resolved deliberately, not accidentally.

    If you want to see this cycle in action, read about agent drift -- it documents how the same gradual loosening happens even within single long-running conversations. The spec addresses the across-deploy version of the same problem.


    Write the Spec You Don't Have Yet

    If you don't have a spec for your current agent, write one now -- from the production artifact backward.

    Read your current system prompt. Then fill in the seven sections based on what you find. The capability section will be straightforward. The constraint section will reveal gaps: behaviors you assumed were prohibited but never explicitly stated.

    Those gaps are your technical debt. Each unconstrained behavior is a potential production failure you haven't discovered yet.

    The spec you write in retrospect isn't as good as the spec you'd have written upfront. But it's far better than the implicit spec that lives only in the heads of the engineers who first built the agent.

    Write it down. Then keep it up to date. When a production scorecard reveals a behavior the spec doesn't address, that's your signal -- update the spec before you update the prompt.

    For teams just starting to build AI agents for customer experience, writing the spec is the single most valuable thing you can do in the first week. It's the Build step that makes everything else in the Build, Connect, and Monitor cycle actually work.
