What Is Structured Output in AI Agents?

Structured output means the model returns data in a guaranteed format, usually JSON matching a specific schema, rather than free-form text you parse yourself. It's the difference between hoping the agent writes valid JSON and knowing it will.

What's the Difference Between JSON Mode and Structured Outputs?

JSON mode guarantees the model returns syntactically valid JSON but doesn't enforce any schema. Structured outputs (constrained decoding) guarantee the response matches your exact schema, with correct field names, correct types, and all required properties. Constrained decoding drops schema mismatch rates well below typical JSON-mode rates.

Why Do AI Agents Fail at Returning Consistent JSON?

Without enforcement, language models generate tokens by probability, not by following a contract. They might use 'transfer' as a key in one response and 'should_transfer' in the next. JSON mode stops invalid JSON syntax, but nothing stops the model from using unexpected field names or nesting.

How Do I Implement Structured Outputs With Claude?

Claude doesn't support constrained decoding directly, but it reliably follows schemas with clear prompting plus Zod validation and a retry loop. Put the schema in your system prompt as a TypeScript interface, validate with Zod on every response, and retry once with the Zod error message on mismatch. Teams report 99%+ compliance with this pattern.

What Is the Validator-Retry Pattern for AI Agents?

The validator-retry pattern validates the model's output against your schema, and if it fails, sends the Zod error message back to the model with a request to fix it. One retry round resolves most schema mismatches because the model can act on specific error messages like 'action: Invalid enum value, received book_appointment'.

How Do Structured Outputs Affect Agent Latency?

Constrained decoding adds minimal latency because it runs during token generation. Retry-on-mismatch adds one full model call on failure, which is why you want first-attempt compliance above 95% before relying on retries as your safety net.

Should I Use Zod or JSON Schema for Agent Output Validation?

Zod is the better choice for TypeScript codebases because it validates and transforms at runtime, infers static types, and gives you a single source of truth for both runtime and compile-time checking. If you need to share schemas across languages, convert with z.toJSONSchema() (built into Zod 4; on Zod 3, the zod-to-json-schema package does the same job).

How Do I Test Structured Output Reliability?

Run your prompt template against 50 to 100 diverse inputs and measure schema compliance rate, not just whether the model returns JSON at all. Track which fields fail most often, whether failures cluster around specific input patterns, and whether one retry resolves them consistently.

Structured Outputs: Make Your AI Agent Stop Guessing

Your agent worked perfectly in testing. Then it hit production.

Same prompt, same model, same call. But this time the booking tool receives { "confirmed": true } and the next call sends { "should_confirm": true }. Your router throws a TypeError. The customer's appointment never schedules. Your agent says goodbye like everything's fine.

You didn't have a reasoning problem. You had a formatting problem.

Most agent failures that look like reasoning errors are actually parsing errors in disguise. The model knew what to do. It just said it in a slightly different shape than your code expected, and nothing caught the mismatch. Structured outputs are how you fix this, and the fix is simpler than most teams realize.

Why Plain JSON Parsing Fails in Production

Without any output enforcement, free-form completions fail JSON parsing often enough that you'll see it daily in any agent doing meaningful volume. Industry write-ups put the rate roughly in the high single digits to low double digits of calls, depending on prompt quality and model. JSON mode tightens that by guaranteeing syntactically valid JSON, but you still get a few percent schema mismatch because the model can swap field names or nest things differently across runs.

The reason is simple: models generate tokens by probability, not by following a contract. They've seen millions of JSON objects with dozens of naming conventions. transfer, should_transfer, doTransfer, and transferCall are all plausible completions depending on context. JSON mode prevents syntactically broken output such as an unclosed {transfer: true. It does nothing to stop the model from choosing an unexpected key name.

Here's what a typical mismatch looks like:

agent-router.ts·typescript

// What your code expects
type AgentAction = {
  action: "book" | "transfer" | "deflect";
  reason: string;
  confidence: number;
};
 
// What the model returned this time
const raw = `{
  "decision": "book_appointment",
  "explanation": "User wants to schedule...",
  "score": 0.94
}`;
 
// Parse succeeds, but action is undefined
const parsed = JSON.parse(raw); // { decision: ..., explanation: ..., score: ... }
console.log(parsed.action); // undefined, silent failure

The parse didn't throw. No error was logged. The agent simply did nothing, then told the user the call was complete. Silent failures like this are the worst kind. At least a thrown exception tells you something broke.
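One way to turn that silent failure into a loud one, before reaching for a validation library, is a hand-written type guard. This is a minimal sketch, not the article's final pattern: `isAgentAction` and `parseAgentAction` are hypothetical names introduced here for illustration.

```typescript
type AgentAction = {
  action: "book" | "transfer" | "deflect";
  reason: string;
  confidence: number;
};

const ACTIONS: readonly string[] = ["book", "transfer", "deflect"];

// Hypothetical helper: narrows unknown JSON to AgentAction, or returns false.
function isAgentAction(value: unknown): value is AgentAction {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.action === "string" &&
    ACTIONS.indexOf(v.action) !== -1 &&
    typeof v.reason === "string" &&
    typeof v.confidence === "number"
  );
}

// Throws on a shape mismatch instead of handing undefined fields to the router.
function parseAgentAction(raw: string): AgentAction {
  const parsed: unknown = JSON.parse(raw);
  if (!isAgentAction(parsed)) {
    throw new Error(`Unexpected agent output shape: ${raw}`);
  }
  return parsed;
}
```

With this in place, the mismatched payload above throws at the parsing boundary rather than reaching the booking tool as `undefined`.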

Three Ways to Enforce Output Shape

Three approaches exist for structured output enforcement, and they differ in what they actually guarantee.

JSON mode is the weakest guarantee. You instruct the model to respond in JSON, and the API guarantees syntactically valid JSON. Field names and schema shape are still up to the model. Use this only as a baseline when your schema is very simple or you're validating downstream anyway.

Constrained decoding is the strongest guarantee. The model's token generation is constrained by your schema at every step, so it can't return a non-conforming response. OpenAI's Structured Outputs API and Gemini's responseSchema use this approach. The tradeoff: you define your schema in the provider's format, which ties you to that provider's implementation.

Prompt plus validation plus retry is what you use with models that don't support constrained decoding (including Claude) or when you need portable logic that works across providers. Put the schema in the system prompt, validate with Zod, and retry once with the error message when validation fails.

As of 2026, constrained decoding is available for OpenAI's GPT-5 family and Gemini 2.5, and provider docs report near-zero schema mismatch when used correctly. For Claude, the prompt-plus-retry pattern reaches the high-99% range when schema instructions are written clearly.

Here's what each approach looks like in TypeScript:

structured-output-strategies.ts·typescript

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { z } from "zod";
// zodTextFormat is the helper for the Responses API `text.format` field.
// (zodResponseFormat is its Chat Completions counterpart.)
import { zodTextFormat } from "openai/helpers/zod";
 
const BookingDecision = z.object({
  action: z.enum(["book", "transfer", "deflect"]),
  reason: z.string(),
  confidence: z.number().min(0).max(1),
});
 
// Strategy A: OpenAI constrained decoding
const openai = new OpenAI();
 
async function withConstrainedDecoding(userMessage: string) {
  const response = await openai.responses.parse({
    model: "gpt-5",
    input: [{ role: "user", content: userMessage }],
    text: {
      format: zodTextFormat(BookingDecision, "booking_decision"),
    },
  });
  // output_parsed is typed from the Zod schema; check for null before using it
  // (for example, when the model refuses to answer).
  return response.output_parsed;
}
 
// Strategy B: Claude with schema instruction + Zod + retry
const anthropic = new Anthropic();
 
const SCHEMA_INSTRUCTION = `
Respond only with valid JSON matching this exact schema:
{
  "action": "book" | "transfer" | "deflect",
  "reason": string,
  "confidence": number between 0 and 1
}
No markdown fences, no explanation, only the JSON object.
`;
 
async function withValidationRetry(
  userMessage: string,
  maxRetries = 1
): Promise<z.infer<typeof BookingDecision>> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await anthropic.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 256,
      system: SCHEMA_INSTRUCTION,
      messages: [{ role: "user", content: userMessage }],
    });
 
    const text =
      response.content[0].type === "text" ? response.content[0].text : "";
 
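    try {
      const parsed = JSON.parse(text);
      return BookingDecision.parse(parsed);
    } catch (err) {
      // Reached for both JSON.parse failures and Zod validation errors.
      if (attempt === maxRetries) throw err;
      // This minimal loop re-sends the same request; it does not feed the error back to the model.
      console.warn(`Schema mismatch on attempt ${attempt + 1}:`, err);
    }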
  }
  throw new Error("Unreachable");
}

The validator-retry pattern does add latency on the retry path: one full model call. That's why you want retries to be rare, not the primary safety net.
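To put a rough number on that retry cost, here is a back-of-the-envelope sketch. It assumes at most one retry and that every model call takes about the same time. `expectedCalls` and `expectedLatencyMs` are illustrative names, and the 800 ms per-call figure is invented for the example.

```typescript
// Expected model calls per request with at most one retry:
// every request makes one call, and the failing fraction (1 - p) makes a second.
function expectedCalls(firstAttemptCompliance: number): number {
  return 1 + (1 - firstAttemptCompliance);
}

// Average latency under the assumption that each call costs perCallMs.
function expectedLatencyMs(
  firstAttemptCompliance: number,
  perCallMs: number
): number {
  return expectedCalls(firstAttemptCompliance) * perCallMs;
}

console.log(expectedCalls(0.95).toFixed(2)); // 1.05
console.log(expectedLatencyMs(0.95, 800).toFixed(0)); // 840
```

The average hides the tail: the 5% of requests that do retry take roughly twice as long as the rest, which matters more for voice agents than the mean does.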

Writing Schema Instructions That Actually Work

The biggest variable in prompt-plus-retry reliability is how you write your schema instruction. Clear instructions cut mismatch rates before any retry is needed.

Show the exact field names you expect. Don't describe them in prose. Write them out as if you're showing a TypeScript interface. Models are trained on enormous amounts of TypeScript and JSON Schema, so they understand type notation at the token level.

Put the schema in the system prompt, not the user turn. System prompts carry stronger instruction-following weight in most models. Mixing schema instructions into the user message lets the model partially override them when the user content is long or contradictory.

Use explicit negative constraints for common confusables. If you have a field called action, tell the model not to use decision, choice, or type. It sounds verbose but consistently cuts mismatch rates for fields with natural synonyms.

End with a formatting reminder. The last sentence before the user message has high influence on the model's response. Something like "Respond with only the JSON object. No markdown fences, no commentary before or after."

Here's how this looks for a customer service routing agent:

system-prompt-builder.ts·typescript

const ROUTING_SYSTEM_PROMPT = `
You are a customer service routing agent. Analyze the customer message
and decide how to handle it.
 
Respond with a JSON object matching this exact schema:
{
  "action": "book" | "transfer" | "deflect",
  "reason": string,       // 1-2 sentences explaining your decision
  "confidence": number    // 0.0 to 1.0, your certainty about this action
}
 
Field name rules:
- Use "action" not "decision", "choice", "route", or "outcome"
- Use "reason" not "explanation", "rationale", or "justification"
- Use "confidence" not "score", "probability", or "certainty"
 
Actions:
- "book": Customer wants to schedule, reschedule, or cancel an appointment
- "transfer": Complex issue that needs a human agent
- "deflect": Can be handled by self-service (FAQ, account portal)
 
Respond with only the JSON object. No markdown fences, no text before or after.
`;

With this pattern, Claude's out-of-the-box compliance for a three-field object typically lands in the 97-99% range before any retry logic kicks in.
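If several agents share this prompt style, the "Field name rules" section can be generated from data so the confusables list stays in one place. The following is a small sketch under that assumption; `buildFieldNameRules` is a hypothetical helper, not part of any SDK.

```typescript
// Map each required field to the synonyms the model should avoid.
type Confusables = Record<string, string[]>;

// Hypothetical helper: renders the "Field name rules" section of a system prompt.
function buildFieldNameRules(confusables: Confusables): string {
  const lines = Object.keys(confusables).map((field) => {
    const quoted = confusables[field].map((name) => `"${name}"`);
    // Join as: "a", "b", or "c" (a single synonym is emitted as-is).
    const list =
      quoted.length > 1
        ? `${quoted.slice(0, -1).join(", ")}, or ${quoted[quoted.length - 1]}`
        : quoted.join("");
    return `- Use "${field}" not ${list}`;
  });
  return ["Field name rules:"].concat(lines).join("\n");
}

const rules = buildFieldNameRules({
  action: ["decision", "choice", "route", "outcome"],
  reason: ["explanation", "rationale", "justification"],
  confidence: ["score", "probability", "certainty"],
});
console.log(rules);
```

The output reproduces the three rule lines from the routing prompt above, so the prompt text and the list of confusables cannot drift apart.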

Building the Validator-Retry Pattern

Once you have a reliable schema instruction, the retry loop is straightforward. Feed the Zod error back to the model as a specific message. Not just "try again" but showing exactly what was wrong.

structured-agent.ts·typescript

import { z, ZodError } from "zod";
import Anthropic from "@anthropic-ai/sdk";
 
const RoutingDecision = z.object({
  action: z.enum(["book", "transfer", "deflect"]),
  reason: z.string().min(10).max(300),
  confidence: z.number().min(0).max(1),
});
 
type RoutingDecision = z.infer<typeof RoutingDecision>;
 
async function routeCustomerMessage(
  client: Anthropic,
  message: string
): Promise<RoutingDecision> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: message },
  ];
 
  for (let attempt = 0; attempt < 2; attempt++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 256,
      system: ROUTING_SYSTEM_PROMPT,
      messages,
    });
 
    const raw =
      response.content[0].type === "text"
        ? response.content[0].text.trim()
        : "";
 
    try {
      // Strip a leading ``` or ```json fence and a trailing ``` fence.
      // Claude adds them on some inputs despite instructions.
      const json = raw
        .replace(/^```(?:json)?\s*/i, "")
        .replace(/\s*```$/, "");
      const parsed = JSON.parse(json);
      return RoutingDecision.parse(parsed);
    } catch (err) {
      if (attempt === 1) {
        console.error("Structured output failed after retry:", { raw, err });
        throw new Error(`Output schema validation failed: ${err}`);
      }
 
      // Build a specific retry message with the actual Zod error paths
      const errorSummary =
        err instanceof ZodError
          ? err.issues // canonical property; `.errors` is only a Zod 3 alias
              .map((e) => `${e.path.join(".")}: ${e.message}`)
              .join("; ")
          : String(err);
 
      messages.push(
        { role: "assistant", content: raw },
        {
          role: "user",
          content: `Your response did not match the required schema. Errors: ${errorSummary}. Please respond with only the JSON object, correcting these issues.`,
        }
      );
    }
  }
 
  throw new Error("Unreachable");
}

Two details in this implementation are worth calling out.

The strip-markdown-fences step handles a real Claude behavior: despite "no markdown fences" in the prompt, some inputs trigger the model to wrap the JSON in a ```json fence. This regex cleanup costs nothing and prevents a class of failures.

The retry message includes the actual Zod error paths. Instead of "please try again," the model sees "action: Invalid enum value. Expected 'book' | 'transfer' | 'deflect', received 'book_appointment'." That specificity is what makes retries work. The model can fix a named error. It can't fix vague feedback.
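Both details can be pulled out into small pure functions, which makes them easy to unit test without calling a model. This is a sketch with no dependencies: `extractJson` and `buildRetryMessage` are hypothetical names, and `Issue` is a stand-in for the two fields of a Zod issue (path and message) that the loop above reads.

```typescript
// Stand-in for the subset of a Zod issue used when building the retry message.
type Issue = { path: Array<string | number>; message: string };

// Hypothetical helper: returns the JSON payload from a reply that may be wrapped
// in a markdown fence, with or without a "json" language tag.
function extractJson(raw: string): string {
  const trimmed = raw.trim();
  const fenced = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/i);
  return fenced ? fenced[1] : trimmed;
}

// Hypothetical helper: turns validation issues into a specific correction request.
function buildRetryMessage(issues: Issue[]): string {
  const summary = issues
    .map((issue) => `${issue.path.join(".") || "(root)"}: ${issue.message}`)
    .join("; ");
  return (
    `Your response did not match the required schema. Errors: ${summary}. ` +
    "Please respond with only the JSON object, correcting these issues."
  );
}

console.log(extractJson('```json\n{"action":"book"}\n```')); // {"action":"book"}
console.log(
  buildRetryMessage([{ path: ["action"], message: "Invalid enum value" }])
);
```

Fences tagged with another language are left in place here, so `JSON.parse` fails on them and the normal retry path takes over.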

Tracking Schema Compliance as a Health Signal

Three metrics tell you everything you need to know about structured output health in production: first-attempt compliance rate, retry resolution rate, and field-level failure distribution.

First-attempt compliance rate is what percentage of calls pass schema validation without a retry. Anything below 95% means your schema instruction needs revision or your model is seeing inputs it wasn't tested on.

Retry resolution rate is, of calls that fail first-attempt validation, what percentage succeed on retry. A retry resolution rate below 80% means the validation errors aren't actionable. The model can't fix what it doesn't understand from the error message.

Field-level failure distribution tells you which specific fields fail most often. If confidence is misfiring but action is perfect, your prompt has a specific weakness you can fix in one line rather than rewriting the whole instruction.
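As a rough illustration, all three metrics can be computed from one log record per call. The `CallRecord` shape and the `summarize` function below are assumptions made for this sketch, not a format any particular tool emits.

```typescript
// One record per routed call. This shape is an assumption for the sketch.
type CallRecord = {
  attempts: 1 | 2; // model calls made for this request
  compliant: boolean; // did the final output pass validation?
  failedFields: string[]; // fields that failed validation on the first attempt
};

type ComplianceReport = {
  firstAttemptRate: number;
  retryResolutionRate: number;
  fieldFailures: Record<string, number>;
};

function summarize(records: CallRecord[]): ComplianceReport {
  const firstPass = records.filter((r) => r.attempts === 1 && r.compliant).length;
  const retried = records.filter((r) => r.attempts === 2);
  const resolved = retried.filter((r) => r.compliant).length;

  // Count how often each field appears among first-attempt failures.
  const fieldFailures: Record<string, number> = {};
  for (const record of records) {
    for (const field of record.failedFields) {
      fieldFailures[field] = (fieldFailures[field] || 0) + 1;
    }
  }

  return {
    // With no data there is nothing failing, so both rates default to 1.
    firstAttemptRate: records.length ? firstPass / records.length : 1,
    retryResolutionRate: retried.length ? resolved / retried.length : 1,
    fieldFailures,
  };
}

const report = summarize([
  { attempts: 1, compliant: true, failedFields: [] },
  { attempts: 1, compliant: true, failedFields: [] },
  { attempts: 2, compliant: true, failedFields: ["confidence"] },
  { attempts: 2, compliant: false, failedFields: ["action", "confidence"] },
]);
console.log(report);
// firstAttemptRate: 0.5, retryResolutionRate: 0.5,
// fieldFailures: { confidence: 2, action: 1 }
```

In this toy sample, `confidence` fails twice as often as `action`, which is the kind of signal that points at a single line of the prompt.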

If you're using Chanl's monitoring dashboard, you can attach these as metadata on each call. Every call gets a schema_validation_attempts count and a schema_compliant boolean, giving you per-field failure rates broken down by input category and over time.

instrumented-agent.ts·typescript

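import Anthropic from "@anthropic-ai/sdk";
import Chanl from "@chanl/sdk";
// Assumes structured-agent.ts exports routeCustomerMessage and the RoutingDecision type.
import { routeCustomerMessage, type RoutingDecision } from "./structured-agent";
 
const anthropic = new Anthropic();
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
async function routeWithInstrumentation(
  callId: string,
  message: string
): Promise<RoutingDecision> {
  const start = Date.now();
  // routeCustomerMessage does not report how many attempts it used, so this
  // records the outcome only: 1 = returned a valid result, 2 = gave up.
  // Return the attempt count from it to measure true first-attempt compliance.
  let attempts = 0;
 
  try {
    const result = await routeCustomerMessage(anthropic, message);
    attempts = 1;
    return result;
  } catch (err) {
    attempts = 2;
    // Keep the original error attached so the failure stays debuggable.
    throw new Error("Routing failed", { cause: err });
  } finally {
    await chanl.calls.update(callId, {
      metadata: {
        schema_validation_attempts: attempts,
        schema_compliant: attempts === 1,
        routing_latency_ms: Date.now() - start,
      },
    });
  }
}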

Spikes in calls where schema_validation_attempts is greater than 1 show up in the monitoring dashboard, and you can drill into which inputs triggered them. That lets you catch prompt drift before it turns into user-visible failures.

When Structured Outputs Aren't Enough

Structured outputs eliminate formatting failures. They don't eliminate reasoning failures.

An agent can return a perfectly valid schema with "action": "book" when it should have returned "action": "transfer". The confidence might say 0.92. Everything validates. The customer still gets routed wrong.

That's why structured outputs and testing need to work together. Once the output shape is stable, you can write deterministic assertions against field values, run hundreds of synthetic conversations, and measure how often the agent makes the right call rather than just the right-shaped call.
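Once the shape is fixed, measuring "the right call" reduces to comparing one field against a label. The sketch below shows the idea with a synchronous stub in place of the real model-backed router so that it runs offline. `stubRouter`, `routingAccuracy`, and the three sample messages are all invented for illustration.

```typescript
type Action = "book" | "transfer" | "deflect";
type RoutingDecision = { action: Action; reason: string; confidence: number };
type LabeledCase = { message: string; expected: Action };

// Stand-in for the real (async, model-backed) router. It never transfers,
// which is exactly the kind of reasoning gap a schema check cannot see.
function stubRouter(message: string): RoutingDecision {
  const action: Action = /appointment|schedule/i.test(message) ? "book" : "deflect";
  return { action, reason: "keyword match in stub router", confidence: 0.9 };
}

// Deterministic assertion on a field value: does action match the label?
function routingAccuracy(
  route: (message: string) => RoutingDecision,
  cases: LabeledCase[]
): { accuracy: number; misses: LabeledCase[] } {
  const misses = cases.filter((c) => route(c.message).action !== c.expected);
  return {
    accuracy: cases.length ? 1 - misses.length / cases.length : 1,
    misses,
  };
}

const labeledCases: LabeledCase[] = [
  { message: "I need to schedule a cleaning", expected: "book" },
  { message: "Where do I download my invoice?", expected: "deflect" },
  { message: "I was double charged and I'm furious", expected: "transfer" },
];

const evaluation = routingAccuracy(stubRouter, labeledCases);
console.log(evaluation.misses.map((c) => c.message));
```

Every response from the stub is schema-valid, yet one of the three cases is routed wrong, and the `misses` list names it.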

We cover building that evaluation layer in AI Agent Testing: How to Evaluate Agents Before They Talk to Customers and Scenario Testing: The QA Strategy That Catches What Unit Tests Miss. Structured outputs are the foundation. Scenario testing is the superstructure.

The Schema as a Type System

The most useful mental shift that comes from adopting structured outputs consistently is this: the schema becomes the type system for your agent's outputs.

When a new engineer touches the routing agent, they see the Zod schema and understand exactly what the agent can return. When you want to add a new action type, you update the schema first, update the prompt, run tests, and ship. When something breaks in production, you check whether the schema was violated or the reasoning was wrong. Two different debugging paths that lead to different fixes.
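The "update the schema first" workflow gets compile-time backing if action handlers switch exhaustively over the action type. Here is a minimal sketch: the union is written by hand so it runs without Zod (in the article's code it would come from the inferred schema type), and `handlerFor` with its queue names is hypothetical.

```typescript
type Action = "book" | "transfer" | "deflect";

// Hypothetical dispatcher: returns the name of the component that takes the call.
function handlerFor(action: Action): string {
  switch (action) {
    case "book":
      return "booking-service";
    case "transfer":
      return "human-queue";
    case "deflect":
      return "self-service";
    default: {
      // If a new member is added to Action without a case above, this
      // assignment stops compiling, so the gap surfaces at build time.
      const unhandled: never = action;
      throw new Error(`Unhandled action: ${String(unhandled)}`);
    }
  }
}

console.log(handlerFor("transfer")); // human-queue
```

Adding a fourth action to the union then produces a type error in every dispatcher that has not been updated, before any test runs.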

This is why, in Function Calling: Build a Multi-Tool AI Agent from Scratch, we push toward explicit TypeScript interfaces even for simple tool calls. The schema is load-bearing documentation that's automatically enforced at runtime. It's the one artifact that can't go stale without your code telling you.

Agents that generate free-form text and parse it with regex fail in production in ways you can't predict. Agents with schema-enforced outputs fail only in ways your schema allows, predictably. And predictable failures are fixable failures.

Figure: Structured output flow, from model call through Zod validation to action handler or retry.

Shipping With Schema Confidence

Once structured outputs are in place, you've eliminated the entire class of "I don't understand what the agent returned" incidents. That's a meaningful slice of production failures for most teams.

The setup path is short:

1. Define your output schema with Zod and infer the TypeScript type from it
2. Write schema instructions that name exact field names and list common confusables to avoid
3. Implement the validator-retry loop with Zod error messages fed back to the model
4. Add a strip-markdown-fences pre-processing step
5. Log schema_validation_attempts per call and alert when first-attempt compliance drops below 95%

Start with your highest-volume agent action. Get that one to the high-99% range. Then expand across the rest of your routing logic. The structured output pattern pays off disproportionately because every downstream component benefits from the reliability. Your action handlers can trust what they receive, your tests can make deterministic assertions, and your monitoring sees clean data.


Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
