Every few years, someone declares that a new interface paradigm will kill the old one. Mobile was going to kill the desktop. Chatbots were going to kill apps. Voice was going to kill screens.
None of those predictions came true in the way people expected. Mobile did not kill the desktop. It carved out specific contexts where phones were better and left the desktop dominant for others. Chatbots did not kill apps. They found niches (customer support, simple transactions) and stalled everywhere else. Voice assistants, despite billions of dollars from Amazon, Google, and Apple, settled into a narrow groove: timers, weather, and music.
Now the prediction is back, wearing new clothes. AI agents powered by large language models are dramatically more capable than the voice assistants of 2018. They can reason, hold context across long conversations, use tools, and complete multi-step tasks. The question is real this time: will this generation of voice-first technology produce genuine SaaS giants, or are we watching another hype cycle that ends in "voice is a feature, not a product"?
I think the honest answer is more interesting than either the hype or the skepticism. Voice-first SaaS is already generating real revenue in specific verticals. It is also structurally limited in ways that most voice-first evangelists do not acknowledge. The next SaaS giants probably will not be purely voice-first. But conversation as a service will be a layer in nearly every SaaS product, and the companies that build that layer will be very large.
Let me walk through both sides.
The real case for voice-first SaaS
What has actually changed since 2018
The Alexa-era voice assistants failed to become platforms for two reasons: they could not understand complex requests, and they could not take meaningful actions. You could ask Alexa to set a timer, but you could not ask it to reschedule your dentist appointment, check whether your insurance would cover it, and add the new time to your calendar. The intent recognition was too brittle, and the integration layer was too shallow.
Both of those constraints have loosened dramatically.
On the understanding side, large language models handle ambiguity, context, and multi-turn conversation at a level that would have seemed magical five years ago. A user can say "I need to move my Thursday appointment to next week, but not Monday because I have the thing with my kid's school" and a well-built agent can parse that, ask the right clarifying question, and take action. That was impossible with the intent-slot frameworks of 2018.
On the action side, tool-use capabilities (like MCP and function calling) let language models interact with external systems in structured ways. An agent can check a calendar API, query a CRM, update a database record, and send a confirmation, all within a single conversation. The agent is not just understanding language. It is operating software.
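To make the "operating software" point concrete, here is a minimal sketch of the tool-use loop. The tool names, the `check_calendar` and `book_appointment` stubs, and the hand-written dispatcher are all hypothetical; in a real system the model emits these structured calls itself through a function-calling API or MCP, and the tools hit live services.

```python
# Hypothetical sketch of a tool-use loop: the model parses a request,
# then emits structured calls that a dispatcher routes to real systems.

def check_calendar(date: str) -> list[str]:
    """Stand-in for a real calendar API call."""
    return ["09:00", "14:30"] if date == "2025-03-06" else []

def book_appointment(date: str, time: str) -> dict:
    """Stand-in for a real booking API call."""
    return {"status": "booked", "date": date, "time": time}

# Tools are exposed to the agent as named, structured operations.
TOOLS = {
    "check_calendar": check_calendar,
    "book_appointment": book_appointment,
}

def execute_tool_call(name: str, args: dict):
    """Dispatch one structured tool call emitted by the model."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

# In a real conversation the model would choose these calls itself
# after parsing the user's request; here we invoke them directly.
slots = execute_tool_call("check_calendar", {"date": "2025-03-06"})
result = execute_tool_call("book_appointment",
                           {"date": "2025-03-06", "time": slots[0]})
```

The structure, not the stubs, is the point: the model's job ends at producing a valid `(name, args)` pair, and everything after that is ordinary software integration.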
This combination means voice-first experiences that were previously impossible are now buildable. And some companies are building them successfully.
Where voice-first is working today
The voice-first SaaS companies generating the most traction are not building general-purpose voice assistants. They are building vertical solutions for specific workflows where voice is genuinely the best interface.
Healthcare scheduling and triage. Companies in this space are handling millions of calls per month for health systems. Appointment scheduling is an almost perfect voice use case: the patient knows what they want, the task is sequential, the alternative is waiting on hold, and the information exchange is structured enough for an AI to handle reliably.
Restaurant ordering. Several companies now process phone orders for restaurant chains. Again, the task is sequential, the domain is constrained, and the existing alternative is a phone call. The AI does not need to be perfect. It needs to be faster and more consistent than the teenager who used to answer the phone during the dinner rush.
Outbound calling for collections and scheduling. Companies are automating outbound calls for appointment reminders, payment collections, and follow-ups. These are high-volume, repetitive conversations that follow predictable patterns. The economics work because each call is short, the labor cost of human callers is high, and the conversation rarely goes off-script.
Customer support tier 1. The oldest and most mature category. Voice AI handles routine inquiries (order status, account balance, password reset) while routing complex issues to humans. This is less "voice-first SaaS" and more "voice AI as a feature," but it is where the most revenue lives today.
| Vertical | Why Voice Works | Revenue Model | Maturity |
|---|---|---|---|
| Healthcare scheduling | Patient intent is clear, task is sequential, alternative is hold time | Per appointment booked | Growing rapidly |
| Restaurant ordering | Constrained domain, predictable dialogue, replaces phone orders | Per order processed | Established in chains |
| Outbound collections/reminders | High volume, repetitive, short calls | Per completed call | Mature |
| Customer support tier 1 | Handles routine queries, routes complex issues | Per minute or per resolution | Most mature |
| Real estate lead qualification | Inbound calls need fast response, qualification is formulaic | Per qualified lead | Early |
| IT helpdesk | Password resets, ticket creation, status checks | Per resolution or per seat | Emerging |
The infrastructure layer is booming
Underneath the vertical applications, the voice AI infrastructure layer is growing fast. Companies like Vapi, Bland AI, and Retell provide the plumbing: speech-to-text, text-to-speech, orchestration, and telephony integration. They make it possible to build a voice agent without building the entire stack from scratch.
This is where the "next SaaS giant" argument has the most merit. Just as Twilio became a massive business by providing the communication layer that other companies built on, voice AI infrastructure companies are positioning themselves as the Twilio of conversation. The analogy is not perfect (Twilio's early product was simpler and more commoditized), but the market structure is similar: a few infrastructure players enabling thousands of vertical applications.
The honest case against voice-first SaaS
Now for the part that most voice-first advocates skip. Voice has real, structural limitations as a primary interface, and those limitations are not going away with better models.
Voice lacks information density
A screen can display a table of 50 rows and 8 columns at once. You can scan it in seconds, compare values, spot outliers. Voice cannot do this. If an agent reads you 50 items, you will retain maybe 4 of them.
This is not a fidelity problem that better TTS will solve. It is a property of the auditory channel. Humans process visual information in parallel (your eyes scan a page) and auditory information in serial (you hear one word at a time). For tasks that require comparison, scanning, or reference, screens will always be superior.
This matters because many SaaS workflows involve exactly these tasks. Comparing pricing plans. Reviewing a pipeline of deals. Scanning a list of support tickets. Analyzing a chart. These are visual activities, and wrapping them in voice makes them worse, not better.
Error recovery is clumsy in voice
When you misclick on a screen, you hit undo or click somewhere else. When a voice agent misunderstands you, you have to verbally explain the error, wait for acknowledgment, and re-state your intent. This is slow, frustrating, and compounds quickly in multi-step tasks.
Good voice UX can mitigate this with confirmation prompts and graceful recovery. But it cannot eliminate the fundamental asymmetry: correcting a visual interface is instant and spatial (click the right thing), while correcting a voice interface is sequential and verbal (explain what went wrong).
Conversation is inherently serial
In a screen-based SaaS, you can have a CRM open in one tab, a spreadsheet in another, and Slack in a third. You switch between them fluidly, carrying mental context across all three.
Voice is one thing at a time. You cannot have two conversations simultaneously. You cannot "glance" at a voice interface the way you glance at a dashboard. This serial nature makes voice excellent for focused, single-task workflows and poor for the kind of multitasking that defines most knowledge work.
The "it's just a feature" problem
The most dangerous challenge for voice-first SaaS companies is that every traditional SaaS company can add voice as a feature. Salesforce can add a voice agent to its CRM. HubSpot can add voice to its support tools. Zendesk already has.
When the incumbents add voice, the voice-first startup's differentiation evaporates. The startup's voice UX might be better, but the incumbent has the data, the integrations, and the customer relationships. This is the classic platform problem: a feature of something large beats a product of something small.
The voice-first companies that survive this will be the ones whose entire value proposition depends on conversation. If removing the voice interface makes the product pointless, it is defensible. If removing the voice interface just means the product reverts to a decent dashboard, the incumbent wins.
The CaaS model: conversation as a layer, not a product
Here is where I think the market is actually heading: conversation as a service will be a layer in the software stack, not a standalone product category.
The analogy is payments. Stripe did not build a payments-first SaaS that replaced existing software. It built a payments layer that every other software product could embed. Payments is not a product. It is a capability. But the company that provides that capability at scale is worth $65 billion.
Conversation is following the same trajectory. The winning strategy is not "replace Salesforce with a voice-first CRM." It is "provide the conversation layer that Salesforce, HubSpot, and every custom application can embed."
This is already happening. Companies build AI agents that interact with their existing systems through tool integrations. The agent is not a new SaaS product. It is a new interface to existing products. The value is in the orchestration: understanding the user's intent, routing to the right system, executing the transaction, and reporting the result. Tools like Chanl's MCP integration make this possible by giving agents standardized access to external tools and data sources.
What a CaaS architecture looks like
A conversation-as-a-service platform has four layers.
The speech layer handles audio input and output: speech-to-text, text-to-speech, noise cancellation, endpointing. This is increasingly commoditized.
The reasoning layer interprets intent, maintains context, and decides what to do next. This is where LLMs live. It is the hardest layer to get right and the most differentiated.
The action layer executes operations against external systems: checking a database, updating a record, triggering a workflow, sending a notification. This requires integrations with tools and APIs, and it is where most voice agent projects stall. The AI can understand the request. It just cannot do anything about it.
The observation layer monitors conversation quality, tracks outcomes, scores agent performance, and surfaces issues. Without this layer, you are flying blind. With it, you can measure and improve continuously.
The companies that control the middle two layers (reasoning and action) are the ones most likely to become very large. The speech layer is becoming a utility. The observation layer is essential but often undervalued. The reasoning and action layers are where the intelligence and the integration lock-in live.
Pricing models that actually work
CaaS pricing is genuinely different from traditional SaaS, and the market has not settled on a winning model yet.
| Pricing Model | How It Works | Pros | Cons |
|---|---|---|---|
| Per minute | Charge for each minute of conversation | Simple, familiar (telephony model) | Misaligned with quality: the vendor earns more from longer calls |
| Per conversation | Flat fee per completed conversation | Predictable for buyer | Hard to define "conversation" boundaries |
| Per outcome | Charge when conversation achieves a goal (appointment booked, issue resolved) | Perfectly aligned with value | Hard to attribute outcomes, disputes |
| Platform + usage | Monthly platform fee plus per-conversation variable | Predictable base revenue, scales with usage | More complex to communicate |
| Per seat + AI credits | Traditional SaaS pricing with AI usage limits | Familiar to buyers | Misaligns conversation volume with cost |
The per-outcome model is the most intellectually appealing (you pay only when the AI delivers value) but the hardest to implement (how do you prove the AI booked the appointment, not the patient who was going to book anyway?). Most CaaS companies are settling on platform-plus-usage, which gives them a revenue floor and a growth lever.
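A toy calculation makes the divergence between these models tangible. The rates and volumes below are invented for illustration; real CaaS pricing varies widely by vertical and volume.

```python
# Hypothetical monthly bills under three of the pricing models above.

def per_minute(minutes: float, rate: float = 0.10) -> float:
    return minutes * rate

def per_conversation(conversations: int, fee: float = 0.75) -> float:
    return conversations * fee

def platform_plus_usage(conversations: int,
                        platform_fee: float = 500.0,
                        per_conv: float = 0.40) -> float:
    return platform_fee + conversations * per_conv

# 10,000 conversations averaging 4 minutes each:
convs, avg_min = 10_000, 4.0
bills = {
    "per_minute": per_minute(convs * avg_min),          # 4,000.00
    "per_conversation": per_conversation(convs),         # 7,500.00
    "platform_plus_usage": platform_plus_usage(convs),   # 4,500.00
}
```

The per-minute model's quality problem shows up immediately in numbers like these: a better agent that resolves calls in 3 minutes instead of 4 cuts the vendor's own revenue by 25 percent.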
What this means for teams building AI agents
If you are building AI agents today, whether for customer support, sales, healthcare, or any other domain, here is the practical takeaway.
Voice will be one of your channels, not your only channel. Build your agent's reasoning and action layers to be channel-agnostic. The same logic that handles a voice call should handle a chat message, a text, or an API request. The speech layer is interchangeable. The brain is not. Platforms like Chanl are built around this principle, letting you connect agents to multiple channels while maintaining a single source of truth for configuration and monitoring.
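The channel-agnostic principle is simple enough to sketch: one reasoning function, several thin channel adapters. The function and handler names here are illustrative, not any specific platform's API.

```python
# One "brain," many channels. Each adapter only reshapes input/output;
# none of them contain conversation logic.

def agent_logic(text: str) -> str:
    """Channel-independent reasoning. In production this calls an LLM;
    here it is a keyword stub for illustration."""
    if "order status" in text.lower():
        return "Your order shipped yesterday."
    return "How can I help you today?"

def handle_voice(transcript: str) -> str:
    # Speech-to-text happens upstream; the brain only ever sees text.
    return agent_logic(transcript)

def handle_chat(message: str) -> str:
    return agent_logic(message)

def handle_api(payload: dict) -> dict:
    return {"reply": agent_logic(payload["text"])}
```

Because the adapters are stateless pass-throughs, adding a new channel (SMS, a web widget) means writing one more thin function, not forking the agent.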
Invest more in the action layer than the speech layer. The biggest gap in most voice agents is not speech quality. It is the ability to actually do things. An agent that understands perfectly but cannot check inventory, update a record, or schedule an appointment is just a polite dead end. Tool integration is the difference between a demo and a product.
Measure conversation quality from day one. The CaaS companies that will win long-term are the ones that know, quantitatively, how well their agents perform. Not "it seems to be working" but "our first-call resolution rate is 73%, up from 68% last month, with a 4.2/5 average customer satisfaction score." Scorecards and analytics are not nice-to-haves. They are the feedback loop that drives improvement.
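The two metrics quoted above are cheap to compute from day one. Here is a minimal sketch over a log of hypothetical conversation records; the field names and the log format are assumptions, not a standard schema.

```python
# First-call resolution rate and average CSAT from a conversation log.

conversations = [
    {"resolved_first_call": True,  "csat": 5},
    {"resolved_first_call": True,  "csat": 4},
    {"resolved_first_call": False, "csat": 2},
    {"resolved_first_call": True,  "csat": 5},
]

def first_call_resolution(log: list[dict]) -> float:
    """Fraction of conversations resolved without escalation or callback."""
    return sum(c["resolved_first_call"] for c in log) / len(log)

def avg_csat(log: list[dict]) -> float:
    """Mean customer satisfaction score on a 1-5 scale."""
    return sum(c["csat"] for c in log) / len(log)

print(f"FCR: {first_call_resolution(conversations):.0%}")   # FCR: 75%
print(f"CSAT: {avg_csat(conversations):.1f}/5")             # CSAT: 4.0/5
```

The hard part is not the arithmetic but the labeling: deciding, per conversation, whether it was actually resolved. That is what a scorecard pipeline automates.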
Do not bet everything on voice-first. Unless your specific use case is one where voice is clearly the best interface (hands-free contexts, phone-based workflows, accessibility needs), build for multiple interfaces. The future is multimodal, and the most valuable AI agents will move between voice, chat, and screen as the context demands.
So will the next SaaS giants be voice-first?
Probably not purely voice-first. But conversation-first? Very possibly.
The next giant SaaS companies will likely be the ones that master conversation as a capability layer: the orchestration, the tool integration, the quality monitoring, and the multi-channel deployment. They will not sell voice as a product. They will sell the ability to make any software accessible through natural conversation, whether that conversation happens over the phone, in a chat window, or through a voice assistant.
The companies best positioned for this are the ones building the infrastructure that makes agents production-ready. Not the speech providers (that is becoming a commodity). Not the LLM providers (that is a different, larger market). The companies in the middle: the ones that connect reasoning to action and make the whole thing observable and improvable.
That is the real business. Not voice-first SaaS. Conversation as infrastructure. And if history is any guide, the infrastructure company always ends up bigger than the applications built on top of it.
Build conversation infrastructure, not just voice features
Chanl gives your AI agents tools, knowledge, memory, and monitoring across every channel. Build once, deploy to voice, chat, and messaging.