Every few years, someone declares that a new interface paradigm will kill the old one. Mobile was going to kill the desktop. Chatbots were going to kill apps. Voice was going to kill screens.
None of those predictions came true in the way people expected. Mobile did not kill the desktop. It carved out specific contexts where phones were better and left the desktop dominant for others. Chatbots did not kill apps. They found niches (customer support, simple transactions) and stalled everywhere else. Voice assistants, despite billions of dollars from Amazon, Google, and Apple, settled into a narrow groove: timers, weather, and music.
Now the prediction is back, wearing new clothes. AI agents powered by large language models are dramatically more capable than the voice assistants of 2018. They can reason, hold context across long conversations, use tools, and complete multi-step tasks. The question is real this time: will this generation of voice-first technology produce genuine SaaS giants, or are we watching another hype cycle that ends in "voice is a feature, not a product"?
I think the honest answer is more interesting than either the hype or the skepticism. Voice-first SaaS is already generating real revenue in specific verticals. It is also structurally limited in ways that most voice-first evangelists do not acknowledge. The next SaaS giants probably will not be purely voice-first. But conversation as a service will be a layer in nearly every SaaS product, and the companies that build that layer will be very large.
Let me walk through both sides.
The real case for voice-first SaaS
What has actually changed since 2018
The Alexa-era voice assistants failed to become platforms for two reasons: they could not understand complex requests, and they could not take meaningful actions. You could ask Alexa to set a timer, but you could not ask it to reschedule your dentist appointment, check whether your insurance would cover it, and add the new time to your calendar. The intent recognition was too brittle, and the integration layer was too shallow.
Both of those constraints have loosened dramatically.
On the understanding side, large language models handle ambiguity, context, and multi-turn conversation at a level that would have seemed magical five years ago. A user can say "I need to move my Thursday appointment to next week, but not Monday because I have the thing with my kid's school" and a well-built agent can parse that, ask the right clarifying question, and take action. That was impossible with the intent-slot frameworks of 2018.
On the action side, tool-use capabilities (like MCP and function calling) let language models interact with external systems in structured ways. An agent can check a calendar API, query a CRM, update a database record, and send a confirmation, all within a single conversation. The agent is not just understanding language. It is operating software.
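To make the "operating software" point concrete, here is a minimal sketch of the tool-use loop. The tool names, the `check_calendar` and `book_appointment` stubs, and the hand-written dispatcher are all hypothetical; in a real system the model emits these structured calls itself through a function-calling API or MCP, and the tools hit live services.

```python
# Hypothetical sketch of a tool-use loop: the model parses a request,
# then emits structured calls that a dispatcher routes to real systems.

def check_calendar(date: str) -> list[str]:
    """Stand-in for a real calendar API call."""
    return ["09:00", "14:30"] if date == "2025-03-06" else []

def book_appointment(date: str, time: str) -> dict:
    """Stand-in for a real booking API call."""
    return {"status": "booked", "date": date, "time": time}

# Tools are exposed to the agent as named, structured operations.
TOOLS = {
    "check_calendar": check_calendar,
    "book_appointment": book_appointment,
}

def execute_tool_call(name: str, args: dict):
    """Dispatch one structured tool call emitted by the model."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

# In a real conversation the model would choose these calls itself
# after parsing the user's request; here we invoke them directly.
slots = execute_tool_call("check_calendar", {"date": "2025-03-06"})
result = execute_tool_call("book_appointment",
                           {"date": "2025-03-06", "time": slots[0]})
```

The structure, not the stubs, is the point: the model's job ends at producing a valid `(name, args)` pair, and everything after that is ordinary software integration.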
This combination means voice-first experiences that were previously impossible are now buildable. And some companies are building them successfully.
Where voice-first is working today
The voice-first SaaS companies generating the most traction are not building general-purpose voice assistants. They are building vertical solutions for specific workflows where voice is genuinely the best interface.
Healthcare scheduling and triage. Companies in this space are handling millions of calls per month for health systems. Appointment scheduling is an almost perfect voice use case: the patient knows what they want, the task is sequential, the alternative is waiting on hold, and the information exchange is structured enough for an AI to handle reliably.
Restaurant ordering. Several companies now process phone orders for restaurant chains. Again, the task is sequential, the domain is constrained, and the existing alternative is a phone call. The AI does not need to be perfect. It needs to be faster and more consistent than the teenager who used to answer the phone during the dinner rush.
Outbound calling for collections and scheduling. Companies are automating outbound calls for appointment reminders, payment collections, and follow-ups. These are high-volume, repetitive conversations that follow predictable patterns. The economics work because each call is short, the labor cost of human callers is high, and the conversation rarely goes off-script.
Customer support tier 1. The oldest and most mature category. Voice AI handles routine inquiries (order status, account balance, password reset) while routing complex issues to humans. This is less "voice-first SaaS" and more "voice AI as a feature," but it is where the most revenue lives today.
| Vertical | Why Voice Works | Revenue Model | Maturity |
|---|---|---|---|
| Healthcare scheduling | Patient intent is clear, task is sequential, alternative is hold time | Per appointment booked | Growing rapidly |
| Restaurant ordering | Constrained domain, predictable dialogue, replaces phone orders | Per order processed | Established in chains |
| Outbound collections/reminders | High volume, repetitive, short calls | Per completed call | Mature |
| Customer support tier 1 | Handles routine queries, routes complex issues | Per minute or per resolution | Most mature |
| Real estate lead qualification | Inbound calls need fast response, qualification is formulaic | Per qualified lead | Early |
| IT helpdesk | Password resets, ticket creation, status checks | Per resolution or per seat | Emerging |
The infrastructure layer is booming
Underneath the vertical applications, the voice AI infrastructure layer is growing fast. Companies like Vapi, Bland AI, and Retell provide the plumbing: speech-to-text, text-to-speech, orchestration, and telephony integration. They make it possible to build a voice agent without building the entire stack from scratch.
This is where the "next SaaS giant" argument has the most merit. Just as Twilio became a massive business by providing the communication layer that other companies built on, voice AI infrastructure companies are positioning themselves as the Twilio of conversation. The analogy is not perfect (Twilio's early product was simpler and more commoditized), but the market structure is similar: a few infrastructure players enabling thousands of vertical applications.
The honest case against voice-first SaaS
Now for the part that most voice-first advocates skip. Voice has real, structural limitations as a primary interface, and those limitations are not going away with better models.
Voice lacks information density
A screen can display a table of 50 rows and 8 columns at once. You can scan it in seconds, compare values, spot outliers. Voice cannot do this. If an agent reads you 50 items, you will retain maybe 4 of them.
This is not a fidelity problem that better TTS will solve. It is a property of the auditory channel. Humans process visual information in parallel (your eyes scan a page) and auditory information in serial (you hear one word at a time). For tasks that require comparison, scanning, or reference, screens will always be superior.
This matters because many SaaS workflows involve exactly these tasks. Comparing pricing plans. Reviewing a pipeline of deals. Scanning a list of support tickets. Analyzing a chart. These are visual activities, and wrapping them in voice makes them worse, not better.
Error recovery is clumsy in voice
When you misclick on a screen, you hit undo or click somewhere else. When a voice agent misunderstands you, you have to verbally explain the error, wait for acknowledgment, and re-state your intent. This is slow, frustrating, and compounds quickly in multi-step tasks.
Good voice UX can mitigate this with confirmation prompts and graceful recovery. But it cannot eliminate the fundamental asymmetry: correcting a visual interface is instant and spatial (click the right thing), while correcting a voice interface is sequential and verbal (explain what went wrong).
Conversation is inherently serial
In a screen-based SaaS, you can have a CRM open in one tab, a spreadsheet in another, and Slack in a third. You switch between them fluidly, carrying mental context across all three.
Voice is one thing at a time. You cannot have two conversations simultaneously. You cannot "glance" at a voice interface the way you glance at a dashboard. This serial nature makes voice excellent for focused, single-task workflows and poor for the kind of multitasking that defines most knowledge work.
The "it's just a feature" problem
The most dangerous challenge for voice-first SaaS companies is that every traditional SaaS company can add voice as a feature. Salesforce can add a voice agent to its CRM. HubSpot can add voice to its support tools. Zendesk already has.
When the incumbents add voice, the voice-first startup's differentiation evaporates. The startup's voice UX might be better, but the incumbent has the data, the integrations, and the customer relationships. This is the classic platform problem: a feature of something large beats a product of something small.
The voice-first companies that survive this will be the ones whose entire value proposition depends on conversation. If removing the voice interface makes the product pointless, it is defensible. If removing the voice interface just means the product reverts to a decent dashboard, the incumbent wins.
The CaaS model: conversation as a layer, not a product
Here is where I think the market is actually heading: conversation as a service will be a layer in the software stack, not a standalone product category.
The analogy is payments. Stripe did not build a payments-first SaaS that replaced existing software. It built a payments layer that every other software product could embed. Payments is not a product. It is a capability. But the company that provides that capability at scale is worth $65 billion.
Conversation is following the same trajectory. The winning strategy is not "replace Salesforce with a voice-first CRM." It is "provide the conversation layer that Salesforce, HubSpot, and every custom application can embed."
This is already happening. Companies build AI agents that interact with their existing systems through tool integrations. The agent is not a new SaaS product. It is a new interface to existing products. The value is in the orchestration: understanding the user's intent, routing to the right system, executing the transaction, and reporting the result. Tools like Chanl's MCP integration make this possible by giving agents standardized access to external tools and data sources.
What a CaaS architecture looks like
A conversation-as-a-service platform has four layers.
The speech layer handles audio input and output: speech-to-text, text-to-speech, noise cancellation, endpointing. This is increasingly commoditized.
The reasoning layer interprets intent, maintains context, and decides what to do next. This is where LLMs live. It is the hardest layer to get right and the most differentiated.
The action layer executes operations against external systems: checking a database, updating a record, triggering a workflow, sending a notification. This requires integrations with tools and APIs, and it is where most voice agent projects stall. The AI can understand the request. It just cannot do anything about it.
The observation layer monitors conversation quality, tracks outcomes, scores agent performance, and surfaces issues. Without this layer, you are flying blind. With it, you can measure and improve continuously.
The companies that control the middle two layers (reasoning and action) are the ones most likely to become very large. The speech layer is becoming a utility. The observation layer is essential but often undervalued. The reasoning and action layers are where the intelligence and the integration lock-in live.
Pricing models that actually work
CaaS pricing is genuinely different from traditional SaaS, and the market has not settled on a winning model yet.
| Pricing Model | How It Works | Pros | Cons |
|---|---|---|---|
| Per minute | Charge for each minute of conversation | Simple, familiar (telephony model) | Misaligned with quality: the vendor earns more from longer calls |
| Per conversation | Flat fee per completed conversation | Predictable for buyer | Hard to define "conversation" boundaries |
| Per outcome | Charge when conversation achieves a goal (appointment booked, issue resolved) | Perfectly aligned with value | Hard to attribute outcomes, disputes |
| Platform + usage | Monthly platform fee plus per-conversation variable | Predictable base revenue, scales with usage | More complex to communicate |
| Per seat + AI credits | Traditional SaaS pricing with AI usage limits | Familiar to buyers | Misaligns conversation volume with cost |
The per-outcome model is the most intellectually appealing (you pay only when the AI delivers value) but the hardest to implement (how do you prove the AI booked the appointment, not the patient who was going to book anyway?). Most CaaS companies are settling on platform-plus-usage, which gives them a revenue floor and a growth lever.
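A toy calculation makes the divergence between these models tangible. The rates and volumes below are invented for illustration; real CaaS pricing varies widely by vertical and volume.

```python
# Hypothetical monthly bills under three of the pricing models above.

def per_minute(minutes: float, rate: float = 0.10) -> float:
    return minutes * rate

def per_conversation(conversations: int, fee: float = 0.75) -> float:
    return conversations * fee

def platform_plus_usage(conversations: int,
                        platform_fee: float = 500.0,
                        per_conv: float = 0.40) -> float:
    return platform_fee + conversations * per_conv

# 10,000 conversations averaging 4 minutes each:
convs, avg_min = 10_000, 4.0
bills = {
    "per_minute": per_minute(convs * avg_min),          # 4,000.00
    "per_conversation": per_conversation(convs),         # 7,500.00
    "platform_plus_usage": platform_plus_usage(convs),   # 4,500.00
}
```

The per-minute model's quality problem shows up immediately in numbers like these: a better agent that resolves calls in 3 minutes instead of 4 cuts the vendor's own revenue by 25 percent.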
What this means for teams building AI agents
If you are building AI agents today, whether for customer support, sales, healthcare, or any other domain, here is the practical takeaway.
Voice will be one of your channels, not your only channel. Build your agent's reasoning and action layers to be channel-agnostic. The same logic that handles a voice call should handle a chat message, a text, or an API request. The speech layer is interchangeable. The brain is not. Platforms like Chanl are built around this principle, letting you connect agents to multiple channels while maintaining a single source of truth for configuration and monitoring.
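The channel-agnostic principle is simple enough to sketch: one reasoning function, several thin channel adapters. The function and handler names here are illustrative, not any specific platform's API.

```python
# One "brain," many channels. Each adapter only reshapes input/output;
# none of them contain conversation logic.

def agent_logic(text: str) -> str:
    """Channel-independent reasoning. In production this calls an LLM;
    here it is a keyword stub for illustration."""
    if "order status" in text.lower():
        return "Your order shipped yesterday."
    return "How can I help you today?"

def handle_voice(transcript: str) -> str:
    # Speech-to-text happens upstream; the brain only ever sees text.
    return agent_logic(transcript)

def handle_chat(message: str) -> str:
    return agent_logic(message)

def handle_api(payload: dict) -> dict:
    return {"reply": agent_logic(payload["text"])}
```

Because the adapters are stateless pass-throughs, adding a new channel (SMS, a web widget) means writing one more thin function, not forking the agent.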
Invest more in the action layer than the speech layer. The biggest gap in most voice agents is not speech quality. It is the ability to actually do things. An agent that understands perfectly but cannot check inventory, update a record, or schedule an appointment is just a polite dead end. Tool integration is the difference between a demo and a product.
Measure conversation quality from day one. The CaaS companies that will win long-term are the ones that know, quantitatively, how well their agents perform. Not "it seems to be working" but "our first-call resolution rate is 73%, up from 68% last month, with a 4.2/5 average customer satisfaction score." Scorecards and analytics are not nice-to-haves. They are the feedback loop that drives improvement.
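The two metrics quoted above are cheap to compute from day one. Here is a minimal sketch over a log of hypothetical conversation records; the field names and the log format are assumptions, not a standard schema.

```python
# First-call resolution rate and average CSAT from a conversation log.

conversations = [
    {"resolved_first_call": True,  "csat": 5},
    {"resolved_first_call": True,  "csat": 4},
    {"resolved_first_call": False, "csat": 2},
    {"resolved_first_call": True,  "csat": 5},
]

def first_call_resolution(log: list[dict]) -> float:
    """Fraction of conversations resolved without escalation or callback."""
    return sum(c["resolved_first_call"] for c in log) / len(log)

def avg_csat(log: list[dict]) -> float:
    """Mean customer satisfaction score on a 1-5 scale."""
    return sum(c["csat"] for c in log) / len(log)

print(f"FCR: {first_call_resolution(conversations):.0%}")   # FCR: 75%
print(f"CSAT: {avg_csat(conversations):.1f}/5")             # CSAT: 4.0/5
```

The hard part is not the arithmetic but the labeling: deciding, per conversation, whether it was actually resolved. That is what a scorecard pipeline automates.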
Do not bet everything on voice-first. Unless your specific use case is one where voice is clearly the best interface (hands-free contexts, phone-based workflows, accessibility needs), build for multiple interfaces. The future is multimodal, and the most valuable AI agents will move between voice, chat, and screen as the context demands.
So will the next SaaS giants be voice-first?
Probably not purely voice-first. But conversation-first? Very possibly.
The next giant SaaS companies will likely be the ones that master conversation as a capability layer: the orchestration, the tool integration, the quality monitoring, and the multi-channel deployment. They will not sell voice as a product. They will sell the ability to make any software accessible through natural conversation, whether that conversation happens over the phone, in a chat window, or through a voice assistant.
The companies best positioned for this are the ones building the infrastructure that makes agents production-ready. Not the speech providers (that is becoming a commodity). Not the LLM providers (that is a different, larger market). The companies in the middle: the ones that connect reasoning to action and make the whole thing observable and improvable.
That is the real business. Not voice-first SaaS. Conversation as infrastructure. And if history is any guide, the infrastructure company always ends up bigger than the applications built on top of it.
Build conversation infrastructure, not just voice features
Chanl gives your AI agents tools, knowledge, memory, and monitoring across every channel. Build once, deploy to voice, chat, and messaging.