AI voice has moved from demo to deployment. Funding is accelerating, latency is falling toward human conversational ranges, and real businesses are replacing slices of phone workflows with agents. At the same time, entire phone-heavy industries remain underserved, which is where the next wins will likely come from.
Why voice, why now
Voice is the most frequent and information-dense interface we use, and for the first time it’s programmable end to end. For enterprises, voice agents extend availability to 24/7, cut labor costs, and often outperform humans on consistency and memory. For consumers, many investors expect voice to become a primary way to interact with AI, from coaching to access to services that were previously out of reach. That framing comes through clearly in a16z’s January update on voice agents, which also argues we are shifting from infrastructure to applications and that “voice is the wedge,” not the whole product.
Two technical shifts unlocked adoption. First, conversational latency dropped under the threshold where speech feels natural. Startups focused on ultra-low-latency synthesis report sub-100 ms generation, helping agents respond in a human-like rhythm. Second, the cost curve bent down as platforms introduced realtime and cached pricing, plus smaller realtime models. Together those changes widened what is economically viable beyond high-value calls. See CB Insights’ consolidation note for the latency context and OpenAI’s updates for pricing mechanics.
Methodology
We track the space through three lenses:
- Follow the money. Funding rounds, acquisitions, and public-market signals.
- Follow the traction. Deployed use cases with visible KPIs.
- Find the overlooked veins. Niches with high call volume and low competition.
Funding leaders and consolidation signals
Money is pouring into voice AI at a pace that looks like a new platform cycle.
ElevenLabs raised an $80 million Series B in 2024 and followed with a $180 million Series C in January 2025 at a $3.3 billion valuation, underscoring investor conviction in core voice infrastructure.
Venture dollars aimed at voice are surging. The Wall Street Journal put voice-AI VC at roughly $315 million in 2022 and about $2.1 billion in 2024, a nearly seven-fold increase in two years.
On public markets, Investor’s Business Daily highlighted that SoundHound raised its 2025 revenue outlook to $157–$177 million, compared with about $85 million in 2024 revenue, backed by a $1.2 billion bookings backlog and a $140 billion TAM. Shares jumped nearly 12 percent on the update.
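The growth figures quoted above are simple ratios, and a short sketch makes them easy to re-check as new numbers come in. The function names here are illustrative, and the SoundHound comparison assumes the roughly $85 million figure is full-year 2024 revenue.

```python
def growth_multiple(start: float, end: float) -> float:
    """How many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


def annualized_growth(start: float, end: float, years: int) -> float:
    """Average yearly growth rate as a fraction (1.0 means +100 percent a year)."""
    if years <= 0:
        raise ValueError("years must be positive")
    return growth_multiple(start, end) ** (1 / years) - 1


# Figures as quoted in the text, in millions of USD.
VC_2022, VC_2024 = 315.0, 2100.0
SOUNDHOUND_2024 = 85.0  # assumption: treated as full-year 2024 revenue
SOUNDHOUND_2025_LOW, SOUNDHOUND_2025_HIGH = 157.0, 177.0

if __name__ == "__main__":
    print(f"Voice-AI VC multiple, 2022 to 2024: {growth_multiple(VC_2022, VC_2024):.2f}x")
    print(f"Annualized VC growth: {annualized_growth(VC_2022, VC_2024, 2):.0%}")
    low = growth_multiple(SOUNDHOUND_2024, SOUNDHOUND_2025_LOW)
    high = growth_multiple(SOUNDHOUND_2024, SOUNDHOUND_2025_HIGH)
    print(f"SoundHound 2025 outlook vs 2024: {low:.2f}x to {high:.2f}x")
```

The VC multiple comes out a little under 6.7x, consistent with "nearly seven-fold," and the SoundHound outlook implies revenue roughly doubling at the midpoint.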
Consolidation has started. Meta’s acquisition of PlayAI is a signal that big tech wants to own speech building blocks. CB Insights’ M&A watch list calls out ElevenLabs as a top target by its Mosaic score and elevates ultra-low-latency vendors like Cartesia as strategically important.
Where traction is real
Validated use cases have two things in common: high call volume and easy-to-measure ROI.
Restaurants and QSR. Ordering, drive-thru throughput, and missed-call capture. It’s not all smooth sailing: the WSJ recently reported that Taco Bell is rethinking how and when to deploy voice AI after mixed results at scale, a useful reminder that ops and context matter (WSJ).
Auto service and dealerships. Service scheduling and lead handling, especially during off hours.
Healthcare front offices. Appointment reminders, follow-ups, and prescription refills with audit trails.
Creators and publishers. Audiobooks, multilingual dubbing, and YouTube localization; ElevenLabs sits at the center of many of these workflows, and its ubiquity is cited as a reason it earned unicorn status.
a16z emphasizes the “wedge” pattern: start with one call type, prove value and compliance, then expand. That’s visible across categories from reminders to collections.
“Especially for larger enterprises, we’ve rarely seen a shift from full human call-taking → full AI call-taking immediately. Founders instead find a ‘wedge’ to start to capture what is often a small percentage of calls for a customer.” – Olivia Moore, a16z, January 2025
Case study wedges
Recruiting and staffing
Screening interviews are an unexpectedly strong early wedge. a16z cites staffing agencies replacing first-round screens with AI interviews, reporting that a Fortune 100 staffing partner saw about 90 percent of AI-screened candidates advance to the first round, compared with roughly half under human screeners. Agencies paid by candidate volume doubled throughput while gaining 24/7 scheduling and consistent scoring (a16z).
Banking and financial services
Financial services are one of the ripest wedges because the dollars are already concentrated in call centers and BPO. a16z’s fintech newsletter notes that banks account for roughly a quarter of global contact-center spend and more than $100 billion in annual BPO, and lays out a practical path starting with authenticated flows and tightly scoped calls like account questions, collections, or insurance FNOL. Integrations with legacy systems and on-prem constraints are table stakes, but the budgets are there (a16z fintech newsletter).
Agencies & builders
There is also a fast-moving layer of operators building on off-the-shelf platforms. Common stacks include AgentVoice, Vapi, and Retell, which let small teams ship production agents without owning every piece of the stack.
This layer looks like early SaaS agency work: freelancers selling “AI receptionist” packages on marketplaces, boutiques bundling voice with CRM and calendar integrations, and indie devs posting booking-rate screenshots on X.
Overlooked opportunities
The crowded categories get most of the headlines. The next wave of growth will likely come from quiet, phone-heavy niches.
Manufacturing and industrial suppliers. Vendor coordination, purchase orders, and logistics confirmations still happen by phone in many plants and supply houses. Few voice vendors market here.
BPO and outsourcing providers. With billions of hours already pointed at phones, AI augmentation is an existential question for providers and a margin unlock for their clients. a16z’s separate note on unbundling BPO outlines how AI lets operators productize repeatable workflows (a16z on BPO).
VA-heavy sectors. Real estate investors, online coaching, and long-tail ecommerce support already delegate via virtual assistants. Voice scales that behavior without hourly ceilings.
Healthcare back office. Pharmacy refills, eligibility checks, referral triage. The regulatory surface is larger, but so is the payoff when you automate high-repeat tasks.
Government and utilities. City hotlines, DMVs, and bill-pay lines are high-volume, steady-budget environments with low competition from startups.
Debt collection and insurance. Structured, high-repeat calls where compliant agents can deliver measurable outcomes quickly.
Challenges and slow movers
Not everyone is adapting quickly.
Legacy call-center stacks move slowly under the weight of multi-year enterprise contracts and complex integrations.
Overpromising tools hit scale walls when latency or costs spike, or when accents trip up recognition.
Low-quality outbound deployments risk consumer backlash and regulation.
Compliance paralysis stalls pilots. Teams that design verification and disclosures into flows from day one expand faster than those that bolt them on later.
Pricing, latency, and unit economics
Two levers dominate P&L for phone agents: token costs and round-trip time.
Token costs and realtime pricing. OpenAI has introduced cached pricing for realtime and added a lower-cost realtime model variant, pushing per-minute economics down and making routine admin calls pencil out where they didn’t before.
Latency and containment. CB Insights highlights that sub-300 ms experiences are the adoption tipping point, and vendors like Cartesia market sub-100 ms synthesis latency. Cartesia’s own materials break down pipeline targets for STT, LLM, and TTS, underscoring why shaving tens of milliseconds matters for hold times and hang-ups.
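Both levers reduce to back-of-envelope arithmetic. The sketch below is a minimal model, not any vendor's billing or latency formula. It assumes cost is driven by per-token prices with a discounted rate for cached input, and that STT, LLM, and TTS run strictly one after another with network time ignored. Every price, token count, and stage latency in it is a placeholder chosen for illustration. Only the 300 ms threshold comes from the text.

```python
from dataclasses import dataclass


def cost_per_minute_usd(
    input_tokens: float,
    output_tokens: float,
    cached_fraction: float,
    price_in: float,
    price_cached_in: float,
    price_out: float,
) -> float:
    """Model cost for one minute of conversation.

    Prices are in USD per million tokens. cached_fraction is the share of
    input tokens billed at the cached rate (0.0 to 1.0).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = fresh * price_in + cached * price_cached_in + output_tokens * price_out
    return total / 1_000_000


@dataclass(frozen=True)
class Stage:
    name: str
    millis: float


def round_trip_ms(stages: list[Stage]) -> float:
    """Total latency, assuming stages run strictly one after another."""
    return sum(stage.millis for stage in stages)


def within_budget(stages: list[Stage], budget_ms: float = 300.0) -> bool:
    """True when summed latency is under the budget (sub-300 ms by default)."""
    return round_trip_ms(stages) < budget_ms


# Placeholder inputs for illustration only; none of these are real vendor numbers.
PLACEHOLDER_PRICES = dict(price_in=10.0, price_cached_in=1.0, price_out=20.0)
SLOW_PIPELINE = [Stage("stt", 100), Stage("llm", 150), Stage("tts", 90)]
FAST_TTS_PIPELINE = [Stage("stt", 100), Stage("llm", 150), Stage("tts", 40)]

if __name__ == "__main__":
    uncached = cost_per_minute_usd(1000, 500, 0.0, **PLACEHOLDER_PRICES)
    cached = cost_per_minute_usd(1000, 500, 0.8, **PLACEHOLDER_PRICES)
    print(f"uncached: ${uncached:.4f}/min, 80% cached: ${cached:.4f}/min")
    for label, pipeline in [("slow", SLOW_PIPELINE), ("fast tts", FAST_TTS_PIPELINE)]:
        print(label, round_trip_ms(pipeline), within_budget(pipeline))
```

With these placeholder inputs, caching most of the input cuts the per-minute cost by about a third, and a 50 ms faster TTS stage is what moves the pipeline under the 300 ms line. Real deployments stream and overlap stages, so a sequential sum is a conservative upper bound rather than a measurement.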

Market size and growth trajectories
Analysts disagree on exact figures because definitions vary, but directionally the story is the same: steep growth.
AI voice generators. Grand View Research estimates the segment at about $3.56 billion in 2023, projecting $21.75 billion by 2030 at roughly 29–30 percent CAGR (Grand View Research).
Alternate forecast. MarketsandMarkets pegs it at $3.0 billion in 2024 to $20.4 billion by 2030 at 37.1 percent CAGR (MarketsandMarkets).
Voice AI agents. Market.us sees $2.4 billion in 2024 rising to $47.5 billion by 2034 at roughly 34.8 percent CAGR. Treat this as directional, but it is useful for sizing the agents layer (Market.us).
Broader speech and voice recognition. Fortune Business Insights estimates $15.46 billion in 2024, reaching $81.59 billion by 2032 at 23.1 percent CAGR, which captures STT, TTS, and traditional recognition software (Fortune Business Insights).
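Because the forecasts use different base years and horizons, it helps to recompute the implied CAGR from each start and end value before comparing them. The sketch below applies the standard compound-growth formula to the figures quoted above. The one assumption is that the first year each firm cites is the base year for compounding, which may not match the publisher's own forecast window.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.25 means 25 percent)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# (label, start in $B, start year, end in $B, end year), as quoted in the text.
FORECASTS = [
    ("Grand View Research, AI voice generators", 3.56, 2023, 21.75, 2030),
    ("MarketsandMarkets, AI voice generators", 3.0, 2024, 20.4, 2030),
    ("Market.us, voice AI agents", 2.4, 2024, 47.5, 2034),
    ("Fortune Business Insights, speech and voice recognition", 15.46, 2024, 81.59, 2032),
]


def implied_rates() -> dict[str, float]:
    """Implied CAGR per forecast, compounding from the first quoted year."""
    return {
        label: cagr(start, end, end_year - start_year)
        for label, start, start_year, end, end_year in FORECASTS
    }


if __name__ == "__main__":
    for label, rate in implied_rates().items():
        print(f"{label}: {rate:.1%}")
```

Under that assumption, each implied rate lands within about a percentage point of the CAGR the firm publishes. The small gap on the MarketsandMarkets figure is the kind of difference that rounded endpoints or a different base year would produce.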
Outlook
What to watch over the next 12 months:
Consolidation pace. Meta’s PlayAI deal is unlikely to be the last. Expect acquisitions that secure synthesis, streaming, and orchestration IP rather than just revenue. CB Insights’ framework is a useful lens here (CB Insights).
Pricing models. As realtime price curves bend down, expect platform-plus-usage hybrids, minimums, and vertical bundles.
Modality expansion. Voice agents will blend with chat, email, and screen-sharing flows.
Verticalization. “Voice-native” SaaS in niches like collections, dealership service, or recruiting.
Banking and recruiting as bellwethers. These wedges have clear KPIs and budgets, so their momentum is a leading indicator of broader adoption. See a16z’s banking note and the recruiting case study for early traction patterns (a16z fintech newsletter, a16z voice update).


