👋 Hey, I’m Ivan. I write a newsletter about startups and investing. I share market maps, playbooks and tactical resources for founders surfing tech waves.
🌊 AI Voice Agents
2025 is going to go down as the year of Agents, and of Voice AI in particular.
We’ve been digging into voice and have made a few investments in the space (HappyRobot, Rauda, Konvo, and Altan, which just partnered with ElevenLabs). There’s a lot of emerging tech that makes this wave worth surfing.
So let's break down why now, and how big this could be:
1. Voice is the most natural interface, and it’s (finally) working
The most human interface has been the most painful one to use. Until now.
Humans love to talk. It’s how we build trust, handle urgency, feel heard.
But voice tech has always been broken: clunky tech, dropped calls, bad bots, endless hold music, and a few other horror stories we’re all familiar with.
AI changed that. For the first time, machines can listen, think, and talk back. All in real-time. Simple to the user. Wildly complex under the hood.
But it unlocks real stuff:
SMBs miss over 60% of inbound calls → voice agents can pick up 24/7
Customers don’t want to wait → AI doesn’t sleep
Agents can now schedule appointments, qualify leads, renew contracts, and more
And this goes beyond call centers.
For enterprises: voice agents directly replace human labor, often 10x cheaper, faster, and more consistent.
For consumers: voice feels native. It’s faster than typing, more intuitive than apps, and perfect for things like coaching, language learning, or even companionship.
For builders: voice is now a wedge. Infra is maturing. What matters next is the application layer: the workflows, verticals, and GTM that ride on top.
Bottom line: voice isn’t just working, it’s working better.
And it’s likely about to change how we work, buy, and communicate.
2. We got here after 5 tech waves:
“Please press 1 for rage”
Wave 1 — IVR Hell (1970s–2000s): Rigid phone menus (“Press 1 for billing”) defined early voice tech. Still a $5B+ market, despite being universally hated.
Wave 2 — STT Gets Usable (2010s–2021): Speech-to-text finally worked well enough for business. Gong turned sales calls into structured data. Google’s STT APIs brought real use cases online.
Wave 3 — The Whisper Moment (2022): OpenAI open-sourced Whisper, pushing transcription toward human-level accuracy. Suddenly, indie devs could build high-quality STT into apps for free.
Wave 4 — Voice 1.0 (2023–early 2024): Cascading stacks emerged: Voice → Text → LLM → Text → Voice. ChatGPT + ElevenLabs made agents sound decent, but latency sucked: brittle UX, long gaps, awkward timing.
Wave 5 — Speech-Native (2024–2025): Speech-to-Speech flips the stack. GPT-4o handles voice input/output natively: ~300ms latency, emotion, interruptions. Moshi runs locally, full-duplex. Hume adapts tone in real time.
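The Wave 4 latency problem falls straight out of the architecture: each stage must finish before the next one starts, so their delays add up. A toy Python sketch (all stage functions are stubs and the latency numbers are purely illustrative, not measurements of any real model):

```python
import time

# Illustrative per-stage latencies for a cascading "Voice 1.0" pipeline.
# Real numbers vary by provider; these are made up to show the effect.
STAGE_LATENCY_S = {"stt": 0.3, "llm": 0.8, "tts": 0.4}

def transcribe(audio: bytes) -> str:
    time.sleep(STAGE_LATENCY_S["stt"])   # speech -> text (stub)
    return "what time do you open tomorrow"

def reason(text: str) -> str:
    time.sleep(STAGE_LATENCY_S["llm"])   # text -> text (stub)
    return "We open at 9am tomorrow."

def synthesize(text: str) -> bytes:
    time.sleep(STAGE_LATENCY_S["tts"])   # text -> speech (stub)
    return text.encode()

def cascading_turn(audio: bytes) -> tuple[bytes, float]:
    """One conversational turn: the stages run strictly in sequence,
    so their latencies stack up before the caller hears anything."""
    start = time.perf_counter()
    reply = synthesize(reason(transcribe(audio)))
    return reply, time.perf_counter() - start

reply, latency = cascading_turn(b"...caller audio...")
print(f"turn latency: {latency:.1f}s")
```

Speech-native (Wave 5) models collapse the three stages into one, which is why sub-second, interruptible turns suddenly became possible.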
3. A new Voice Stack makes applications viable
Better infra unlocked faster iteration.
Voice agents used to be a full-stack nightmare.
You had to wrangle real-time audio, transcription, latency, barge-ins, TTS quality, and orchestration logic, just to ship a mediocre demo.
But in the last 18 months, a new modular stack emerged. Infra finally caught up.
Now, each layer has best-in-class players:
Models → GPT‑4o (speech-native reasoning), ElevenLabs (TTS), Deepgram (fast STT), Moshi (open-source S2S)
Infra → Vapi, Retell, Hume, LiveKit handle orchestration, emotion, memory, interruptions.
Apps → Examples like Rauda (CS), Konvo (CX), and a big wave we’ll discuss next.
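One way to read the "modular stack" point: the application layer codes against thin per-layer interfaces, so any best-in-class provider can be swapped in underneath without touching the workflow logic. A minimal sketch (the `Echo*` classes are stand-ins, not real SDK clients for any of the vendors above):

```python
from typing import Protocol

# Each layer of the stack is an interface; providers plug in behind it.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

# Stub implementations for illustration only.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class EchoTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

class VoiceApp:
    """Application layer: owns the workflow, not the audio plumbing."""
    def __init__(self, stt: STT, tts: TTS):
        self.stt, self.tts = stt, tts

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = f"You said: {text}"   # a real app would call an LLM here
        return self.tts.synthesize(reply)

app = VoiceApp(EchoSTT(), EchoTTS())
print(app.handle_turn(b"book me for 3pm"))  # b'You said: book me for 3pm'
```

Swapping providers means swapping the constructor arguments, which is exactly the flexibility that lets app-layer founders focus on workflows, verticals, and GTM.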
4. CS, Sales & Recruiting are leading the charge
These use cases are predictable, repetitive, and already voice-native.
You don’t need to convince anyone to “try voice”; it’s already how these teams operate:
Customer Support → Automate FAQs, renewals, triage, support tickets, etc.
Sales → Lead generation and enrichment, follow-ups, co-pilots, etc.
Recruiting → Building pipeline, running interviews
With stable infra and ripe jobs to be done, founders have started going vertical:
Healthcare → Follow-ups, scheduling, insurance calls (e.g. HelloPatient)
Financial Services → Loan servicing, collections (e.g. Salient, Kastle)
SMBs → Booking, lead capture, customer follow-up (e.g. Goodcall, Numa)
5. YC’s latest batches are packed with Voice AI
Healthcare, Sales, HR, Retail Ops, and Productivity.
The Spring 2025 batch was filled with application-layer voice agent companies:
Kavana – AI sales rep for distributors.
Trapeze – AI-native Zocdoc, likely includes voice booking.
Novoflow – AI receptionist for clinics.
Lyra – Voice-aware Zoom for sales.
Nomi – Copilot that listens to sales calls.
Willow – Voice interface replacing your keyboard.
Atlog – Voice agents for retail stores.
SynthioLabs – Voice AI medical rep.
VoiceOS – Automated voice interviews for hiring.
6. The Market Map (2025) is growing fast
This is where the stack comes together. We mapped out 100+ companies across the Voice AI landscape — models, infra, and apps.
Here’s how to decode it (and what founders need to know):