The 2026 voice stack
| Layer | Best in class | Why |
|---|---|---|
| Telephony | Twilio or Telnyx | Reliable, programmable, BAA available |
| STT | Deepgram Nova-3 or Whisper large-v3 | 96%+ accuracy, sub-200ms partials |
| LLM | GPT-4o Realtime or Claude 3.5 | Realtime API for a sub-800ms loop; Claude for response quality |
| TTS | ElevenLabs Turbo v3 or Cartesia Sonic | Natural-sounding, sub-150ms first audio |
| Orchestration | Vapi, Retell, or custom on LiveKit | Handles barge-in, interruptions, turn-taking |
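
To make the orchestration row concrete, here is a minimal sketch of what "handles barge-in" means in code: LLM tokens are piped straight into TTS, and the speech task is cancelled the moment the caller starts talking. Everything in it is an assumption for illustration. `stream_llm_tokens`, `stream_tts_audio`, `speak`, and `run_turn` are hypothetical stand-ins, not calls from Vapi, Retell, LiveKit, or any vendor SDK, and the timings are simulated.

```python
import asyncio
from typing import AsyncIterator, List, Optional

# All functions below are hypothetical stand-ins, not real vendor SDK calls.

async def stream_llm_tokens(prompt: str, delay_s: float = 0.02) -> AsyncIterator[str]:
    """Yield reply tokens one at a time, simulating a streaming LLM."""
    for token in ["Sure,", "I", "can", "book", "that", "for", "you."]:
        await asyncio.sleep(delay_s)  # simulated per-token latency
        yield token

async def stream_tts_audio(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Turn each token into an 'audio chunk' as soon as it arrives (no buffering)."""
    async for token in tokens:
        yield token.encode("utf-8")  # placeholder for real audio bytes

async def speak(prompt: str, played: List[bytes]) -> None:
    """Pipe LLM tokens straight into TTS and record each chunk as 'played'."""
    async for chunk in stream_tts_audio(stream_llm_tokens(prompt)):
        played.append(chunk)

async def run_turn(prompt: str, barge_in_after_s: Optional[float] = None) -> List[bytes]:
    """Run one agent turn; if the caller interrupts, cancel the speech task."""
    played: List[bytes] = []
    speaking = asyncio.create_task(speak(prompt, played))
    if barge_in_after_s is not None:
        await asyncio.sleep(barge_in_after_s)  # simulated moment the caller starts talking
        speaking.cancel()                      # barge-in: stop generating and playing
    try:
        await speaking
    except asyncio.CancelledError:
        pass  # expected when the caller interrupted
    return played

if __name__ == "__main__":
    full = asyncio.run(run_turn("Book me for Tuesday"))
    cut = asyncio.run(run_turn("Book me for Tuesday", barge_in_after_s=0.05))
    print(len(full), "chunks without interruption;", len(cut), "chunks with barge-in")
```

In a real deployment the interruption signal would come from voice-activity detection on the inbound audio rather than a timer. The shape stays the same: a cancellable speech task per turn.
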
Vendor comparison
- Vapi — best mid-market default. $0.05–$0.12/min all-in. Hosted or BYO.
- Retell — easiest to deploy. Strong default voices.
- Synthflow — best no-code UI for non-engineering teams.
- Bland.ai — best raw outbound throughput.
- Custom (Twilio + Deepgram + GPT Realtime) — best when you need full control.
- PolyAI / Cresta — enterprise contact-center grade.
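
A per-minute price is easier to compare once it is turned into a monthly figure. The sketch below does that arithmetic for the Vapi range quoted above ($0.05 to $0.12 per minute all-in). The 10,000 minutes per month is a hypothetical volume chosen only for illustration, and `PerMinuteRate` and `monthly_cost_range` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PerMinuteRate:
    """An all-in price range in USD per call minute."""
    low_usd: float
    high_usd: float

def monthly_cost_range(rate: PerMinuteRate, minutes_per_month: int) -> Tuple[float, float]:
    """Return the (low, high) monthly spend for a given call volume."""
    return (rate.low_usd * minutes_per_month, rate.high_usd * minutes_per_month)

# Range quoted in the vendor list; the call volume is an assumed example value.
VAPI_ALL_IN = PerMinuteRate(low_usd=0.05, high_usd=0.12)
ASSUMED_MINUTES_PER_MONTH = 10_000

if __name__ == "__main__":
    low, high = monthly_cost_range(VAPI_ALL_IN, ASSUMED_MINUTES_PER_MONTH)
    print(f"${low:,.0f} to ${high:,.0f} per month")
```

At that assumed volume the range works out to roughly $500 to $1,200 per month. Swap in your own minutes and any other vendor's quoted rate to compare.
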
Latency math — why <800ms matters
Human conversation has a median turn-taking gap of roughly 600ms. Anything over 1.2s feels noticeably robotic. To hit <800ms end-to-end you need:
- STT partials < 200ms
- LLM first-token < 400ms (the hardest target to hit)
- TTS first audio < 150ms
- Network + jitter buffer < 50ms
Streaming everything end-to-end is non-negotiable. Buffering anywhere blows the budget.
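
The four ceilings above add up to exactly 800ms, so there is no slack: every stage has to come in under its own ceiling for the total to land below 800ms. The sketch below checks a set of measurements against that budget. The ceilings come from the list above. The measured values in the usage example are hypothetical, and `check_budget` is a name invented for this example.

```python
from typing import Dict, List, Tuple

BUDGET_MS = 800

# Per-stage ceilings taken from the list above (each stage must be strictly under).
STAGE_CEILINGS_MS: Dict[str, int] = {
    "stt_partials": 200,
    "llm_first_token": 400,
    "tts_first_audio": 150,
    "network_jitter": 50,
}

def check_budget(measured_ms: Dict[str, int],
                 budget_ms: int = BUDGET_MS) -> Tuple[int, bool, List[str]]:
    """Return (total latency, whether total is under budget, stages at or over ceiling)."""
    total = sum(measured_ms.values())
    over = [stage for stage, ms in measured_ms.items()
            if ms >= STAGE_CEILINGS_MS[stage]]
    return total, total < budget_ms, over

if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    sample = {"stt_partials": 180, "llm_first_token": 520,
              "tts_first_audio": 120, "network_jitter": 40}
    total, ok, over = check_budget(sample)
    print(f"total={total}ms within_budget={ok} over_ceiling={over}")
```

In this made-up sample the LLM first token alone pushes the loop past 800ms, which is the usual pattern: the other three stages rarely have enough headroom to absorb a slow model.
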
Use cases that pay back fastest
- After-hours appointment booking — pays back in 30–60 days for service businesses
- Insurance / mortgage intake — 70–85% of intake calls fully automated
- Outbound qualification — 3–5x throughput per "rep"
- Order status / FAQ deflection — 40–60% of inbound calls handled end-to-end
- Healthcare scheduling — recovers 25–40% of no-show appointments via reminders + rebooking
Top 6 voice-AI mistakes
- Choosing a generic voice whose accent or cadence is wrong for the audience, which tanks trust
- No barge-in support — feels robotic the second the caller tries to interrupt
- Long LLM responses — the agent should rarely speak more than 2 sentences before pausing
- No call recording / transcript review process
- Underestimating telephony complexity (carriers, STIR/SHAKEN, call answer rates)
- Skipping compliance (TCPA, plus recording-disclosure laws that vary by state)
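
One way to enforce the two-sentence guideline from the list above is a guard between the LLM and TTS that trims each reply before it is spoken. This is a minimal sketch, not a production approach: `cap_sentences` is a name invented for this example, and the regex splitter is deliberately naive (it will split after abbreviations such as "Dr."). Prompting the model for short replies is the first line of defense; a cap like this only catches what slips through.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def cap_sentences(text: str, max_sentences: int = 2) -> str:
    """Keep at most `max_sentences` sentences from an agent reply."""
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    return " ".join(sentences[:max_sentences])

if __name__ == "__main__":
    reply = "We have openings on Tuesday. Would 3pm work? I can also check Wednesday."
    print(cap_sentences(reply))
```
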
Full cost analysis vs human agents: Voice AI vs Human Agents 2026.