OpenAI Realtime API: When It's Worth the Latency Premium
The realtime API isn't always the right answer for voice agents. Here's the decision framework.
The Realtime API is OpenAI's audio-in, audio-out streaming model — no separate speech-to-text or text-to-speech stages; the model itself listens and speaks. It's a genuinely different experience, but it's not always the right call.
The latency numbers that matter
- Realtime API: ~300ms first-token latency, naturally interruptible, supports barge-in.
- Vapi / Retell with chained STT→LLM→TTS: ~700–1200ms, requires VAD tuning to handle interruptions.
- Self-rolled pipeline: latency depends on how much engineering you invest, typically 800–1500ms.
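The gap above comes from the fact that chained stages run serially, so their latencies add. A minimal sketch of that budget, using illustrative midpoint numbers rather than measurements:

```python
# Illustrative latency budget for a chained STT -> LLM -> TTS pipeline.
# Stage values are rough midpoints for the sake of the arithmetic,
# not benchmarks of any specific vendor.
CHAINED_STAGES_MS = {
    "vad_endpointing": 300,   # silence detection before the transcript finalizes
    "stt_final": 200,         # speech-to-text final result
    "llm_first_token": 350,   # LLM time to first token
    "tts_first_audio": 150,   # TTS time to first audio chunk
}

def total_latency_ms(stages: dict) -> int:
    """Chained stages are serial, so perceived latency is their sum."""
    return sum(stages.values())

print(total_latency_ms(CHAINED_STAGES_MS))  # 1000 -> inside the 700-1200ms band
```

Shaving any single stage only helps so much; the Realtime API wins by collapsing the stages entirely.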
Where Realtime wins
Use it when conversational naturalness is the product. Sales discovery, therapy bots, AI companions, language tutors. A ~300ms response falls below the perceptual threshold at which users stop noticing they're talking to a machine.
Where chained pipelines win
Use them when accuracy, cost, and auditability matter more than naturalness. Compliance hotlines, structured intake, medical history forms, anything you'll need to log and review. With a chained pipeline you get a clean transcript at every step — Realtime audio is harder to audit and re-process.
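The auditability point is concrete: in a chained pipeline, every stage hands you plain text that can go straight into a log. A minimal sketch of that idea (the `Turn` structure and field names are hypothetical, not any vendor's schema):

```python
# Sketch of why chained pipelines are easy to audit: the STT output and the
# LLM reply are both text, so each turn serializes to a reviewable record.
# With the Realtime API, the equivalent artifact is raw audio.
from dataclasses import dataclass, asdict
import json

@dataclass
class Turn:
    transcript: str   # text from the STT stage
    reply: str        # text from the LLM stage
    # TTS audio bytes would be stored separately, keyed to this record

def log_turn(turn: Turn) -> str:
    """Serialize one conversational turn as a JSON line for the audit log."""
    return json.dumps(asdict(turn))

record = log_turn(Turn("I need to update my address", "Sure, what's the new one?"))
print(record)
```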
Cost math at 1,000 minutes/day
Realtime: roughly $0.06/min input + $0.24/min output for GPT-4o = $300/day. Chained (Deepgram + GPT-4o + ElevenLabs): roughly $90/day. That ~3× cost premium rarely pencils out unless conversational quality directly affects conversion.
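The arithmetic behind those figures, using the per-minute rates quoted above (the chained side is backed out as a blended per-minute rate from the ~$90/day total, not itemized per vendor):

```python
MINUTES_PER_DAY = 1000

# Realtime API audio rates quoted in the article (per minute).
realtime_per_min = 0.06 + 0.24            # input + output
realtime_daily = realtime_per_min * MINUTES_PER_DAY
print(realtime_daily)                     # 300.0

# Chained pipeline: ~$90/day works out to a blended ~$0.09/min.
chained_daily = 90
chained_per_min = chained_daily / MINUTES_PER_DAY

premium = realtime_daily / chained_daily  # the ~3x differential
```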
Hybrid pattern we use
Realtime for the first 30 seconds (warm hello, intent detection), then switch to a chained pipeline for the structured part of the call (data collection, transactional steps). Best of both worlds, harder to engineer, worth it for high-value use cases.
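The handoff logic is the tricky part. A rough sketch of the time-boxed switch, where `realtime_session` and `chained_pipeline` are hypothetical stand-ins for whatever clients you actually use — the point is the control flow, not any specific SDK:

```python
# Hybrid pattern: Realtime for the opening window, chained pipeline after.
# `realtime_session.process` and `chained_pipeline.process` are assumed
# interfaces, not real SDK calls.
import time

REALTIME_WINDOW_S = 30  # warm hello + intent detection budget

def handle_call(audio_chunks, realtime_session, chained_pipeline):
    start = time.monotonic()
    intent = None
    for chunk in audio_chunks:
        in_window = time.monotonic() - start < REALTIME_WINDOW_S
        if in_window and intent is None:
            # Natural, low-latency phase: let the Realtime model drive
            # until it returns a detected intent.
            intent = realtime_session.process(chunk)
        else:
            # Structured phase: cheaper, auditable, transcript at every step.
            chained_pipeline.process(chunk, intent=intent)
```

One design choice worth noting: switching on *intent detected* rather than purely on the clock avoids cutting over mid-greeting when the caller is slow to state their need.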
Bottom line
Realtime is a premium feature, not a default. Use it where naturalness drives revenue. Use chained pipelines everywhere else.