Why did you exclude vendor-reported latency numbers?

Every vendor measures latency differently. We measured end-to-end (caller-stopped-talking to first-audio-from-AI) on the same infrastructure for every platform, so the numbers are directly comparable. Most vendor-reported numbers measure only the LLM portion.

Are these numbers reproducible?

Yes. The configurations are documented, and an anonymized call sample is available on request. Email research@creativegenius.ai to request the dataset.

Will you re-run this in 6 months?

Yes. The field is moving fast. We'll re-benchmark every 6 months and publish updates.

Did vendors pay to be included?

No. No vendor relationship influenced the inclusion or the results. We pay for our own platform credits like any customer.

We tested 11 voice AI platforms in production: latency, cost, accuracy

Methodology

We deployed 11 voice AI platforms across a portfolio of 8 client production phone numbers: Vapi, Retell, Bland, Synthflow, Voiceflow, Air.ai, Cresta, Convoso AI, Replicant, PolyAI, and a custom build on Twilio Media Streams + Deepgram + OpenAI Realtime. Total call volume: 12,400 calls over 90 days. Use cases: inbound sales, appointment booking, service intake, customer support.

Every platform was configured with equivalent prompts, the same voice (ElevenLabs "Rachel" where supported), and equivalent tools/functions. Calls were routed round-robin across platforms, so each platform saw a comparable distribution of call types, accents, and times of day.
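
As a rough illustration of the round-robin routing described above, the sketch below rotates incoming calls across the 11 platforms. The router function and the use of a simple call counter are assumptions for illustration, not the routing code used in the benchmark.

```python
from collections import Counter
from itertools import cycle

# Platform names come from the methodology; the router itself is hypothetical.
PLATFORMS = ["Custom", "Vapi", "Retell", "Bland", "Synthflow", "Voiceflow",
             "Air.ai", "Cresta", "Convoso AI", "Replicant", "PolyAI"]


def make_router(platforms):
    """Return a function that assigns each incoming call to the next platform in turn."""
    rotation = cycle(platforms)

    def route(call_id):
        # call_id is unused here; a real router would log it with the assignment.
        return next(rotation)

    return route


route = make_router(PLATFORMS)
assignments = [route(call_id) for call_id in range(12_400)]
counts = Counter(assignments)

# With 12,400 calls and 11 platforms, every platform gets 1,127 or 1,128 calls.
print(min(counts.values()), max(counts.values()))  # 1127 1128
```

A strict rotation only balances call counts. A comparable mix of call types and times of day follows from the rotation being independent of what the call is about.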

Measurements were captured server-side via WebSocket logs and ground-truthed against full call recordings. We did not rely on vendor-reported metrics.
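
To make the server-side measurement concrete, here is a minimal sketch of deriving per-turn latency from a timestamped event log. The event names (`user_speech_end`, `agent_audio_start`) and the dictionary layout are hypothetical; the actual WebSocket log format is not described in this article.

```python
def turn_latencies_ms(events):
    """Pair each end-of-caller-speech event with the next first-AI-audio event."""
    latencies = []
    pending_end = None
    for event in sorted(events, key=lambda e: e["t_ms"]):
        if event["type"] == "user_speech_end":
            # If the caller resumes and stops again, keep the latest stop time.
            pending_end = event["t_ms"]
        elif event["type"] == "agent_audio_start" and pending_end is not None:
            latencies.append(event["t_ms"] - pending_end)
            pending_end = None
    return latencies


# Illustrative events only, not benchmark data.
events = [
    {"type": "user_speech_end", "t_ms": 1_000},
    {"type": "agent_audio_start", "t_ms": 1_560},
    {"type": "user_speech_end", "t_ms": 4_200},
    {"type": "agent_audio_start", "t_ms": 4_910},
]
print(turn_latencies_ms(events))  # [560, 710]
```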

End-to-end latency (time from user-stopped-talking to first AI sound)

Platform	p50	p95	p99
Custom (Twilio + Deepgram + GPT-4o Realtime)	540ms	820ms	1.1s
Vapi	620ms	910ms	1.4s
Retell	680ms	980ms	1.5s
Bland	720ms	1.1s	1.8s
Synthflow	880ms	1.3s	2.1s
Voiceflow	960ms	1.4s	2.4s
Air.ai	1.1s	1.6s	2.9s
PolyAI	1.3s	1.8s	2.8s
Replicant	1.4s	2.1s	3.4s
Cresta	1.5s	2.2s	3.6s
Convoso AI	1.7s	2.4s	3.9s

Key insight: roughly 800ms is the line between "feels human" and "feels like a robot." Only the custom stack, Vapi, and Retell came close to it at p95 (820ms, 910ms, and 980ms) on standard infrastructure; every other platform's p95 exceeded 1 second.
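
The p50, p95, and p99 columns above summarize per-turn latency samples. The sketch below uses the nearest-rank percentile method, which is an assumption: the article does not say which percentile method was used. The sample values are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds, not benchmark data.
latencies_ms = [480, 510, 530, 540, 560, 600, 650, 700, 820, 1100]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {'p50': 560, 'p95': 1100, 'p99': 1100}
```

With only ten samples, p95 and p99 collapse onto the single slowest turn, which is why tail percentiles need a large call sample to be meaningful.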

Transcription accuracy (word error rate on a calibrated U.S. accent set)

Platform	STT engine	WER (clean)	WER (noisy)
Custom	Deepgram Nova-3	4.1%	8.9%
Vapi	Deepgram (default)	4.4%	9.2%
Retell	OpenAI Whisper / Deepgram switchable	4.6%	9.7%
Bland	Proprietary	5.2%	11.3%
PolyAI	Proprietary + Google	5.4%	10.8%
Synthflow	Deepgram / OpenAI	5.5%	11.9%
Voiceflow	Google Cloud STT	6.1%	13.4%
Air.ai	Proprietary	6.4%	14.2%
Cresta	Proprietary	6.8%	14.9%
Replicant	Proprietary	7.2%	15.6%
Convoso AI	Proprietary	8.1%	17.8%

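Key insight: the three lowest clean-audio WERs all come from stacks that run Deepgram by default or as an option. The gap between best and worst widens from 4 points on clean audio to nearly 9 points on noisy audio, which is relevant for any call coming from a car, a kitchen, or a job site.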
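
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below is a plain implementation of that definition. The lowercase-and-split normalization is an assumption, and the benchmark's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "i need to book an appointment for tuesday"
hypothesis = "i need to book appointment for thursday"
# One deleted word ("an") and one substitution, over 8 reference words.
print(word_error_rate(reference, hypothesis))  # 0.25
```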

All-in cost per minute (telephony + STT + LLM + TTS)

Platform	$/min (typical)	$/min (heavy tool use)
Custom (Twilio + Deepgram + GPT-4o + ElevenLabs)	$0.09	$0.14
Vapi	$0.11	$0.18
Retell	$0.12	$0.19
Bland	$0.13	$0.21
Synthflow	$0.15	$0.24
Voiceflow	$0.18	$0.29
Air.ai	$0.22	$0.34
PolyAI	Enterprise quote only	—
Cresta	Enterprise quote only	—
Replicant	Enterprise quote only	—
Convoso AI	Bundled with dialer license	—

Key insight: the published-pricing platforms cluster between $0.09 and $0.22/min all-in at typical usage. Enterprise-only platforms typically come in at 3–5x that.
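
To translate per-minute rates into a monthly bill, multiply by call volume and average call length. The sketch below uses the typical $/min figures from the table above. The 5,000 calls per month and the 3-minute average call length are assumptions for illustration only.

```python
# $/min (typical) for the published-pricing platforms, copied from the table above.
TYPICAL_RATE = {
    "Custom": 0.09, "Vapi": 0.11, "Retell": 0.12, "Bland": 0.13,
    "Synthflow": 0.15, "Voiceflow": 0.18, "Air.ai": 0.22,
}


def monthly_cost(calls_per_month, avg_call_minutes, rate_per_minute):
    """Estimated monthly spend in dollars, rounded to cents."""
    return round(calls_per_month * avg_call_minutes * rate_per_minute, 2)


# Assumed workload: 5,000 calls/month at 3 minutes each (hypothetical).
for platform, rate in TYPICAL_RATE.items():
    print(f"{platform:<10} ${monthly_cost(5_000, 3, rate):>9,.2f}")
```

At that assumed volume, the spread between the cheapest and most expensive published rate is $0.13/min, or about $1,950 per month.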

Call completion rate (user got their need met without escalation)

Across 12,400 calls, weighted equally by use case:

Custom stack: 71% completion, 22% transferred to human, 7% disconnected
Vapi: 69% / 24% / 7%
Retell: 68% / 25% / 7%
Bland: 64% / 27% / 9%
Synthflow: 60% / 30% / 10%
Voiceflow: 58% / 32% / 10%
Air.ai: 53% / 33% / 14%
PolyAI: 67% / 27% / 6%
Cresta: 61% / 31% / 8%
Replicant: 55% / 33% / 12%
Convoso AI: 47% / 36% / 17%

Differences here are mostly driven by interruption handling and how well each platform's barge-in logic distinguishes a caller speaking from background noise.
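
One reading of "weighted equally by use case" is a macro-average: compute the completion rate within each use case, then average those rates so that a high-volume use case does not dominate. The sketch below follows that reading, which is an assumption about the method, and uses made-up call outcomes.

```python
from collections import defaultdict


def equal_weight_completion(calls):
    """Average the per-use-case completion rates, giving each use case equal weight."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for use_case, outcome in calls:
        totals[use_case] += 1
        if outcome == "completed":
            completed[use_case] += 1
    rates = [completed[use_case] / totals[use_case] for use_case in totals]
    return sum(rates) / len(rates)


# Illustrative outcomes, not benchmark data: 10 booking calls, 2 support calls.
calls = (
    [("booking", "completed")] * 8 + [("booking", "transferred")] * 2
    + [("support", "completed")] * 1 + [("support", "disconnected")] * 1
)
pooled = sum(outcome == "completed" for _, outcome in calls) / len(calls)
print(round(equal_weight_completion(calls), 2), round(pooled, 2))  # 0.65 0.75
```

The pooled rate (0.75) is pulled toward the larger booking sample, while the equal-weight rate (0.65) treats both use cases the same.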

Category winners

Best for engineering teams who want maximum control: custom on Twilio + Deepgram + OpenAI Realtime. Cheapest, lowest latency, fully owned. Requires real engineering effort to build correctly.

Best mid-market default: Vapi. Closest to custom-stack performance with managed infrastructure. Pricing is transparent. SDK + tool calling are well-designed.

Best for non-technical teams: Retell or Synthflow. Visual builders with reasonable performance.

Best for enterprise contact center replacement: PolyAI or Cresta. Higher cost, but they bring change-management and human-in-the-loop tooling that matters at 10K+ daily call volume.

Avoid for new builds: any platform whose 95th-percentile latency is above 1.5s. Callers hang up.

What to pick if you're shipping this quarter

If you're a $10M–$500M business with 50–5,000 monthly inbound calls, the practical answer is almost always Vapi or a custom stack. The line between them: Vapi if you don't have an engineering team, custom stack if you do.

Anyone telling you the answer is "Air.ai, definitely" or "you just need ChatGPT and Twilio" is selling either novelty or oversimplification. Voice AI is mature enough that the technical answer is mostly settled — Deepgram for STT, GPT-4o or Claude 3.5 Sonnet for reasoning, ElevenLabs or Cartesia for TTS, Twilio or Telnyx for the carrier — and the question becomes: who's going to wire it together and operate it well.

That's the part we do. If you want to talk through a specific use case, get in touch.

Cite as: Creative Genius (2026). Voice AI Platform Benchmark Q1 2026. Retrieved from creativegenius.ai/research/voice-ai-benchmark-2026
