Creative Genius
Research · 2026-05-19 · 14 min read

We tested 11 voice AI platforms in production: latency, cost, accuracy

A 90-day production benchmark of every major voice AI platform — measured on real inbound calls, not vendor demos.

Methodology

We deployed 11 voice AI platforms — Vapi, Retell, Bland, Synthflow, Voiceflow, Air.ai, Cresta, Convoso AI, Replicant, PolyAI, and a custom build on Twilio Media Streams + Deepgram + OpenAI Realtime — to a portfolio of 8 client production numbers. Total call volume: 12,400 calls over 90 days. Use cases: inbound sales, appointment booking, service intake, customer support.

Every platform was configured with equivalent prompts, the same voice (ElevenLabs "Rachel" where supported), and equivalent tools/functions. Calls were routed using a round-robin between platforms so each platform saw a comparable distribution of call types, accents, and times of day.
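The round-robin routing described above can be sketched as follows. This is a minimal illustration, not the routing code used in the benchmark: it assumes a single global rotation across all numbers, and `make_router` and the `call-N` identifiers are hypothetical names.

```python
from collections import Counter
from itertools import cycle

# Platform names come from the article; the routing code itself is illustrative.
PLATFORMS = [
    "Vapi", "Retell", "Bland", "Synthflow", "Voiceflow", "Air.ai",
    "Cresta", "Convoso AI", "Replicant", "PolyAI", "Custom",
]


def make_router(platforms):
    """Return a function that assigns each incoming call to the next platform in turn."""
    rotation = cycle(platforms)

    def route(call_id):
        # Plain round-robin ignores the call itself; call_id is kept only for logging.
        return next(rotation)

    return route


route = make_router(PLATFORMS)
assignments = [route(f"call-{i}") for i in range(12_400)]
counts = Counter(assignments)

# 12,400 calls over 11 platforms: every platform receives 1,127 or 1,128 calls.
print(min(counts.values()), max(counts.values()))
```

Because assignment does not depend on who is calling or when, each platform's share of call types, accents, and times of day should converge as volume grows, which is the property the comparison relies on.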

Measurements were captured server-side via WebSocket logs and ground-truthed against full call recordings. We did not rely on vendor-reported metrics.

End-to-end latency (time from user-stopped-talking to first AI sound)

Platform | p50 | p95 | p99
Custom (Twilio + Deepgram + GPT-4o Realtime) | 540ms | 820ms | 1.1s
Vapi | 620ms | 910ms | 1.4s
Retell | 680ms | 980ms | 1.5s
Bland | 720ms | 1.1s | 1.8s
Synthflow | 880ms | 1.3s | 2.1s
Voiceflow | 960ms | 1.4s | 2.4s
Air.ai | 1.1s | 1.6s | 2.9s
PolyAI | 1.3s | 1.8s | 2.8s
Replicant | 1.4s | 2.1s | 3.4s
Cresta | 1.5s | 2.2s | 3.6s
Convoso AI | 1.7s | 2.4s | 3.9s

Key insight: roughly 800ms is the line between "feels human" and "feels like a robot." At p50, the custom stack, Vapi, Retell, and Bland all stayed under it. At p95 no platform did, and only the custom stack, Vapi, and Retell held p95 under 1s on standard infrastructure.
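For readers who want to reproduce the latency metric, here is a minimal sketch of how per-turn latency and the p50/p95/p99 columns could be derived from server-side event logs. The event names `user_speech_end` and `agent_audio_start` are assumptions for illustration, not any vendor's log schema, and the nearest-rank percentile is one common definition among several.

```python
import math


def turn_latencies_ms(events):
    """Pair each 'user_speech_end' with the next 'agent_audio_start'.

    `events` is a time-ordered list of (timestamp_ms, event_name) tuples.
    Returns the gap in milliseconds for every completed turn.
    """
    latencies = []
    pending_end = None
    for ts, name in events:
        if name == "user_speech_end":
            pending_end = ts  # keep the most recent end-of-speech marker
        elif name == "agent_audio_start" and pending_end is not None:
            latencies.append(ts - pending_end)
            pending_end = None
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical three-turn call, timestamps in milliseconds.
events = [
    (0, "user_speech_end"), (540, "agent_audio_start"),
    (5000, "user_speech_end"), (5820, "agent_audio_start"),
    (9000, "user_speech_end"), (10100, "agent_audio_start"),
]
samples = turn_latencies_ms(events)
print(samples, percentile(samples, 50), percentile(samples, 95))
```

In practice the hard part is the end-of-speech timestamp itself, which is why ground-truthing against recordings matters: an endpointing detector that fires late makes the measured latency look better than what the caller experienced.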

Transcription accuracy (word error rate on a calibrated U.S. accent set)

Platform | STT engine | WER (clean) | WER (noisy)
Custom | Deepgram Nova-3 | 4.1% | 8.9%
Vapi | Deepgram (default) | 4.4% | 9.2%
Retell | OpenAI Whisper / Deepgram switchable | 4.6% | 9.7%
Bland | Proprietary | 5.2% | 11.3%
PolyAI | Proprietary + Google | 5.4% | 10.8%
Synthflow | Deepgram / OpenAI | 5.5% | 11.9%
Voiceflow | Google Cloud STT | 6.1% | 13.4%
Air.ai | Proprietary | 6.4% | 14.2%
Cresta | Proprietary | 6.8% | 14.9%
Replicant | Proprietary | 7.2% | 15.6%
Convoso AI | Proprietary | 8.1% | 17.8%

Key insight: Deepgram-based stacks dominate clean-audio WER. The gap widens dramatically in noisy environments — relevant for any call coming from a car, a kitchen, or a job site.
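Word error rate is the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below shows the standard calculation. It assumes simple lowercase, whitespace-split normalization, which may differ from the normalization used for the numbers above, and the sample sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word ("me") and one substitution ("tuesday" -> "thursday"): 2 errors / 6 words.
wer = word_error_rate(
    "book me an appointment for tuesday",
    "book an appointment for thursday",
)
print(f"{wer:.1%}")
```

For a whole test set, the usual convention is to sum edits and reference words across all utterances before dividing, rather than averaging per-utterance rates, so short utterances do not dominate the result.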

All-in cost per minute (telephony + STT + LLM + TTS)

Platform | $/min (typical) | $/min (heavy tool use)
Custom (Twilio + Deepgram + GPT-4o + ElevenLabs) | $0.09 | $0.14
Vapi | $0.11 | $0.18
Retell | $0.12 | $0.19
Bland | $0.13 | $0.21
Synthflow | $0.15 | $0.24
Voiceflow | $0.18 | $0.29
Air.ai | $0.22 | $0.34
PolyAI | Enterprise quote only
Cresta | Enterprise quote only
Replicant | Enterprise quote only
Convoso AI | Bundled with dialer license

Key insight: the published-pricing platforms cluster between $0.09 and $0.22/min all-in at typical usage. Enterprise-only platforms typically come in at 3–5x that.
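To turn the per-minute rates into a budget figure, a rough projection can be sketched as below. The rates are the typical-usage numbers from the table above. The call volume and the 3-minute average call length are assumptions chosen for illustration, and the sketch ignores per-call minute rounding and platform subscription fees.

```python
# Typical all-in rates in cents per minute, taken from the table above.
TYPICAL_CENTS_PER_MIN = {
    "Custom": 9,
    "Vapi": 11,
    "Retell": 12,
    "Bland": 13,
    "Synthflow": 15,
    "Voiceflow": 18,
    "Air.ai": 22,
}


def monthly_cost_usd(platform, calls_per_month, avg_call_minutes):
    """Projected monthly spend in dollars: rate x calls x average minutes per call."""
    cents = TYPICAL_CENTS_PER_MIN[platform] * calls_per_month * avg_call_minutes
    return cents / 100


# Assumed workload: 5,000 calls per month at 3 minutes each (hypothetical).
for name in ("Custom", "Vapi", "Air.ai"):
    print(name, monthly_cost_usd(name, 5000, 3))
```

At this assumed volume the gap between the custom stack and Vapi is a few hundred dollars a month, which is worth weighing against the engineering time a custom build requires.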

Call completion rate (user got their need met without escalation)

Across 12,400 calls, weighted equally by use case:

  • Custom stack: 71% completion, 22% transferred to human, 7% disconnected
  • Vapi: 69% / 24% / 7%
  • Retell: 68% / 25% / 7%
  • Bland: 64% / 27% / 9%
  • Synthflow: 60% / 30% / 10%
  • Voiceflow: 58% / 32% / 10%
  • Air.ai: 53% / 33% / 14%
  • PolyAI: 67% / 27% / 6%
  • Cresta: 61% / 31% / 8%
  • Replicant: 55% / 33% / 12%
  • Convoso AI: 47% / 36% / 17%

Differences here are mostly driven by interruption handling and how well each platform's barge-in logic distinguishes a caller speaking from background noise.
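"Weighted equally by use case" means each use case's completion rate counts the same regardless of how many calls it received. A minimal sketch of that averaging, next to the pooled alternative, is shown here. The per-use-case counts are invented for illustration and are not the benchmark's data.

```python
def macro_completion_rate(outcomes_by_use_case):
    """Average the per-use-case completion rates so each use case counts equally."""
    rates = []
    for use_case, (completed, total) in outcomes_by_use_case.items():
        if total == 0:
            raise ValueError(f"no calls recorded for {use_case}")
        rates.append(completed / total)
    return sum(rates) / len(rates)


def pooled_completion_rate(outcomes_by_use_case):
    """Completed calls over all calls, so high-volume use cases dominate."""
    completed = sum(c for c, _ in outcomes_by_use_case.values())
    total = sum(t for _, t in outcomes_by_use_case.values())
    return completed / total


# Hypothetical (completed, total) counts per use case for one platform.
outcomes = {
    "inbound sales": (300, 400),
    "appointment booking": (90, 100),
    "service intake": (120, 200),
    "customer support": (150, 300),
}
print(round(macro_completion_rate(outcomes), 4), round(pooled_completion_rate(outcomes), 4))
```

The two figures diverge whenever volume is uneven across use cases, so equal weighting keeps a platform from looking better simply because it happened to receive more of the easier call types.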

Category winners

Best for engineering teams who want maximum control: custom on Twilio + Deepgram + OpenAI Realtime. Cheapest, lowest latency, fully owned. Requires real engineering effort to build correctly.

Best mid-market default: Vapi. Closest to custom-stack performance with managed infrastructure. Pricing is transparent. SDK + tool calling are well-designed.

Best for non-technical teams: Retell or Synthflow. Visual builders with reasonable performance.

Best for enterprise contact center replacement: PolyAI or Cresta. Higher cost, but they bring change-management and human-in-the-loop tooling that matters at 10K+ daily call volume.

Avoid for new builds: any platform whose 95th-percentile latency is above 1.5s. Callers hang up.

What to pick if you're shipping this quarter

If you're a $10M–$500M business with 50–5,000 monthly inbound calls, the practical answer is almost always Vapi or a custom stack. The line between them: Vapi if you don't have an engineering team, custom stack if you do.

Anyone telling you the answer is "Air.ai, definitely" or "you just need ChatGPT and Twilio" is selling either novelty or oversimplification. Voice AI is mature enough that the technical answer is mostly settled — Deepgram for STT, GPT-4o or Claude 3.5 Sonnet for reasoning, ElevenLabs or Cartesia for TTS, Twilio or Telnyx for the carrier — and the question becomes: who's going to wire it together and operate it well.

That's the part we do. If you want to talk through a specific use case, get in touch.


Cite as: Creative Genius (2026). Voice AI Platform Benchmark Q1 2026. Retrieved from creativegenius.ai/research/voice-ai-benchmark-2026

FAQs

Why did you exclude vendor-reported latency numbers?

Every vendor measures latency differently. We measured end-to-end (caller-stopped-talking to first-audio-from-AI) on the same infrastructure for every platform, so the numbers are directly comparable. Most vendor-reported numbers measure only the LLM portion.

Are these numbers reproducible?

Yes. The configurations are documented, and an anonymized call sample is available on request. Email research@creativegenius.ai to request the dataset.

Will you re-run this in 6 months?

Yes. The field is moving fast. We'll re-benchmark every 6 months and publish updates.

Did vendors pay to be included?

No. No vendor relationship influenced the inclusion or the results. We pay for our own platform credits like any customer.
