Methodology
We deployed 11 voice AI platforms — Vapi, Retell, Bland, Synthflow, Voiceflow, Air.ai, Cresta, Convoso AI, Replicant, PolyAI, and a custom build on Twilio Media Streams + Deepgram + OpenAI Realtime — to a portfolio of 8 client production numbers. Total call volume: 12,400 calls over 90 days. Use cases: inbound sales, appointment booking, service intake, customer support.
Every platform was configured with equivalent prompts, the same voice (ElevenLabs "Rachel" where supported), and equivalent tools/functions. Calls were routed using a round-robin between platforms so each platform saw a comparable distribution of call types, accents, and times of day.
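As a rough illustration of the routing described above, here is a minimal round-robin sketch. The `make_router` helper and the call IDs are hypothetical; the production routing ran at the telephony layer and is not shown in this report.

```python
from collections import Counter
from itertools import cycle

PLATFORMS = [
    "Vapi", "Retell", "Bland", "Synthflow", "Voiceflow", "Air.ai",
    "Cresta", "Convoso AI", "Replicant", "PolyAI", "Custom",
]


def make_router(platforms):
    """Return a function that hands each incoming call to the next platform in turn."""
    rotation = cycle(platforms)

    def route(call_id: str) -> str:
        # call_id is accepted only for logging; the choice depends purely on arrival order.
        return next(rotation)

    return route


route = make_router(PLATFORMS)
assignments = [route(f"call-{i}") for i in range(22)]  # hypothetical call IDs
counts = Counter(assignments)
```

Because assignment depends only on arrival order, each platform receives the same number of calls (within one), and call type, accent, and time of day are spread across platforms rather than concentrated on any one of them.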
Measurements were captured server-side via WebSocket logs and ground-truthed against full call recordings. We did not rely on vendor-reported metrics.
End-to-end latency (time from user-stopped-talking to first AI sound)
| Platform | p50 | p95 | p99 |
|---|---|---|---|
| Custom (Twilio + Deepgram + GPT-4o Realtime) | 540ms | 820ms | 1.1s |
| Vapi | 620ms | 910ms | 1.4s |
| Retell | 680ms | 980ms | 1.5s |
| Bland | 720ms | 1.1s | 1.8s |
| Synthflow | 880ms | 1.3s | 2.1s |
| Voiceflow | 960ms | 1.4s | 2.4s |
| Air.ai | 1.1s | 1.6s | 2.9s |
| PolyAI | 1.3s | 1.8s | 2.8s |
| Replicant | 1.4s | 2.1s | 3.4s |
| Cresta | 1.5s | 2.2s | 3.6s |
| Convoso AI | 1.7s | 2.4s | 3.9s |
Key insight: a p95 of roughly 800ms is the line between "feels human" and "feels like a robot." No platform held its aggregate p95 under that mark. The custom stack came closest at 820ms, and only the custom stack, Vapi, and Retell kept p95 under one second on standard infrastructure.
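For readers who want to reproduce the percentile columns from their own logs, the sketch below pairs each end-of-speech event with the next AI audio event on the same call, then takes nearest-rank percentiles. The event names (`user_speech_end`, `ai_audio_start`) and the log shape are assumptions for illustration, not the schema of any platform's WebSocket messages.

```python
import math


def turn_latencies_ms(events):
    """Pair each 'user_speech_end' with the next 'ai_audio_start' on the same call."""
    pending = {}  # call_id -> timestamp of the most recent unanswered end-of-speech
    latencies = []
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        call_id = event["call_id"]
        if event["type"] == "user_speech_end":
            pending[call_id] = event["ts_ms"]
        elif event["type"] == "ai_audio_start" and call_id in pending:
            latencies.append(event["ts_ms"] - pending.pop(call_id))
    return latencies


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical log excerpt covering two calls and three turns.
events = [
    {"call_id": "a", "type": "user_speech_end", "ts_ms": 1000},
    {"call_id": "a", "type": "ai_audio_start", "ts_ms": 1540},
    {"call_id": "b", "type": "user_speech_end", "ts_ms": 2000},
    {"call_id": "b", "type": "ai_audio_start", "ts_ms": 2820},
    {"call_id": "a", "type": "user_speech_end", "ts_ms": 5000},
    {"call_id": "a", "type": "ai_audio_start", "ts_ms": 5600},
]
latencies = turn_latencies_ms(events)  # [540, 820, 600]
p50 = percentile(latencies, 50)        # 600
p95 = percentile(latencies, 95)        # 820
```

Nearest-rank is only one of several percentile conventions. With thousands of turns per platform the choice barely moves p50, but it can shift p99 noticeably on small samples.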
Transcription accuracy (word error rate on a calibrated U.S. accent set)
| Platform | STT engine | WER (clean) | WER (noisy) |
|---|---|---|---|
| Custom | Deepgram Nova-3 | 4.1% | 8.9% |
| Vapi | Deepgram (default) | 4.4% | 9.2% |
| Retell | OpenAI Whisper / Deepgram switchable | 4.6% | 9.7% |
| Bland | Proprietary | 5.2% | 11.3% |
| PolyAI | Proprietary + Google | 5.4% | 10.8% |
| Synthflow | Deepgram / OpenAI | 5.5% | 11.9% |
| Voiceflow | Google Cloud STT | 6.1% | 13.4% |
| Air.ai | Proprietary | 6.4% | 14.2% |
| Cresta | Proprietary | 6.8% | 14.9% |
| Replicant | Proprietary | 7.2% | 15.6% |
| Convoso AI | Proprietary | 8.1% | 17.8% |
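Key insight: the three lowest clean-audio WERs (4.1% to 4.6%) all come from stacks that run on Deepgram or can be switched to it. The spread between best and worst roughly doubles in noisy conditions, from 4.0 points on clean audio (4.1% vs. 8.1%) to 8.9 points on noisy audio (8.9% vs. 17.8%). That matters for any call coming from a car, a kitchen, or a job site.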
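Word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below is a minimal version using word-level edit distance. The lowercase-and-split normalization and the sample sentences are assumptions; real scoring pipelines usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word ("an") and one substitution (tuesday -> thursday).
wer = word_error_rate(
    "I need to book an appointment for Tuesday",
    "I need to book appointment for Thursday",
)  # 2 errors / 8 reference words = 0.25
```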
All-in cost per minute (telephony + STT + LLM + TTS)
| Platform | $/min (typical) | $/min (heavy tool use) |
|---|---|---|
| Custom (Twilio + Deepgram + GPT-4o + ElevenLabs) | $0.09 | $0.14 |
| Vapi | $0.11 | $0.18 |
| Retell | $0.12 | $0.19 |
| Bland | $0.13 | $0.21 |
| Synthflow | $0.15 | $0.24 |
| Voiceflow | $0.18 | $0.29 |
| Air.ai | $0.22 | $0.34 |
| PolyAI | Enterprise quote only | — |
| Cresta | Enterprise quote only | — |
| Replicant | Enterprise quote only | — |
| Convoso AI | Bundled with dialer license | — |
Key insight: the published-pricing platforms cluster between $0.09 and $0.22/min all-in at typical usage. Enterprise-only platforms typically come in at 3–5x that.
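To turn the per-minute figures into a monthly budget, multiply the rate by call volume and average call length. The sketch below uses the typical rates from the table above. The 3-minute average call length and the 5,000-call volume are assumptions for illustration only.

```python
# Typical all-in rates from the table above, stored in cents to keep the arithmetic exact.
TYPICAL_CENTS_PER_MIN = {
    "Custom": 9,
    "Vapi": 11,
    "Retell": 12,
    "Bland": 13,
    "Synthflow": 15,
    "Voiceflow": 18,
    "Air.ai": 22,
}


def monthly_cost_usd(platform: str, calls_per_month: int, avg_minutes_per_call: float) -> float:
    """Estimated monthly spend in dollars at the platform's typical per-minute rate."""
    total_cents = TYPICAL_CENTS_PER_MIN[platform] * calls_per_month * avg_minutes_per_call
    return round(total_cents / 100, 2)


# Assumed workload: 5,000 calls per month averaging 3 minutes each.
custom = monthly_cost_usd("Custom", 5000, 3)  # 1350.0
vapi = monthly_cost_usd("Vapi", 5000, 3)      # 1650.0
difference = vapi - custom                    # 300.0
```

At this assumed volume the gap between the custom stack and Vapi is a few hundred dollars a month, so the engineering time needed to build and operate a custom stack is likely the larger cost.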
Call completion rate (user got their need met without escalation)
Across 12,400 calls, weighted equally by use case:
- Custom stack: 71% completion, 22% transferred to human, 7% disconnected
- Vapi: 69% / 24% / 7%
- Retell: 68% / 25% / 7%
- Bland: 64% / 27% / 9%
- Synthflow: 60% / 30% / 10%
- Voiceflow: 58% / 32% / 10%
- Air.ai: 53% / 33% / 14%
- PolyAI: 67% / 27% / 6%
- Cresta: 61% / 31% / 8%
- Replicant: 55% / 33% / 12%
- Convoso AI: 47% / 36% / 17%
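One way to read "weighted equally by use case" is as a macro-average: compute the completion rate within each use case, then take the unweighted mean of those rates, so a high-volume use case cannot dominate the result. The sketch below contrasts that with a pooled rate. The per-use-case counts are made up for illustration and are not the benchmark's data.

```python
# Hypothetical counts for one platform: use case -> (completed calls, total calls).
calls_by_use_case = {
    "inbound sales": (70, 100),
    "appointment booking": (180, 200),
    "service intake": (30, 50),
    "customer support": (26, 50),
}


def macro_average_rate(counts):
    """Mean of per-use-case completion rates; each use case counts equally."""
    rates = [completed / total for completed, total in counts.values()]
    return sum(rates) / len(rates)


def pooled_rate(counts):
    """Completed calls over all calls; high-volume use cases dominate."""
    completed = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return completed / total


macro = macro_average_rate(calls_by_use_case)  # (0.70 + 0.90 + 0.60 + 0.52) / 4, about 0.68
pooled = pooled_rate(calls_by_use_case)        # 306 / 400 = 0.765
```

In this made-up example the pooled rate is higher because appointment booking, the easiest use case, also has the most calls. Equal weighting removes that effect.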
Differences here are mostly driven by interruption handling and how well each platform's barge-in logic distinguishes a caller speaking from background noise.
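To make the barge-in problem concrete, here is a deliberately simple sketch: treat audio as a caller interruption only when frame energy stays well above the noise floor for several consecutive frames. This is not any vendor's logic. The function name, the RMS inputs, and the `margin` and `min_frames` defaults are all assumptions, and production systems typically use trained voice-activity models rather than a fixed energy gate.

```python
def detect_barge_in(frame_rms, noise_floor, margin=3.0, min_frames=5):
    """Return the index of the frame where barge-in is confirmed, or None.

    A frame counts as speech-like when its RMS energy exceeds noise_floor * margin.
    Barge-in is confirmed after min_frames consecutive speech-like frames, so short
    bursts (a door slam, a dish clatter) do not interrupt the agent.
    """
    threshold = noise_floor * margin
    run = 0
    for index, rms in enumerate(frame_rms):
        if rms > threshold:
            run += 1
            if run >= min_frames:
                return index
        else:
            run = 0  # the burst ended before it was long enough; start over
    return None


# Hypothetical per-frame RMS values with a noise floor of 0.02 (threshold 0.06).
door_slam = [0.02, 0.50, 0.40, 0.02, 0.02, 0.02]
caller_speech = [0.02, 0.10, 0.12, 0.11, 0.09, 0.10, 0.10]

slam_result = detect_barge_in(door_slam, noise_floor=0.02)        # None
speech_result = detect_barge_in(caller_speech, noise_floor=0.02)  # 5
```

The trade-off is visible in `min_frames`: a larger value rejects more background noise but delays the moment the agent stops talking, which callers perceive as being talked over.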
Category winners
Best for engineering teams who want maximum control: custom on Twilio + Deepgram + OpenAI Realtime. Cheapest, lowest latency, fully owned. Requires real engineering effort to build correctly.
Best mid-market default: Vapi. Closest to custom-stack performance with managed infrastructure. Pricing is transparent. SDK + tool calling are well-designed.
Best for non-technical teams: Retell or Synthflow. Visual builders with reasonable performance.
Best for enterprise contact center replacement: PolyAI or Cresta. Higher cost, but they bring change-management and human-in-the-loop tooling that matters at 10K+ daily call volume.
Avoid for new builds: any platform whose 95th-percentile latency is above 1.5s. Callers hang up.
What to pick if you're shipping this quarter
If you're a $10M–$500M business with 50–5,000 monthly inbound calls, the practical answer is almost always Vapi or a custom stack. The line between them: Vapi if you don't have an engineering team, custom stack if you do.
Anyone telling you the answer is "Air.ai, definitely" or "you just need ChatGPT and Twilio" is selling either novelty or oversimplification. Voice AI is mature enough that the technical answer is mostly settled — Deepgram for STT, GPT-4o or Claude 3.5 Sonnet for reasoning, ElevenLabs or Cartesia for TTS, Twilio or Telnyx for the carrier — and the question becomes: who's going to wire it together and operate it well.
That's the part we do. If you want to talk through a specific use case, get in touch.
Cite as: Creative Genius (2026). Voice AI Platform Benchmark Q1 2026. Retrieved from creativegenius.ai/research/voice-ai-benchmark-2026