Per-token sticker prices on LLM pricing pages are a poor way to estimate your bill. Real per-task cost is 3-10x more variable than the per-token price suggests.
Methodology
We ran four standardized production tasks (customer support response, sales email draft, document extraction, code generation) 1,000 times each across 14 models, with prompt caching where supported, real conversational context, and tool use enabled. For each run we measured total tokens, latency, and dollar cost per task.
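As a minimal sketch of how a per-task dollar figure can be derived from token counts, the snippet below prices one task from its uncached, cached, and output tokens and then takes the median over several runs. The `Price` and `Usage` types, the prices, and the token counts are all hypothetical placeholders for illustration; they are not the benchmark harness or any provider's actual rates.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (placeholder values, not real quotes)."""
    input: float
    cached_input: float
    output: float


@dataclass(frozen=True)
class Usage:
    """Token counts for one completed task, summed over every call it made."""
    input_tokens: int    # prompt tokens billed at the normal input rate
    cached_tokens: int   # prompt tokens served from the prompt cache
    output_tokens: int   # generated tokens


def task_cost(usage: Usage, price: Price) -> float:
    """Dollar cost of a single task run."""
    total = (
        usage.input_tokens * price.input
        + usage.cached_tokens * price.cached_input
        + usage.output_tokens * price.output
    )
    return total / 1_000_000


# Hypothetical price sheet and three hypothetical runs of the same task.
price = Price(input=1.00, cached_input=0.10, output=5.00)
runs = [
    Usage(input_tokens=2_000, cached_tokens=0, output_tokens=300),
    Usage(input_tokens=500, cached_tokens=1_500, output_tokens=300),
    Usage(input_tokens=500, cached_tokens=1_500, output_tokens=900),
]
costs = [task_cost(u, price) for u in runs]
median_cost = median(costs)
print(f"median cost per task: ${median_cost:.5f}")
```

Even with a fixed price sheet, the three runs differ by more than 2x because cache hits and output length change from run to run, which is the kind of variance a per-token price alone does not show.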
Cost per task (median across all 4 tasks)
- Claude 3.5 Haiku: $0.0011
- GPT-4o-mini: $0.0014
- Gemini 1.5 Flash: $0.0017
- Llama 3.3 70B (Groq): $0.0021
- Claude 3.5 Sonnet: $0.0089
- GPT-4o: $0.0094
- Gemini 1.5 Pro: $0.011
- Claude 3.7 Sonnet: $0.013
- GPT-4-turbo: $0.018
- o1-mini: $0.024
- Claude 3 Opus: $0.041
- o1: $0.087
- o1-pro: $0.31
- Self-hosted Llama 3.3 70B: variable; about $0.0008 marginal cost per task once fixed costs are amortized
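To see what these medians mean for a bill, the sketch below scales three of the figures above to a monthly volume. The volume of one million tasks is an assumed example, and the projection treats every task as costing the median, which real traffic will not do.

```python
# Median cost per task in USD, copied from the list above.
MEDIAN_COST_PER_TASK = {
    "Claude 3.5 Haiku": 0.0011,
    "GPT-4o": 0.0094,
    "o1-pro": 0.31,
}


def monthly_cost(model: str, tasks_per_month: int) -> float:
    """Rough monthly spend if every task cost the median for that model."""
    return MEDIAN_COST_PER_TASK[model] * tasks_per_month


TASKS_PER_MONTH = 1_000_000  # assumed volume, for illustration only

for model in MEDIAN_COST_PER_TASK:
    print(f"{model}: ${monthly_cost(model, TASKS_PER_MONTH):,.0f} per month")
```

At that assumed volume the spread is roughly $1,100 per month for Claude 3.5 Haiku against roughly $310,000 for o1-pro.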
Best price/quality by task type
- Customer support response: Claude 3.5 Haiku wins on cost + quality. GPT-4o-mini close second.
- Sales email drafting: Claude 3.5 Sonnet. The quality premium is worth the cost.
- Document extraction: GPT-4o-mini + structured outputs is the sweet spot.
- Code generation: Claude 3.7 Sonnet wins on quality, o1-mini wins on hard reasoning.
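One way to apply the picks above is a small routing table keyed by task type. This is a sketch only: the task-type keys, the `pick_model` helper, and the idea of treating the runner-up as a fallback are assumptions made for illustration, and the model names are plain labels rather than real API identifiers.

```python
# Ordered preferences per task type, taken from the list above.
# The first entry is the preferred model; later entries are fallbacks.
ROUTES: dict[str, tuple[str, ...]] = {
    "customer_support": ("Claude 3.5 Haiku", "GPT-4o-mini"),
    "sales_email": ("Claude 3.5 Sonnet",),
    "document_extraction": ("GPT-4o-mini",),
    "code_generation": ("Claude 3.7 Sonnet",),
    "code_generation_hard": ("o1-mini",),
}


def pick_model(task_type: str, unavailable: frozenset[str] = frozenset()) -> str:
    """Return the first preferred model for a task type that is not marked unavailable."""
    for model in ROUTES[task_type]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for task type {task_type!r}")


print(pick_model("customer_support"))
print(pick_model("customer_support", unavailable=frozenset({"Claude 3.5 Haiku"})))
```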
Prompt caching delivered real savings
Across Anthropic and OpenAI prompt caching, we measured a 38-71% cost reduction on workflows with shared system prompts. Caching with Anthropic's 5-minute and 1-hour cache TTLs was the most impactful single optimization across our test suite.
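A rough model of why shared system prompts benefit so much: the shared prefix is written to the cache once at a premium and then read at a discount on each later request inside the TTL. The sketch below assumes a write multiplier of 1.25x and a read multiplier of 0.10x relative to the base input price; treat both as illustrative parameters and check the provider's current pricing. It covers input tokens only (output tokens are not discounted) and is not the method used to produce the 38-71% figure.

```python
def input_cost_savings(
    prefix_tokens: int,
    unique_tokens: int,
    requests: int,
    write_mult: float = 1.25,  # assumed premium for writing the prefix to the cache
    read_mult: float = 0.10,   # assumed discount for reading the prefix from the cache
) -> float:
    """Fraction of input-token cost saved by caching a shared prefix.

    Assumes all requests arrive within one cache TTL, so the prefix is
    written once and read on every later request. Negative values mean
    caching cost more than it saved.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    baseline = requests * (prefix_tokens + unique_tokens)
    cached = (
        prefix_tokens * write_mult                       # first request writes the prefix
        + (requests - 1) * prefix_tokens * read_mult     # later requests read it
        + requests * unique_tokens                       # per-request tokens are never cached
    )
    return 1 - cached / baseline


ten_requests = input_cost_savings(prefix_tokens=1_500, unique_tokens=500, requests=10)
one_request = input_cost_savings(prefix_tokens=1_500, unique_tokens=500, requests=1)
print(f"10 requests in one TTL: {ten_requests:.1%} saved")
print(f"1 request in one TTL:   {one_request:.1%} saved")
```

Under these assumptions, a prefix that is reused ten times saves a little under 59% of input cost, while a prefix that is written and never reused costs about 19% more than not caching at all. That is why the TTL matters: savings depend on traffic arriving before the cache entry expires.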
Quality / cost tradeoffs
The top-tier models (Opus, o1) cost 5-30x the mid-tier and rarely produce 5-30x the value. Use them for the 5-10% of tasks where quality genuinely matters and route everything else to Haiku / 4o-mini / Flash.
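The effect of routing can be estimated with a weighted average of the median costs listed earlier. The sketch below uses Claude 3.5 Haiku, Claude 3.5 Sonnet, and Claude 3 Opus as stand-ins for the three tiers and an 80/15/5 traffic split as an assumed example. Because the medians span all four tasks, and the tasks sent to the top tier are likely harder than the median, this is a rough estimate only.

```python
# Median cost per task in USD for one representative model per tier,
# copied from the cost list earlier in this article.
TIER_COST = {
    "small": 0.0011,  # Claude 3.5 Haiku
    "mid": 0.0089,    # Claude 3.5 Sonnet
    "top": 0.041,     # Claude 3 Opus
}


def blended_cost(split: dict[str, float]) -> float:
    """Average cost per task for a traffic split whose shares sum to 1."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * TIER_COST[tier] for tier, share in split.items())


routed = blended_cost({"small": 0.80, "mid": 0.15, "top": 0.05})
all_top = blended_cost({"top": 1.0})
print(f"routed:  ${routed:.6f} per task")
print(f"all top: ${all_top:.6f} per task")
print(f"ratio:   {all_top / routed:.1f}x")
```

With this split, the routed average is about $0.0043 per task, a bit under a tenth of sending everything to the top tier. Note that the 5% of traffic on the top tier still accounts for nearly half of the routed total.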
Practical recommendations
- Default to Claude 3.5 Haiku or GPT-4o-mini for roughly 80% of production traffic.
- Use Sonnet / 4o for the quality-sensitive 15%.
- Reserve Opus / o1 for the genuinely hard 5%.
- Always enable prompt caching where the provider supports it.
- Set per-feature cost budgets with alerts, rather than relying on after-the-fact dashboards.
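The last recommendation can be sketched as a small in-process tracker that fires an alert when a feature crosses a share of its daily budget. The `FeatureBudget` class, the feature name, the $1.00 limit, and the 80% alert threshold are all hypothetical; a production version would also need persistence, a daily reset, and a real notification channel.

```python
from collections import defaultdict
from typing import Callable


class FeatureBudget:
    """Tracks spend per feature and alerts once when a threshold is crossed."""

    def __init__(
        self,
        daily_limits: dict[str, float],
        alert_at: float = 0.8,                 # alert at 80% of the limit (assumed default)
        notify: Callable[[str], None] = print,
    ) -> None:
        self.daily_limits = daily_limits
        self.alert_at = alert_at
        self.notify = notify
        self.spent: defaultdict[str, float] = defaultdict(float)
        self._alerted: set[str] = set()

    def record(self, feature: str, cost_usd: float) -> bool:
        """Add one task's cost. Returns False once the feature is over its limit.

        Raises KeyError for a feature with no configured limit.
        """
        limit = self.daily_limits[feature]
        self.spent[feature] += cost_usd
        if feature not in self._alerted and self.spent[feature] >= self.alert_at * limit:
            self._alerted.add(feature)
            self.notify(f"{feature}: ${self.spent[feature]:.2f} of ${limit:.2f} daily budget used")
        return self.spent[feature] <= limit


# Hypothetical usage: collect alerts in a list instead of paging anyone.
alerts: list[str] = []
budget = FeatureBudget({"support_reply": 1.00}, notify=alerts.append)
results = [budget.record("support_reply", c) for c in (0.50, 0.35, 0.10, 0.10)]
print(alerts)
print(results)
```

The return value lets the caller decide what happens over budget, for example falling back to a cheaper model or rejecting the request, which is the difference between an alert and a dashboard that is read the next morning.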
Want a cost audit of your current LLM stack? Book a 30-minute call.
Cite as: Creative Genius (2026). LLM Cost Benchmarks 2026. Retrieved from creativegenius.ai/research/llm-cost-benchmarks-2026