Will model routing hurt quality?

Done right, no — routing only sends tasks to a smaller model when evals confirm equivalent quality. Done badly, yes. Always run a regression eval before promoting.

Is prompt caching worth implementing?

If your prompts have any stable prefix > 1K tokens and you make > 100 calls/day, yes. It's the highest-ROI optimization most teams haven't shipped.

What's a normal AI cost as % of revenue?

For AI-native products: 8–18% of revenue. For AI-enhanced traditional products: 1–4%. If you're > 25% you have an unoptimized stack — usually fixable in 2–4 weeks of work.

AI cost optimization 2026

Start with an audit

You can't optimize what you can't see. First step: per-feature, per-model, per-user cost breakdown. Tools: Helicone, Langfuse, OpenMeter, or build directly on OpenTelemetry. You'll find that 5–15% of features drive 60–80% of cost — optimize those first.

12 tactics that consistently work

Model routing — use the cheapest model that passes evals for each task (typical savings: 40–60%)
Prompt caching — Anthropic + OpenAI both support cached prefixes at 10x discount (savings: 30–80% on repetitive prefixes)
Batch API — 50% discount for non-realtime workloads (OpenAI + Anthropic both offer)
Shorter system prompts — every duplicated token is paid for every call
JSON mode + structured outputs — eliminates retries from parse failures
Output token limits — agents that ramble waste tokens (cap aggressively)
Distillation — fine-tune a small model on outputs from a big one for high-volume tasks
Semantic caching — cache responses to similar queries (Redis + vector lookup)
Embedding caching — text-embedding-3-large at $0.13/M can compound at scale
Drop CoT in production when you've extracted the patterns into a simpler prompt
Use open-source on Groq / Together for high-volume classification
Right-size context — RAG narrowly, not the entire knowledge base

Model routing in detail

The single highest-impact optimization. Build a router that classifies each request and sends it to the right model:

Tier 1 (cheapest): GPT-4o-mini, Claude Haiku for classification, routing, simple tasks
Tier 2 (workhorse): Claude 3.5 Sonnet, GPT-4o for most tasks
Tier 3 (premium): Claude 4 Opus, o3 for hard reasoning

The router itself can be a small classifier (a fine-tuned 8B model is plenty). Cost of routing: < $0.001 per request. Savings: 40–60% across the workload.

Prompt caching

Anthropic and OpenAI both support cached prefixes at 10x discount for cache hits. The pattern: put your large stable system prompt + few-shot examples in the cached prefix, your dynamic user input outside the cache. Typical savings on agent workloads: 60–80%.

Cost monitoring discipline

Per-customer / per-feature cost budgets with alerts at 1.5x baseline
Weekly cost-by-model review
P95 + P99 cost per request (not just average — the tails kill you)
Cost regression as a CI check on PRs that touch prompts

Want an AI cost audit? Book a free 30-minute call.

AI cost optimization 2026: cut LLM spend 40-70% without losing quality

Table of contents

Start with an audit

12 tactics that consistently work

Model routing in detail

Prompt caching

Cost monitoring discipline

FAQs

More guides

Want this built for your business?