Start with an audit
You can't optimize what you can't see. First step: per-feature, per-model, per-user cost breakdown. Tools: Helicone, Langfuse, OpenMeter, or build directly on OpenTelemetry. You'll find that 5–15% of features drive 60–80% of cost — optimize those first.
12 tactics that consistently work
- Model routing — use the cheapest model that passes evals for each task (typical savings: 40–60%)
- Prompt caching — Anthropic + OpenAI both support cached prefixes at 10x discount (savings: 30–80% on repetitive prefixes)
- Batch API — 50% discount for non-realtime workloads (OpenAI + Anthropic both offer)
- Shorter system prompts — every duplicated token is paid for every call
- JSON mode + structured outputs — eliminates retries from parse failures
- Output token limits — agents that ramble waste tokens (cap aggressively)
- Distillation — fine-tune a small model on outputs from a big one for high-volume tasks
- Semantic caching — cache responses to similar queries (Redis + vector lookup)
- Embedding caching — text-embedding-3-large at $0.13/M can compound at scale
- Drop CoT in production when you've extracted the patterns into a simpler prompt
- Use open-source on Groq / Together for high-volume classification
- Right-size context — RAG narrowly, not the entire knowledge base
Model routing in detail
The single highest-impact optimization. Build a router that classifies each request and sends it to the right model:
- Tier 1 (cheapest): GPT-4o-mini, Claude Haiku for classification, routing, simple tasks
- Tier 2 (workhorse): Claude 3.5 Sonnet, GPT-4o for most tasks
- Tier 3 (premium): Claude 4 Opus, o3 for hard reasoning
The router itself can be a small classifier (a fine-tuned 8B model is plenty). Cost of routing: < $0.001 per request. Savings: 40–60% across the workload.
Prompt caching
Anthropic and OpenAI both support cached prefixes at 10x discount for cache hits. The pattern: put your large stable system prompt + few-shot examples in the cached prefix, your dynamic user input outside the cache. Typical savings on agent workloads: 60–80%.
Cost monitoring discipline
- Per-customer / per-feature cost budgets with alerts at 1.5x baseline
- Weekly cost-by-model review
- P95 + P99 cost per request (not just average — the tails kill you)
- Cost regression as a CI check on PRs that touch prompts
Want an AI cost audit? Book a free 30-minute call.