Two cache layers everyone should have:
- Exact-match cache. Same prompt → same response, served from Redis. A hit skips the model call entirely, so it saves 100% of that request's cost.
- Semantic cache. Embed the prompt, look up a near-duplicate prompt whose answer is already cached, and serve that answer. Saves 70–95% of cost when the user base asks similar questions (which it almost always does). Both layers are sketched below.
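A minimal sketch of both layers in Python, assuming redis-py and the OpenAI SDK for embeddings. The key names, TTLs, similarity threshold, and the linear scan over cached entries are illustrative choices, not a specific library's API; a production semantic cache would use a vector index instead of scanning.

```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

r = redis.Redis(decode_responses=True)
client = OpenAI()

def exact_cache_get(prompt: str) -> str | None:
    # Layer 1: exact match. Hash the prompt and look it up in Redis.
    key = "llm:exact:" + hashlib.sha256(prompt.encode()).hexdigest()
    return r.get(key)

def exact_cache_set(prompt: str, response: str, ttl: int = 86400) -> None:
    key = "llm:exact:" + hashlib.sha256(prompt.encode()).hexdigest()
    r.set(key, response, ex=ttl)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def semantic_cache_get(prompt: str, threshold: float = 0.95) -> str | None:
    # Layer 2: semantic match. Compare the prompt embedding against cached
    # embeddings and reuse the answer of any close-enough neighbor.
    # Linear scan is fine for a sketch; use a vector index at scale.
    query = embed(prompt)
    for key in r.scan_iter("llm:semantic:*"):
        entry = json.loads(r.get(key))
        cached = np.array(entry["embedding"])
        sim = float(query @ cached / (np.linalg.norm(query) * np.linalg.norm(cached)))
        if sim >= threshold:
            return entry["response"]
    return None

def semantic_cache_set(prompt: str, response: str, ttl: int = 86400) -> None:
    key = "llm:semantic:" + hashlib.sha256(prompt.encode()).hexdigest()
    entry = {"embedding": embed(prompt).tolist(), "response": response}
    r.set(key, json.dumps(entry), ex=ttl)
```

Typical flow per request: check the exact cache, then the semantic cache, and only then call the model, writing the fresh response back into both layers.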
OpenAI's prompt caching (automatic since late 2024) gives a 50% discount on cached input tokens. The cache works on shared prompt prefixes, so restructure long prompts so the static parts come first and the per-request parts last: the static prefix is what gets cached and reused across calls.
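A sketch of that ordering, assuming the OpenAI Python SDK; the model name, file path, and message contents are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Large, identical on every call: instructions, policies, reference material.
STATIC_SYSTEM_PROMPT = open("policies.md").read()

def answer(user_question: str, user_context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Static prefix first, so consecutive requests share a cacheable prefix.
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            # Dynamic, per-request content last, so it doesn't break the prefix.
            {"role": "user", "content": f"{user_context}\n\nQuestion: {user_question}"},
        ],
    )
    return resp.choices[0].message.content
```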