AI API costs can explode fast. One production app with moderate traffic was costing $800/month before we applied these techniques. Here's the playbook.
1. Prompt Caching (50% savings immediately)
If your system prompt is the same across requests (and it usually is), you're paying full price for it on every single call. Both Anthropic and OpenAI now offer prompt caching that bills cached tokens at a steep discount: Anthropic prices cache reads at roughly 10% of the base input rate, and OpenAI discounts cached input tokens by about 50%. If a stable prefix dominates your input, turning this on can cut the bill dramatically overnight.
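OpenAI applies caching automatically to long, repeated prefixes; on Anthropic you opt in per request by tagging the stable prefix with `cache_control`. A minimal sketch using the Anthropic Python SDK (the model name and prompt text are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...your stable, multi-kilobyte system prompt..."

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder; use whatever model you run
        max_tokens=1024,
        # Tagging the system prompt with cache_control asks the API to cache
        # this prefix; subsequent calls that reuse it bill at cache-read rates.
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

One caveat: Anthropic only caches prefixes above a minimum length (on the order of 1,024 tokens for most models), so a short system prompt won't qualify.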
2. Model Tiering: Don't Use GPT-4 for Simple Tasks
Classify requests by complexity before routing them. FAQ answering, extraction, summarization of short texts — these don't need your most expensive model. Use Claude Haiku or GPT-4o-mini for Tier 1, and reserve the big model for Tier 2+ tasks. Typical savings: 60–70% on simple queries.
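One minimal way to implement the router, assuming a keyword-and-length heuristic for tier classification (the hints, the 500-character threshold, and the model names are illustrative assumptions, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-4o-mini"   # Tier 1: FAQ, extraction, short summaries
EXPENSIVE_MODEL = "gpt-4o"    # Tier 2+: reasoning-heavy requests

# Crude intent hints; tune these against your actual traffic.
SIMPLE_HINTS = ("faq", "summarize", "extract", "classify")

def route(prompt: str) -> str:
    # Short prompts or clearly simple intents go to the cheap model.
    if len(prompt) < 500 or any(h in prompt.lower() for h in SIMPLE_HINTS):
        return CHEAP_MODEL
    return EXPENSIVE_MODEL

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=route(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A heuristic like this is deliberately dumb; some teams instead use the cheap model itself as the classifier, which costs a fraction of a cent per request.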
3. Context Window Discipline
The single biggest lever. Every token in context costs money. Audit what you're sending:
- Are you passing the full conversation history? Use a rolling window instead (see the sketch after this list).
- Are you re-sending the same documents repeatedly? Cache and reference them.
- Are you fetching 10,000 tokens when the answer is in the first 500?
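A bare-bones rolling window, assuming you keep the system prompt plus only the last N turns (counting messages rather than tokens for brevity):

```python
def trim_history(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt plus only the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]
    return system + recent
```

Once precision matters, swap the message count for a real token budget (counted with something like tiktoken), since a single pasted document can outweigh fifty chat turns.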
4. Batch Processing vs Real-Time
Non-real-time jobs (nightly reports, bulk enrichment, content generation) can go through the Batch API at a 50% discount. You wait up to a few hours instead of milliseconds, but for async jobs this is a no-brainer.
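The OpenAI flow, sketched: write one request per line to a JSONL file, upload it, and create a batch against the chat completions endpoint (the file name, custom IDs, and model are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSON object per line; custom_id lets you match outputs back to inputs.
requests = [
    {
        "custom_id": f"report-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder
            "messages": [{"role": "user", "content": text}],
        },
    }
    for i, text in enumerate(["doc one...", "doc two..."])
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the discount comes from this relaxed SLA
)
print(batch.id)  # poll client.batches.retrieve(batch.id) until status == "completed"
```

Anthropic's Message Batches API offers the same 50% discount with an equivalent submit-and-poll shape.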
5. Output Caching (Often Overlooked)
If 30% of your requests are semantically identical, cache the output for 24–48h. A semantic-cache library (GPTCache, for example) or even a simple Redis + embedding-similarity check can eliminate duplicate LLM calls.
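A bare-bones version of the Redis + embedding check (the 0.95 similarity threshold, the 48h TTL, and the linear scan are all assumptions for the sketch; at any real volume you'd use a vector index instead):

```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis()

THRESHOLD = 0.95          # assumed cutoff for "semantically identical"
TTL_SECONDS = 48 * 3600   # cache entries expire after 48h

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(prompt: str) -> str | None:
    query = embed(prompt)
    # Linear scan for clarity; swap in a vector index at any real scale.
    for key in r.scan_iter("llmcache:*"):
        entry = json.loads(r.get(key))
        vec = np.array(entry["embedding"])
        sim = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if sim >= THRESHOLD:
            return entry["answer"]
    return None

def store_answer(prompt: str, answer: str) -> None:
    key = "llmcache:" + hashlib.sha256(prompt.encode()).hexdigest()
    entry = {"embedding": embed(prompt).tolist(), "answer": answer}
    r.set(key, json.dumps(entry), ex=TTL_SECONDS)
```

Check `cached_answer` before every LLM call and `store_answer` after; tune the threshold carefully, since too loose a match serves wrong answers, which is worse than a cache miss.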
The 80% Number
Our $800 → $160 breakdown, as shares of the $640/month saved: prompt caching (40%, ~$256), model tiering (30%, ~$192), context window trimming (20%, ~$128), batch processing for async jobs (10%, ~$64). Each technique cuts a different slice of the bill, so the savings stack.