
How I Cut My AI API Costs by 97% Using Local LLMs

Last October, I got a billing notification that made me pause: $340 in OpenAI API costs for a single month. My little side project — a content enrichment pipeline that processed a few hundred records a day — had quietly become expensive. I wasn't building GPT-4 into a B2B SaaS. I was running batch jobs on a personal project, paying premium inference prices for tasks that didn't need them.

Two weeks later, that same workload cost me $9/month. Here's exactly what I changed.


Why API Costs Spiral Faster Than You Expect

The problem isn't that APIs are overpriced for what they deliver. It's that most developers use GPT-4 or Claude for everything, including tasks that a smaller, local model handles just as well.

When you're early in a project, you reach for the most capable model because you're still figuring out what the prompts need to do. That habit sticks even after the prompts stabilize. Meanwhile, your pipeline runs on cron, your token counts grow, and you're paying per-call margins to a cloud provider for inference you could run on your own hardware.

A few things compound the cost: retries on API errors, debugging runs, development iterations — none of these need GPT-4-class reasoning. They just need something that works.


What Ollama Actually Is (and What It Isn't)

Ollama is a tool that lets you run open-weight LLMs locally with a single command. It handles model downloading, quantization, and serving — including an OpenAI-compatible REST API so you can swap it into existing code with minimal changes.
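Before touching any SDK code, you can hit that local server directly. A minimal sketch with Python's requests library, assuming Ollama is running on its default port and a model has already been pulled:

# Minimal sketch: call Ollama's OpenAI-compatible endpoint directly, no SDK.
# Assumes the server is running on the default port and the model is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "mistral:7b-instruct",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])

If that call returns text, everything downstream is just a matter of pointing your existing client at the same URL.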

It's not magic. You still need a machine with enough RAM. A 7B model in 4-bit quantization needs about 5–6 GB of RAM. A 13B model needs 10–12 GB. If you're on a MacBook with 16 GB, you can run Mistral 7B or Llama 3.1 8B without breaking a sweat. On a basic cloud VM with 8 GB RAM, you can still run smaller quantized models.

What Ollama isn't is a replacement for frontier models in every use case. Structured reasoning, complex multi-step logic, nuanced creative writing — GPT-4 and Claude 3 Opus still have an edge. But for classification, summarization, extraction, formatting, and light text generation, local 7B–13B models are genuinely competitive.


Step-by-Step: Replacing API Calls with a Local Model

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

That's it. On Mac, there's a desktop app. On Linux, it runs as a systemd service.

2. Pull a model

ollama pull llama3.1:8b

For most batch tasks, I use mistral:7b-instruct or llama3.1:8b. For anything requiring longer context, qwen2.5:14b is excellent and runs on 16 GB.

3. Test it immediately

ollama run mistral:7b-instruct "Summarize this in one sentence: [your text]"

4. Switch your code to the local endpoint

If you're using the OpenAI Python SDK, the only things that change are the client construction and the model name:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK, not actually used
)

response = client.chat.completions.create(
    model="mistral:7b-instruct",
    messages=[{"role": "user", "content": your_prompt}]
)

Your existing prompt logic, response parsing, retry handling — all of it stays the same.


Choosing the Right Model for the Job

Not all tasks are equal, and neither are models. Here's how I split workloads in my own pipelines:

Stay local (Ollama):
- Extracting structured fields from unstructured text (names, addresses, tags)
- Classifying content into predefined categories
- Rewriting or cleaning messy text
- Generating short descriptions from templates
- Summarizing documents under 4,000 tokens

Keep on the cloud API:
- Tasks requiring deep reasoning across many steps
- Generating code in unfamiliar frameworks
- Anything where a subtle error has a high cost (financial summaries, legal drafts)
- Long-context tasks beyond 32K tokens

In my pipeline, about 80% of calls were classification and extraction. Moving those to a local Mistral 7B cut the paid API call volume to roughly a fifth of what it was, before I even looked at token counts.
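That split is straightforward to express in code. A rough sketch of a task-based router (the task names and model choices here are illustrative, not lifted from my actual pipeline):

# Rough sketch: send cheap, well-bounded tasks to the local model and keep
# the cloud API for calls that genuinely need frontier-level reasoning.
# Task names and model choices are illustrative.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

LOCAL_TASKS = {"classify", "extract", "clean", "summarize_short"}

def run(task: str, prompt: str) -> str:
    client, model = (
        (local, "mistral:7b-instruct") if task in LOCAL_TASKS
        else (cloud, "gpt-4")
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content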


Real Numbers: Before and After

My original setup processed roughly 600 records/day through GPT-3.5-turbo (at the time, $0.002/1K tokens). Average prompt+completion was about 800 tokens per record. That's:

600 records × 800 tokens × $0.002/1K = $0.96/day → ~$29/month

But I also had a second pipeline running an enrichment pass with GPT-4 on about 100 records/day:

100 records × 1,200 tokens × $0.03/1K = $3.60/day → ~$108/month

After migrating classification to Mistral 7B locally and keeping GPT-4 only for the 15–20 records/day that actually needed it:

20 records × 1,200 tokens × $0.03/1K = $0.72/day → ~$22/month
Local inference: electricity cost, negligible

Total: from ~$137/month to ~$22/month. With further prompt compression on the remaining GPT-4 calls, I got it below $10.
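If you want to sanity-check those numbers, the arithmetic fits in a few lines (assuming a 30-day month and the per-1K-token prices quoted above):

# Back-of-the-envelope check of the monthly figures above (30-day month).
def monthly_cost(records_per_day, tokens_per_record, price_per_1k_tokens):
    return records_per_day * tokens_per_record / 1000 * price_per_1k_tokens * 30

before = monthly_cost(600, 800, 0.002) + monthly_cost(100, 1200, 0.03)  # ~$136.80
after = monthly_cost(20, 1200, 0.03)                                    # ~$21.60
print(f"before: ${before:.0f}/month, after: ${after:.0f}/month")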

The 97% figure in my headline comes from specific projects where 100% of calls moved local. For most mixed workloads, 70–90% reduction is realistic without touching output quality for your core use cases.


Running Ollama in Production (Not Just Locally)

For solo projects and internal tools, running Ollama on a $12/month VPS (4 vCPUs, 8 GB RAM) is completely viable for moderate traffic. I use a basic Ubuntu VM with Ollama running as a service, accessible over a private network.
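From the application side, nothing changes except the host. A sketch assuming a placeholder private IP of 10.0.0.5 and Ollama configured to listen on that interface (by default it only binds to localhost):

# Sketch: same OpenAI SDK setup, pointed at the VPS over the private network.
# 10.0.0.5 is a placeholder for your VPS's private IP; Ollama must be
# configured to listen on that interface rather than only on localhost.
from openai import OpenAI

client = OpenAI(base_url="http://10.0.0.5:11434/v1", api_key="ollama")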

The main thing to get right is sizing for your request volume. For heavier workloads (hundreds of requests/hour), a GPU-enabled cloud instance changes the math dramatically: A10G spot instances on AWS run $0.30–0.60/hour and can serve dozens of concurrent requests.


Going Further

Switching to Ollama isn't a one-afternoon task if you want to do it properly — you need to benchmark your specific prompts, test output quality against your baseline, and handle edge cases where the local model underperforms. There's also model selection, quantization tradeoffs, and production deployment to think through.
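A simple way to start that benchmarking is to run the same prompts through both endpoints and review the outputs side by side. A minimal sketch, assuming you keep representative prompts in a prompts.jsonl file (the file name, format, and model names are illustrative):

# Minimal sketch: run the same prompts through the local model and the cloud
# model, then save the outputs side by side for manual or scripted review.
import json
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(client, model, prompt):
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

with open("prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

results = [
    {"prompt": p,
     "local": ask(local, "mistral:7b-instruct", p),
     "cloud": ask(cloud, "gpt-4", p)}
    for p in prompts
]

with open("comparison.json", "w") as f:
    json.dump(results, f, indent=2)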

If you want a structured walkthrough covering all of this — including which models to use for which tasks, how to evaluate output quality, and how to deploy Ollama on a VPS for production use — I put together a detailed guide: Local LLMs: Réduisez vos coûts IA de 97%. It's 12€ and covers the full process from first install to production pipeline, with real prompt examples and benchmarks.

The short version: most developers are massively overpaying for inference on tasks that don't need frontier models. Ollama makes it practical to stop.

Want to go deeper?
Local LLMs: Réduisez vos coûts IA de 97%

A complete guide to deploying local LLMs, optimizing every token, and building an AI stack that costs almost nothing

Get the full guide for 12€ →