Local LLMs in 2025: When to Self-Host

Llama 3.1 405B and Mistral Large 2 changed the game. Here's when self-hosting makes sense.

By Creative Genius · May 12, 2026 · 8 min read

Open-weight models have largely caught up. Llama 3.1 405B, Mistral Large 2, and Qwen 2.5 are within striking distance of frontier closed models for most production tasks. That changes the self-host vs hosted equation, but not as much as the marketing suggests.

When self-hosting actually wins

Data residency requires it. Some EU customers, all classified-government work, certain healthcare deployments.
Hosted API spend exceeds $10K/mo. Below that, operational cost eats your savings.
You need fine-tuning that hosted providers won't allow. Domain dialects, copyrighted material, sensitive prompt techniques.
Latency floor matters. Co-locating the model with your app can shave 100–300ms vs hosted round trips.
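The four criteria above can be written down as a small checklist. This is only a sketch: the Workload fields and the function name are hypothetical, and the $10K/mo threshold is simply the figure from the list, exposed as a parameter so you can substitute your own.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_data_residency: bool       # contractual or regulatory requirement
    hosted_spend_usd_month: float    # current or projected hosted API bill
    needs_restricted_finetune: bool  # fine-tuning a hosted provider won't allow
    latency_sensitive: bool          # 100-300 ms of round trip actually matters


def reasons_to_self_host(w: Workload, spend_threshold_usd: float = 10_000) -> list[str]:
    """Return the criteria from the list above that this workload meets."""
    reasons = []
    if w.needs_data_residency:
        reasons.append("data residency")
    if w.hosted_spend_usd_month > spend_threshold_usd:
        reasons.append("hosted spend above threshold")
    if w.needs_restricted_finetune:
        reasons.append("restricted fine-tuning")
    if w.latency_sensitive:
        reasons.append("latency floor")
    return reasons


# Hypothetical workloads, for illustration only.
small_app = Workload(False, 4_000, False, False)
regulated_app = Workload(True, 12_000, False, True)

print(reasons_to_self_host(small_app))
print(reasons_to_self_host(regulated_app))
```

An empty list does not prove hosted is the right call, and a non-empty one does not prove self-hosting pays off. It only tells you which arguments you would have to back up with numbers.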

The honest cost math

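Llama 3.1 70B on 2× H100s: roughly $15K/month on a cloud GPU provider, or ~$80K capex for bare metal + colo. At 500K queries/month with an average of 1K input + 500 output tokens per query, the equivalent OpenAI bill is roughly $4K, or about $0.008 per query. At that rate the $15K/month GPU rental alone breaks even near 1.9M queries/month; with operational overhead on top, self-hosting only wins above ~3M queries/month at those token sizes.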
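The break-even arithmetic here is easy to check in a few lines of Python. This rough sketch uses only the figures quoted above ($15K/month, $4K at 500K queries, the ~3M threshold). The function names are illustrative, and it assumes hosted cost scales linearly with query count, which is a simplification.

```python
def cost_per_query(monthly_bill_usd: float, monthly_queries: int) -> float:
    """Average hosted cost of one query, assuming cost scales linearly."""
    return monthly_bill_usd / monthly_queries


def break_even_queries(self_host_monthly_usd: float, per_query_usd: float) -> float:
    """Monthly query count at which hosted spend equals the self-host bill."""
    return self_host_monthly_usd / per_query_usd


# Figures quoted in the article.
per_query = cost_per_query(4_000, 500_000)         # about $0.008 per query
gpu_only = break_even_queries(15_000, per_query)   # about 1.9M queries/month

# Monthly self-host cost implied by the ~3M threshold at the same per-query rate.
implied_total = 3_000_000 * per_query              # about $24K/month

print(f"hosted cost per query: ${per_query:.4f}")
print(f"break-even on GPU rental alone: {gpu_only:,.0f} queries/month")
print(f"monthly cost implied by a 3M break-even: ${implied_total:,.0f}")
```

On these numbers the GPU rental alone breaks even near 1.9M queries/month, so a ~3M threshold corresponds to roughly $24K/month of total self-hosting cost. Swap in your own bill, query count, and hardware quote before drawing conclusions.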

The operational tax nobody mentions

You now own model upgrades, GPU driver compatibility, vLLM/TGI tuning, and capacity planning.
P95 latency under load is a real engineering project, not a config.
Multi-tenant isolation requires extra work that hosted providers do for you.
One on-call engineer with GPU experience costs more per year than your inference bill at most volumes.
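To make the P95 point concrete, here is a minimal sketch of computing a tail percentile from recorded request latencies. It uses the nearest-rank definition, which is one of several common percentile definitions, and the sample values are made up for illustration.

```python
def percentile_nearest_rank(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer form of ceil(pct / 100 * n)
    return ordered[rank - 1]


# Hypothetical latencies: 18 fast requests and 2 slow ones under load.
latencies_ms = [100.0] * 18 + [900.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)

print(f"mean={mean_ms:.0f} ms  p50={p50_ms:.0f} ms  p95={p95_ms:.0f} ms")
```

In this made-up sample the mean and median look healthy while the P95 sits at the slow-request latency, which is why tail latency has to be measured under realistic load rather than inferred from averages.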

The hybrid that often wins

Use hosted (OpenAI, Anthropic) for the long tail of low-volume features. Use self-hosted for the 1–2 features that drive most of your traffic. The 80/20 of cost savings comes from moving the hottest path, not from going all-in.
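One way to sketch the hybrid split is to rank features by traffic and move only the top one or two to self-hosted. The feature names, volumes, 80% target, and function names below are all hypothetical; the point is the selection logic, not the numbers.

```python
def pick_self_hosted(
    volumes: dict[str, int],
    target_share: float = 0.8,
    max_features: int = 2,
) -> list[str]:
    """Pick the highest-traffic features until target_share of queries is covered."""
    total = sum(volumes.values())
    if total == 0:
        return []
    chosen: list[str] = []
    covered = 0
    for name, count in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        if covered / total >= target_share or len(chosen) >= max_features:
            break
        chosen.append(name)
        covered += count
    return chosen


def backend_for(feature: str, self_hosted: list[str]) -> str:
    """Route a feature to the self-hosted model or to a hosted API."""
    return "self-hosted" if feature in self_hosted else "hosted"


# Hypothetical monthly query volumes per feature.
monthly_queries = {
    "search": 700_000,
    "summarize": 200_000,
    "tagging": 50_000,
    "translate": 50_000,
}

hot_path = pick_self_hosted(monthly_queries)
for feature in monthly_queries:
    print(feature, "->", backend_for(feature, hot_path))
```

Capping the list at two features keeps the self-hosted surface small. Everything else stays on hosted APIs, where low volume makes the per-query price a non-issue.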

Bottom line

Self-hosting is a real option in 2025, but it's an option with a price tag. Run the math at your actual traffic, include the operational cost, and don't pretend you're going to save money on something you've never deployed before.