Everyone talks about self-hosting Llama or Mistral. Few actually do it.

When it makes sense

Hard confidentiality or data-sovereignty requirements, massive and constant volume, ultra-low latency, predictable cost at scale. The last one deserves arithmetic; see the sketch below.
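
To put numbers on "predictable cost at scale", a back-of-envelope break-even. Every figure below (API price, GPU rental rate) is an assumption, not a quote; plug in your own.

```python
# Break-even: self-hosting vs. a hosted API. All numbers are illustrative.
API_COST_PER_MTOK = 3.00   # assumed blended $/1M tokens on a hosted API
GPU_HOURLY = 2 * 4.00      # assumed rental rate for 2x H100, $/hour
HOURS_PER_MONTH = 730

monthly_gpu_cost = GPU_HOURLY * HOURS_PER_MONTH
breakeven_mtok = monthly_gpu_cost / API_COST_PER_MTOK

print(f"Monthly GPU cost: ${monthly_gpu_cost:,.0f}")          # ~$5,840
print(f"Break-even volume: {breakeven_mtok:,.0f}M tokens/mo")  # ~1,947M
```

Below that volume the API wins on price alone; above it, the math starts to favor your own metal, before you even count the other reasons on the list.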

Model choice

Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, DeepSeek V3.
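
Whichever model wins, the weights come off the Hugging Face Hub. A minimal download sketch; gated repos like Llama require an accepted license and an access token:

```python
# Fetch open weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

path = snapshot_download(
    "meta-llama/Llama-3.3-70B-Instruct",  # swap in the model you picked
    # token="hf_...",                     # required for gated repos
)
print(f"Weights downloaded to {path}")
```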

Hardware

2x H100 or 1x H200 for a quality 70B with 4-bit quantization. The weights themselves are only ~35 GB; the rest of the VRAM goes to KV cache, batching headroom, and long contexts.
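
Why two cards when 4-bit weights fit in ~35 GB? The KV cache. A sizing sketch, assuming a Llama-3-70B-shaped model (80 layers, 8 KV heads via GQA, head_dim 128, FP16 cache):

```python
# Rough VRAM budget for a 70B model served at 4-bit.
params = 70e9
weights_gb = params * 0.5 / 1e9          # 4 bits = 0.5 bytes/param -> ~35 GB

# KV cache per token: layers * (K and V) * kv_heads * head_dim * fp16 bytes.
layers, kv_heads, head_dim = 80, 8, 128  # assumed Llama-3-70B-like shape
kv_bytes_per_token = layers * 2 * kv_heads * head_dim * 2

ctx, batch = 32_000, 8
cache_gb = batch * ctx * kv_bytes_per_token / 1e9
print(f"weights: {weights_gb:.0f} GB")                       # ~35 GB
print(f"KV cache @ {ctx} tokens, batch {batch}: {cache_gb:.0f} GB")  # ~84 GB
# ~35 GB weights + ~84 GB cache: a single 80 GB H100 can't hold it;
# 2x H100 (160 GB) or 1x H200 (141 GB) can.
```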

Stack

vLLM is the de facto standard in 2026. OpenAI-compatible API out of the box, so existing client code only needs a new base URL.
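
A minimal end-to-end sketch. The launch command sits in the comments (repo name and flags are assumptions to tune; recent vLLM infers the quantization method from a pre-quantized checkpoint's config), and the client is the stock OpenAI SDK repointed at localhost:

```python
# Launch (shell) -- assumed checkpoint and flags, tune to your setup:
#   # serve a pre-quantized 4-bit checkpoint (hypothetical repo name):
#   vllm serve your-org/Llama-3.3-70B-Instruct-AWQ \
#       --tensor-parallel-size 2 \
#       --max-model-len 8192
#
# Client: the standard OpenAI SDK, pointed at the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="your-org/Llama-3.3-70B-Instruct-AWQ",  # must match the served model
    messages=[{"role": "user", "content": "Give me one reason to self-host."}],
)
print(resp.choices[0].message.content)
```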