Everyone talks about self-hosting Llama or Mistral. Few actually do it.
When it makes sense
Self-hosting pays off when you need confidentiality, data sovereignty, sustained high volume, ultra-low latency, or predictable cost at scale. If none of these apply, an API is almost always cheaper.
Model choice
Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, DeepSeek V3.
Hardware
A 70B model at 4-bit quantization needs roughly 35-40 GB for weights alone, so a single 80 GB H100 can technically hold it, but 2x H100 or 1x H200 (141 GB) gives the KV-cache headroom needed for long contexts and real concurrency.
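The headroom argument above is simple arithmetic. A minimal back-of-envelope sketch (weights only, ignoring KV cache and activation memory, which add on top):

```python
# Back-of-envelope VRAM estimate for model weights at a given precision.
# This covers weights only; KV cache grows with context length and batch size.

def weight_gb(params_billions: float, bits: int) -> float:
    """Memory needed for model weights, in GB."""
    return params_billions * 1e9 * bits / 8 / 1e9

fp16 = weight_gb(70, 16)  # full-precision 70B
int4 = weight_gb(70, 4)   # 4-bit quantized 70B

print(f"fp16 weights: {fp16:.0f} GB")   # won't fit one 80 GB H100
print(f"4-bit weights: {int4:.0f} GB")  # fits, but leaves limited headroom
```

At fp16 you need 140 GB just for weights (hence 2x H100); at 4-bit it drops to 35 GB, and everything beyond that is cache and batching headroom.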
Stack
vLLM is the reference serving stack in 2026: continuous batching, tensor parallelism, and an OpenAI-compatible API out of the box, so existing client code points at your server with a one-line base-URL change.
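A minimal launch sketch, assuming a pre-quantized 4-bit checkpoint and 2 GPUs (the checkpoint path is a placeholder; the flags are standard vLLM engine arguments, adjust to your hardware):

```shell
# Serve a 4-bit 70B checkpoint across 2 GPUs.
# <your-awq-70b-checkpoint> is a placeholder for an AWQ-quantized model repo/path.
vllm serve <your-awq-70b-checkpoint> \
  --tensor-parallel-size 2 \
  --max-model-len 8192

# The server speaks the OpenAI API on port 8000; any OpenAI client works
# by pointing base_url at http://localhost:8000/v1.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your-awq-70b-checkpoint>",
       "messages": [{"role": "user", "content": "Hello"}]}'
```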