What is Distillation?

Distillation is the technique of “using a large model to teach a small one.” The workflow: take a high-quality teacher model, have it produce answers to a large set of prompts, then train a smaller student model to mimic the teacher’s outputs, either the generated text itself or, when the teacher’s logits are available, its full probability distribution over tokens. The student ends up far smaller and cheaper to run, yet punches well above its weight.
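To make "mimic the teacher’s output distribution" concrete, here is a minimal sketch of the classic soft-label objective: soften both sets of logits with a temperature, then measure the KL divergence from teacher to student. The function names, the example logits, and the temperature of 2.0 are illustrative assumptions, not taken from any particular vendor's pipeline, and real training would run this per token inside a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is a common convention that keeps gradient
    magnitudes comparable across temperatures; it is an assumption here.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers the wrong token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over many prompts pulls the student toward the teacher's behavior. When only the teacher's generated text is available (for example, through an API), the usual substitute is ordinary fine-tuning on that text.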

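The industry impact is massive. Google has described Gemini Flash as distilled from Pro, and small-tier models such as Claude Haiku and GPT-4o-mini are widely assumed to be distilled from their larger siblings (Opus and GPT-4o), although those vendors have not published the details. Distillation is the secret sauce behind modern “cheap but accurate” small language models (SLMs). When DeepSeek released distilled versions of R1, students as small as Llama-8B and Qwen-1.5B suddenly rivaled GPT-4o on math benchmarks in DeepSeek’s reported results, flipping the assumption that small models can only handle weak tasks. For teams that self-host, distillation is the lever that can cut inference spend to roughly 1/20th of what calling a large-model API would cost, though the exact saving depends on the workload and hardware.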