Distillation

Judy AI Lab AI Glossary

core beginner

What is Distillation?

Distillation is the technique of “using a large model to teach a small one.” The workflow: take a high-quality teacher model, have it answer a large set of prompts, then train a smaller student model to reproduce those answers or, when the teacher’s token probabilities are available, to match its full output distribution. The student is far smaller and cheaper to run, yet it retains much of the teacher’s quality on the tasks it was distilled for.
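To make “mimic the teacher’s output distribution” concrete, here is a minimal sketch of one common training signal: the KL divergence between temperature-softened teacher and student distributions, scaled by T² as in Hinton et al.’s classic formulation. The logits, the three-token vocabulary, and the function names are illustrative assumptions, not any vendor’s actual pipeline; real training would compute this over every token position with an autodiff framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a single token position (assumed values).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge, so minimizing it pulls the student toward the teacher’s full ranking of alternatives rather than just its top answer. A temperature above 1 flattens both distributions, which lets the student see how the teacher weighs the less likely tokens.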

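The industry impact is massive. Small tiers such as Claude Haiku, Gemini Flash, and GPT-4o-mini are widely believed to be distilled from their larger siblings (Opus, Pro, and GPT-4o); Google has said so explicitly for Gemini 1.5 Flash, while the other vendors have not published full training details. Distillation is the secret sauce behind modern “cheap but accurate” SLMs. When DeepSeek released distilled versions of R1 built on Llama-8B and Qwen-1.5B, those small models approached or, on DeepSeek’s reported numbers, beat GPT-4o on math benchmarks, flipping the assumption that small models can only handle weak tasks. For teams that self-host, distillation is the lever that can cut inference spend to roughly 1/20th of what a large-model API would cost, though the exact saving depends on the workload.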
