Model Distillation

Model distillation is a technique in which a smaller, less complex model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher).

Model distillation aims to transfer knowledge from a large, often computationally expensive model to a smaller one. The teacher model, already trained on a dataset, provides soft labels (a full probability distribution over the classes) for the student to learn from, rather than hard labels alone. This lets the student capture some of the teacher's generalization behavior and potentially reach similar performance with fewer parameters and lower compute cost. Training typically combines the original training data with the teacher's outputs, often using a loss function that penalizes divergence between the student's predictions and the teacher's.
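To make the difference between soft and hard labels concrete, here is a minimal sketch in plain Python. The class names and the teacher logits are made-up illustration values, not output from any real model.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one input over three classes (assumed values).
classes = ["cat", "dog", "truck"]
teacher_logits = [4.0, 3.0, -2.0]

# Soft label: the teacher's full probability distribution.
soft_label = softmax(teacher_logits)

# Hard label: only the winning class survives, as a one-hot vector.
winner = soft_label.index(max(soft_label))
hard_label = [1.0 if i == winner else 0.0 for i in range(len(classes))]

for name, soft, hard in zip(classes, soft_label, hard_label):
    print(f"{name:>5}: soft={soft:.3f} hard={hard:.0f}")
```

With these assumed logits the soft label is roughly 0.73 for "cat", 0.27 for "dog" and 0.002 for "truck". The hard label keeps only "cat" and discards the hint that this input looks far more like a dog than a truck, which is the extra signal the student learns from.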

This technique is particularly useful where deploying the full teacher model is impractical due to resource constraints, for example on edge devices or in mobile applications. It yields efficient models that retain much of their larger counterparts' performance while offering faster inference and a smaller memory footprint.

        graph LR
  Center["Model Distillation"]:::main
  Pre_machine_learning["machine-learning"]:::pre --> Center
  click Pre_machine_learning "/terms/machine-learning"
  Pre_neural_network["neural-network"]:::pre --> Center
  click Pre_neural_network "/terms/neural-network"
  Pre_deep_learning["deep-learning"]:::pre --> Center
  click Pre_deep_learning "/terms/deep-learning"
  Rel_llm["llm"]:::related -.-> Center
  click Rel_llm "/terms/llm"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;

🧒 Explain Like I'm 5

It's like a master chef teaching an apprentice. The apprentice learns not just the final recipe (the right answer) but also the subtle techniques and reasoning the master uses, so the apprentice can cook almost as well but much faster.

🤓 Expert Deep Dive

Model distillation, also known as knowledge distillation, is a form of model compression built on a teacher-student setup. The objective function for the student model typically takes the form: L_total = α L_hard + (1 - α) L_soft, where α in [0, 1] weights the two terms, L_hard is a standard cross-entropy loss against ground-truth labels, and L_soft is a loss (e.g., KL divergence or cross-entropy) comparing the student's softened outputs to the teacher's softened outputs.

The temperature parameter T in the softmax function (softmax(z_i / T)) is crucial and is applied to both teacher and student logits when computing L_soft. A higher T produces a softer probability distribution, emphasizing inter-class similarities, while T=1 recovers the standard softmax.

Variants include distilling intermediate feature representations (feature distillation) or attention maps, rather than just the final output probabilities. This can be particularly effective when the teacher and student architectures differ significantly. Distillation schemes also differ in how training is scheduled: offline distillation trains the teacher first and then distills it, online distillation trains the teacher and student simultaneously, and self-distillation uses a model of the same architecture as both teacher and student.

Challenges include selecting the appropriate distillation loss, tuning the temperature T and the weighting parameter α, and the potential for negative transfer if the teacher model is poorly suited to the task or the student's capacity is too low. Effectiveness often depends on how closely the teacher's learned function approximates the true underlying data distribution.
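The objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch for a single example, not a reference implementation: the function names, the logits, and the defaults alpha=0.5 and T=4.0 are assumptions chosen for the demo, and KL divergence is used for L_soft as one of the options the text mentions.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: softmax(z_i / T). T=1 is the standard softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """L_hard: negative log-probability assigned to the ground-truth class."""
    return -math.log(probs[true_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index, alpha=0.5, T=4.0):
    """L_total = alpha * L_hard + (1 - alpha) * L_soft for one example."""
    # Hard term: student at T=1 against the ground-truth label.
    l_hard = cross_entropy(softmax(student_logits), true_index)
    # Soft term: teacher and student are both softened with the same T.
    l_soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * l_hard + (1 - alpha) * l_soft

# Assumed logits for a three-class problem where class 0 is correct.
teacher_logits = [4.0, 3.0, -2.0]
student_logits = [2.5, 0.5, -1.0]

loss = distillation_loss(student_logits, teacher_logits, true_index=0)
print(f"total loss: {loss:.4f}")

# A higher temperature flattens the teacher's distribution.
print("teacher at T=1:", [round(p, 3) for p in softmax(teacher_logits, 1.0)])
print("teacher at T=4:", [round(p, 3) for p in softmax(teacher_logits, 4.0)])
```

Many implementations additionally multiply the soft term by T² so that its gradient magnitude stays comparable to the hard term as T changes. That factor is left out here to match the formula as written in the text.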