Model Distillation

Model distillation is a technique in which a smaller, less complex model (the student) is trained to imitate the behavior of a larger, more complex model (the teacher).

Model distillation aims to transfer knowledge from a large, often computationally expensive model to a smaller one. The teacher model, pre-trained on a dataset, provides soft labels (class probabilities) for the student model to learn from, rather than only hard labels. This lets the student capture the teacher's generalization capabilities and potentially reach similar performance with fewer parameters and lower computational cost. The process typically involves training the student on a combination of the original training data and the teacher's outputs, usually with a loss function that encourages the student's predictions to align with the teacher's.
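To make the difference between hard and soft labels concrete, here is a minimal sketch in plain Python. The three-class setup and the teacher logits are hypothetical values chosen for illustration, not taken from any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes: cat, dog, truck.
teacher_logits = [5.0, 3.0, -2.0]

# A hard label only says "cat" and carries no information about the other classes.
hard_label = [1.0, 0.0, 0.0]

# Soft labels keep the teacher's relative confidence in every class.
soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t4 = softmax(teacher_logits, temperature=4.0)

print("hard label:", hard_label)
print("soft (T=1):", [round(p, 3) for p in soft_t1])
print("soft (T=4):", [round(p, 3) for p in soft_t4])
```

With the higher temperature, probability mass moves from the top class toward the others, so the student sees that the teacher considers "dog" a more plausible alternative than "truck".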

This technique is particularly useful where deploying the full teacher model is impractical because of resource constraints, such as on edge devices or in mobile applications. It enables efficient models that retain much of the performance of their larger counterparts, with faster inference and a smaller memory footprint.

        graph LR
  Center["Model Distillation"]:::main
  Pre_machine_learning["machine-learning"]:::pre --> Center
  click Pre_machine_learning "/terms/machine-learning"
  Pre_neural_network["neural-network"]:::pre --> Center
  click Pre_neural_network "/terms/neural-network"
  Pre_deep_learning["deep-learning"]:::pre --> Center
  click Pre_deep_learning "/terms/deep-learning"
  Rel_llm["llm"]:::related -.-> Center
  click Rel_llm "/terms/llm"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;

🧒 Explain Like I'm 5

It's like a master chef teaching an apprentice. The apprentice learns not just the final recipe (the right answer) but also the subtle techniques and reasoning the master uses, so the apprentice can cook almost as well but much faster.

🤓 Expert Deep Dive

Model distillation, also known as knowledge distillation, is a form of model compression built on a teacher-student setup. The objective function for the student model typically takes the form L_total = α L_hard + (1 - α) L_soft, where L_hard is a standard cross-entropy loss against ground-truth labels, L_soft is a loss (e.g., KL divergence or cross-entropy) comparing the student's softened outputs to the teacher's softened outputs, and α weights the two terms. The temperature parameter T in the softmax function, softmax(z_i / T), is crucial: a higher T produces a softer probability distribution that emphasizes inter-class similarities, while T = 1 recovers the standard softmax.

Variants distill intermediate feature representations (feature distillation) or attention maps rather than only the final output probabilities. This can be particularly effective when the teacher and student architectures differ significantly. Distillation schemes also differ in when the teacher is trained. Offline distillation trains the teacher first and then distills it. Online distillation trains the teacher and student simultaneously. Self-distillation uses the same architecture for both teacher and student.

Challenges include selecting an appropriate distillation loss, tuning the temperature T and the weighting parameter α, and the potential for negative transfer if the teacher is poorly suited to the task or the student's capacity is too low. Effectiveness often depends on how well the teacher's learned function approximates the true underlying data distribution.
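The combined objective can be sketched in plain Python for a single example. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the logits, and the default values of α and T are hypothetical, L_soft is taken to be the KL divergence, and the soft term is multiplied by T², a common convention that is not part of the formula above and is used here to keep the soft-term gradients on a comparable scale as T changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=4.0):
    """L_total = alpha * L_hard + (1 - alpha) * L_soft for one example."""
    # L_hard: cross-entropy of the student's T=1 softmax against the true class.
    student_probs = softmax(student_logits)
    l_hard = -math.log(student_probs[true_class] + 1e-12)

    # L_soft: KL divergence between the softened teacher and student outputs.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Assumption: scale by T^2 (a common convention, not in the formula above).
    l_soft = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return alpha * l_hard + (1 - alpha) * l_soft

# Hypothetical logits for a three-class problem whose true class is index 0.
teacher_logits = [5.0, 3.0, -2.0]
student_matching = [5.0, 3.0, -2.0]   # student agrees with the teacher
student_reversed = [-2.0, 3.0, 5.0]   # student ranks the classes backwards

loss_matching = distillation_loss(student_matching, teacher_logits, true_class=0)
loss_reversed = distillation_loss(student_reversed, teacher_logits, true_class=0)
print("loss when student matches teacher:", loss_matching)
print("loss when student disagrees:      ", loss_reversed)
```

Setting alpha to 1 reduces the objective to ordinary supervised training, while setting it to 0 trains the student on the teacher's outputs alone. In a real training loop the same computation would be done batch-wise with a deep learning framework's differentiable operations.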