Complete Guide to BERT Model Distillation (Principle & Technique & Code)

Do you have a lot of question marks about model distillation, such as:

  • What is distillation? How do you distill BERT?
  • What are the tricks of BERT distillation? How do you tune the hyperparameters?
  • How do you write distillation code? Is there a ready-made implementation?

Today, rumor walks through six classic models (Distilled BiLSTM / BERT-PKD / DistilBERT / TinyBERT / MobileBERT / MiniLM) to make BERT distillation crystal clear!


Note: At the end of the article there is a summary of BERT interview points and related models, as well as how to join the NLP study group~

Principle of model distillation

Hinton proposed the concept of Knowledge Distillation at NIPS 2014 [1]. The goal is to transfer the knowledge learned by a large model, or an ensemble of models, to a single lightweight model that is easy to deploy. Simply put, the small model learns to match the large model's predictions instead of directly learning the labels in the training set.
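
In other words, instead of fitting only the one-hot training labels, the student also fits the teacher's predicted distribution. One common way to write this combined objective (the notation and the weight $\alpha$ are generic, not specific to this article) is

$$
\mathcal{L}_{\mathrm{student}} = \alpha \,\mathrm{CE}(y,\, p_s) + (1-\alpha)\,\mathrm{CE}(p_t,\, p_s)
$$

where $y$ is the one-hot label, $p_s$ and $p_t$ are the student's and teacher's output distributions, and $\alpha$ balances the hard-label and soft-label terms.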

During distillation, the original large model is called the teacher, the new small model is called the student, the labels in the training set are called hard labels, the probability distribution predicted by the teacher is called the soft label, and the temperature T is the hyperparameter used to control how smooth the soft labels are.
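
To make these terms concrete, here is a minimal PyTorch-style sketch of the classic distillation loss; the function name, the default temperature, and the weight alpha are illustrative choices rather than the code of any particular paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Mix the hard-label loss with the temperature-scaled soft-label loss.

    student_logits, teacher_logits: tensors of shape [batch_size, num_classes]
    hard_labels: integer class ids from the training set, shape [batch_size]
    temperature: T > 1 smooths the teacher's probability distribution
    alpha: weight on the hard-label term (illustrative default)
    """
    # Hard-label term: ordinary cross-entropy against the dataset labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # Soft-label term: KL divergence between the temperature-scaled
    # teacher and student distributions. The T^2 factor keeps its gradient
    # magnitude comparable to the hard-label term.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Usage sketch: the teacher is frozen, only the student is updated.
# teacher_logits = teacher_model(input_ids).detach()
# student_logits = student_model(input_ids)
# loss = distillation_loss(student_logits, teacher_logits, labels)
```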

Source: blog.csdn.net/linjie_830914/article/details/131543848