Require: Set of labeled training data D = {(x_i, y_i)}
Require: Set of K teacher models T = {T_k}
Require: Student model S
Ensure: Trained student model
1: Initialize student model parameters θ_S randomly.
2: for each teacher model T_k ∈ T do
3: Compute teacher predictions p_k(x_i) for each x_i ∈ D.
4: Initialize student model weights to match T_k.
5: Train student model on D by minimizing the distillation loss:

\mathrm{KDLoss}(\theta_S, \theta^{(k)}_T; D) = \frac{1}{n} \sum_{i=1}^{n} D_{KL}\!\left( p_k(x_i) \,\|\, q_s(x_i; \theta_S, \theta^{(k)}_T) \right)

where D_{KL} denotes the Kullback-Leibler divergence and q_s(x_i; \theta_S, \theta^{(k)}_T) is the softmax output of the student model.
6: end for
7: return Trained student model S
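The distillation loss in step 5 can be sketched in plain Python. This is a minimal illustration of the loss computation only, not the full training loop: `softmax`, `kl_divergence`, and `kd_loss` are hypothetical helper names, and the teacher/student outputs are passed in as raw logit lists rather than produced by actual models.

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # D_KL(p || q); terms with p_i = 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits):
    # Mean KL divergence between teacher and student softmax outputs,
    # averaged over the n training examples (the 1/n sum in the formula).
    n = len(teacher_logits)
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        total += kl_divergence(softmax(t_logits), softmax(s_logits))
    return total / n

# When the student exactly matches the teacher, the loss is zero;
# any mismatch yields a strictly positive loss.
print(kd_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]))  # → 0.0
print(kd_loss([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]]) > 0)  # → True
```

In practice this quantity is computed on temperature-scaled logits and minimized by gradient descent over θ_S, with the teacher outputs p_k(x_i) held fixed.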