DAMO-YOLO Model Distillation Optimization: Distill in Stage One to Accelerate Convergence, Turn Off Distillation in Stage Two
Model distillation is an effective way to improve model performance. The training process of DAMO-YOLO is divided into two stages: the first stage trains with strong mosaic augmentation, and the second stage trains with mosaic augmentation turned off. It was found that applying distillation in the first stage yields faster convergence and higher performance, but continuing distillation into the second stage brings no further gains. The reason is that the data distribution in the second stage shifts significantly from that of the first stage, so distilling in the second stage can disrupt the knowledge distribution learned in the first. Moreover, the second stage is too short for the model to fully transition from the first-stage knowledge distribution to that of the second stage, and forcibly extending the training schedule or raising the learning rate would increase training cost and time while weakening the effect of the first-stage distillation. Therefore, DAMO-YOLO performs distillation only in the first stage, not in the second.
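The following is a minimal sketch, not the official DAMO-YOLO code, of how such a two-stage schedule can be wired up: distillation and mosaic augmentation are both active only in stage one, and the short final stage runs without either. The model API (`return_feats`, `compute_loss`), the 16-epoch stage-two length, and the 0.5 distillation weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

TOTAL_EPOCHS = 300
STAGE_TWO_EPOCHS = 16  # assumed length of the no-mosaic final stage

def train_one_epoch(epoch, student, teacher, loader, optimizer, distill_weight=0.5):
    in_stage_one = epoch < TOTAL_EPOCHS - STAGE_TWO_EPOCHS
    loader.dataset.enable_mosaic = in_stage_one      # mosaic is switched off in stage two
    teacher.eval()
    for images, targets in loader:
        preds, s_feats = student(images, return_feats=True)   # assumed API
        loss = student.compute_loss(preds, targets)           # detection loss
        if in_stage_one:                                       # distill only in stage one
            with torch.no_grad():
                t_feats = teacher(images, return_feats=True)[1]
            kd = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
            loss = loss + distill_weight * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```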
Furthermore, two techniques are introduced into the distillation: an alignment module aligns the feature map sizes of the teacher and student, and normalization weakens the impact of numerical-scale fluctuations between teacher and student features. It was also observed that increasing the weight of the distillation loss slows the convergence of the classification loss and causes significant fluctuation. Therefore, DAMO-YOLO uses a smaller distillation weight to control the distillation loss and reduce its conflict with the classification loss.
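The sketch below illustrates these two tricks under stated assumptions; it is not the official implementation. A 1x1-convolution "align" module maps the student feature map to the teacher's channel dimension, both feature maps are L2-normalized so the loss is insensitive to their absolute numerical scales, and the resulting loss is scaled by a small weight. All module and parameter names, and the exact normalization scheme, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedDistillLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # alignment module: map student features to the teacher's channel size
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, s_feat, t_feat):
        s_feat = self.align(s_feat)
        if s_feat.shape[-2:] != t_feat.shape[-2:]:
            # resize spatially if the two feature maps have different strides
            s_feat = F.interpolate(s_feat, size=t_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        # normalization: remove per-channel scale differences between teacher and student
        s_feat = F.normalize(s_feat.flatten(2), dim=-1)
        t_feat = F.normalize(t_feat.flatten(2), dim=-1)
        return F.mse_loss(s_feat, t_feat)

# A small weight keeps the distillation loss from dominating the classification loss.
distill_loss_fn = AlignedDistillLoss(student_channels=128, teacher_channels=256)
# total_loss = det_loss + 0.5 * distill_loss_fn(s_feat, t_feat)   # weight value is an assumption
```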