We consider the clip tokens learned at different scales as representations of different views of the clip instance Then we cluster clip representations learnedin all scales while enforcing consistenc

The benefit of clustering clip representations at different scales is that it allows us to capture multiple perspectives of the same instance. For instance, a clip of a person walking can be viewed from different angles, and these different views can provide complementary information about the action. By clustering representations learned at different scales, we can capture this complementary information and ensure that the resulting embedding space is robust to variations in viewpoint.

Enforcing consistency between cluster assignments produced from different scales is important because it promotes a shared understanding of the clip across all scales. For example, if the representation of a clip at a coarser scale suggests that it contains a person walking, then the representations at finer scales should also reflect this information. By using a swapped prediction mechanism, we ensure that the cluster assignments are consistent across all scales, which encourages the model to learn a more coherent representation of the clip.

Overall, the clustering of clip representations at different scales and the enforcement of consistency between cluster assignments is an effective way to regularize the embedding space during training and improve the model's ability to capture the common semantic information hidden across different scales

We consider the clip tokens learned at different scales as representations of different views of the clip instance Then we cluster clip representations learnedin all scales while enforcing consistenc