Polish: However the above models did not deeply explore the entity information of each modality and only used pre-trained encoders of each modality for encoding so their ability to mine deep features of t
However, the models mentioned above did not fully exploit the entity information in each modality; they relied solely on a pre-trained encoder for each modality, which limits their ability to extract deep features from the data. For instance, the same real-world movie entity may be represented by different images in different knowledge graphs, so encoding those images with a visual model alone can yield low similarity. The posters, however, often contain similar text, such as the movie title and slogan. Extracting this text to supplement the entity's image information can further improve alignment accuracy.
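The idea of supplementing image similarity with poster text can be illustrated with a minimal sketch. It is not the method of any particular model: it assumes the image embeddings and the poster text have already been produced by some visual encoder and OCR step, and the function names (`cosine_similarity`, `token_jaccard`, `fused_similarity`) and the `text_weight` parameter are hypothetical choices made for illustration.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def token_jaccard(text_a, text_b):
    """Jaccard overlap of lowercase word tokens in two strings."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fused_similarity(img_a, img_b, text_a, text_b, text_weight=0.5):
    """Weighted mix of visual and poster-text similarity.

    text_weight is an assumed hyperparameter in [0, 1]; the source
    does not specify how the two signals should be combined.
    """
    visual = cosine_similarity(img_a, img_b)
    textual = token_jaccard(text_a, text_b)
    return (1.0 - text_weight) * visual + text_weight * textual


# Toy inputs: two posters of the same movie whose (made-up) image
# embeddings are orthogonal, but whose extracted text is identical.
poster_a_embedding = [1.0, 0.0]
poster_b_embedding = [0.0, 1.0]
poster_a_text = "Inception: Your mind is the scene of the crime"
poster_b_text = "INCEPTION your mind is the scene of the crime"

score = fused_similarity(
    poster_a_embedding, poster_b_embedding, poster_a_text, poster_b_text
)
```

With these toy inputs the visual similarity is 0 and the text similarity is 1, so the fused score is 0.5: the shared title and slogan recover a match that the image embeddings alone would miss. A real system would likely replace the token overlap with a learned text encoder and tune the weighting on validation data.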
Original source: http://www.cveoy.top/t/topic/buza. Copyright belongs to the author. Do not repost or scrape.