Translate: As the core component in our model to fuse the multi-modal context, the architecture of the visual-linguistic fusion module (abbreviated as V-L module) is simple and elegant. Specifically, the V-L modul

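As the core component of our model for fusing multi-modal context, the visual-linguistic fusion module (V-L module for short) has a simple and elegant architecture. Specifically, the V-L module consists of two linear projection layers (one for each modality) and a visual-linguistic transformer (a stack of 6 transformer encoder layers).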
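
To make the structure described above concrete, here is a minimal NumPy sketch: two linear projections (one per modality) feeding a stack of 6 encoder layers. It is only an illustration under stated assumptions, not the authors' implementation. The names `VLModule` and `EncoderLayer`, the dimensions, the random weights, the single-head attention without biases or positional encodings, and the choice to concatenate the two token sequences before the encoder are all hypothetical; the passage does not specify any of them.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


class EncoderLayer:
    """Simplified single-head encoder layer with random weights (assumption)."""

    def __init__(self, dim, ffn_dim, rng):
        s = 1.0 / np.sqrt(dim)
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0.0, s, (dim, dim)) for _ in range(4)
        )
        self.w1 = rng.normal(0.0, s, (dim, ffn_dim))
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(ffn_dim), (ffn_dim, dim))

    def __call__(self, x):
        # Self-attention over all tokens, followed by residual + layer norm.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        x = layer_norm(x + attn @ v @ self.wo)
        # Position-wise feed-forward network, followed by residual + layer norm.
        ffn = np.maximum(x @ self.w1, 0.0) @ self.w2
        return layer_norm(x + ffn)


class VLModule:
    """Two linear projections (one per modality) plus a 6-layer encoder stack."""

    def __init__(self, visual_dim, text_dim, hidden_dim=64, num_layers=6, seed=0):
        rng = np.random.default_rng(seed)
        # One projection per modality, mapping into a shared hidden size.
        self.visual_proj = rng.normal(
            0.0, 1.0 / np.sqrt(visual_dim), (visual_dim, hidden_dim)
        )
        self.text_proj = rng.normal(
            0.0, 1.0 / np.sqrt(text_dim), (text_dim, hidden_dim)
        )
        self.layers = [
            EncoderLayer(hidden_dim, 4 * hidden_dim, rng) for _ in range(num_layers)
        ]

    def __call__(self, visual_tokens, text_tokens):
        # Assumption: projected tokens are joined into one sequence before fusion.
        tokens = np.concatenate(
            [visual_tokens @ self.visual_proj, text_tokens @ self.text_proj], axis=0
        )
        for layer in self.layers:
            tokens = layer(tokens)
        return tokens


module = VLModule(visual_dim=32, text_dim=48)
data_rng = np.random.default_rng(1)
fused = module(data_rng.normal(size=(10, 32)), data_rng.normal(size=(5, 48)))
print(fused.shape)  # (15, 64)
```

With 10 visual tokens and 5 text tokens, the output holds 15 fused tokens of the shared hidden size. In practice a deep-learning framework's own encoder layers would replace the hand-written `EncoderLayer`.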
