Translate: We leverage the output state of the [REG] token from the V-L module as the input of our prediction head. To perform box coordinates prediction, we attach a regression block to the [REG] token. The regression b

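We use the output state of the [REG] token from the V-L module as the input to our prediction head. To predict box coordinates, we attach a regression block to the [REG] token. The regression block is implemented as an MLP with two ReLU-activated 256-dimensional hidden layers and a linear output layer. The output of the prediction head is the 4-dimensional box coordinates.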
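
For readers who want to see the shape of the head described above, here is a minimal, framework-free sketch in plain Python. It is only an illustration under stated assumptions, not the authors' implementation: the input width `token_dim`, the weight initialization, and all function names (`init_linear`, `make_regression_block`, `predict_box`) are hypothetical. The passage specifies only the two 256-dimensional ReLU hidden layers, the linear output layer, and the 4-dimensional output. It does not say how the four coordinates are parameterized or whether the output is normalized, so the sketch leaves the output as a raw linear projection.

```python
import random

HIDDEN_DIM = 256  # hidden width stated in the passage
BOX_DIM = 4       # the head outputs 4-dimensional box coordinates


def init_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer; weights hold out_dim rows of in_dim values."""
    scale = 1.0 / in_dim ** 0.5  # assumption: simple uniform init, not from the source
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_regression_block(token_dim, rng):
    """Two hidden layers of width 256 followed by a linear output layer of width 4."""
    return [
        init_linear(token_dim, HIDDEN_DIM, rng),
        init_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
        init_linear(HIDDEN_DIM, BOX_DIM, rng),
    ]


def predict_box(reg_state, block):
    """Map the [REG] token's output state to 4 box coordinates."""
    h = relu(linear(reg_state, block[0]))
    h = relu(linear(h, block[1]))
    return linear(h, block[2])  # linear output layer, no activation


rng = random.Random(0)
token_dim = 256  # assumption: the width of the [REG] output state is not given in the passage
block = make_regression_block(token_dim, rng)
reg_state = [rng.gauss(0.0, 1.0) for _ in range(token_dim)]  # stand-in for the V-L module output
box = predict_box(reg_state, block)
print(len(box))  # prints 4
```

In a real model the same structure would normally be written with a deep learning framework so that the weights can be trained; the plain-Python version here only makes the layer sizes and the data flow explicit.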
