Traditional transformer structures for computer vision typically obtain the Query and Key through 1x1 convolutions and then multiply Query by Key to form the attention matrix. This construction ignores the contextual relationships between adjacent positions in the Key matrix, so the extracted information is incomplete. This paper therefore adopts the Contextual Transformer (CoT) block to mine contextual information among neighboring positions, enriching the attention matrix and improving the expressive power of the output features. Specifically, the CoT block is placed in the encoder of the segmentation network, where it combines the self-attention mechanism with convolution operations: a convolution over the Key extracts rich contextual information, which is then used to compute a new attention matrix, as shown in Figure 4. Figure 4(a) shows the CoT structure, and Figure 4(b) shows the CoT block structure.
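To make the idea concrete, the following is a minimal PyTorch sketch of a CoT-style block. The static context is a grouped 3x3 convolution over the input (the contextualized Key); two consecutive 1x1 convolutions over the concatenated [contextual Key; Query] produce the attention matrix, which weights the Values to form the dynamic context. The specific hyperparameters (kernel size 3, reduction factor 4, 4 convolution groups) and the softmax-based fusion are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoTBlock(nn.Module):
    """Sketch of a Contextual Transformer (CoT) block.

    Hyperparameters here (kernel_size=3, reduction factor 4,
    groups=4) are assumptions for illustration only.
    """

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Static context: a grouped k x k convolution acts as the
        # contextualized Key, encoding relations among neighboring positions.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Values come from a plain 1x1 convolution.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Two consecutive 1x1 convolutions map [contextual Key; Query]
        # to the attention matrix.
        factor = 4
        self.attn_embed = nn.Sequential(
            nn.Conv2d(2 * dim, 2 * dim // factor, 1, bias=False),
            nn.BatchNorm2d(2 * dim // factor),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * dim // factor, kernel_size * kernel_size * dim, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        k1 = self.key_embed(x)                   # static context
        v = self.value_embed(x).view(b, c, -1)   # values, flattened spatially
        # The Query is the input itself; concatenating it with the
        # contextual Key lets the 1x1 convs produce context-aware attention.
        att = self.attn_embed(torch.cat([k1, x], dim=1))
        att = att.view(b, c, self.kernel_size ** 2, h * w).mean(dim=2)
        k2 = (F.softmax(att, dim=-1) * v).view(b, c, h, w)  # dynamic context
        return k1 + k2                           # fuse static + dynamic

# Usage: the block is shape-preserving, so it can replace a standard
# self-attention layer inside an encoder stage.
block = CoTBlock(dim=16).eval()
inp = torch.randn(2, 16, 8, 8)
with torch.no_grad():
    out = block(inp)
print(out.shape)  # torch.Size([2, 16, 8, 8])
```

Because the output has the same shape as the input, the block drops into an encoder stage without changing the surrounding architecture.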


