Transformer architectures commonly used for computer vision tasks obtain the query and key via 1x1 convolutions and compute the attention matrix from pairwise query-key interactions. This construction ignores the rich contextual relationships between adjacent positions in the key map. This paper therefore exploits contextual information between neighboring positions to enrich the attention matrix and improve the expressive power of the output features. Specifically, it introduces the Contextual Transformer (CoT) block into the encoder of the segmentation network. The CoT block combines the self-attention mechanism with convolution operations to produce a new attention matrix, thereby extracting rich contextual information from the key, as shown in Figure 4. Figure 4(a) illustrates the CoT structure, and Figure 4(b) illustrates the CoT block structure.
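The core idea can be sketched in code. The following is a minimal, simplified 1-D NumPy illustration of the CoT-style attention described above: a static context is gathered from each key's neighborhood (standing in for the k×k convolution over the key map), and the attention weights are derived from the concatenated query and contextualized key rather than from a plain query-key dot product. All weight matrices here are random placeholders; the function name, shapes, and the 1-D simplification are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def cot_attention_1d(x, k=3, seed=0):
    """Toy 1-D sketch of a Contextual Transformer (CoT) style block.

    x: (n, d) array of n tokens with d channels.
    Returns an (n, d) fusion of static and dynamic context.
    Projection weights are random stand-ins for the learned
    1x1 / k x k convolutions of the real block.
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)        # 1x1 conv -> query
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)        # 1x1 conv -> value
    W_a = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # attention head

    q = x @ W_q
    v = x @ W_v

    # Static context: average over each position's k-neighborhood,
    # a 1-D stand-in for the k x k group convolution over the keys.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    k_static = np.stack([xp[i:i + k].mean(axis=0) for i in range(n)])

    # Dynamic attention: computed from [query, contextualized key],
    # so the attention matrix already carries neighborhood context.
    attn_logits = np.concatenate([q, k_static], axis=1) @ W_a  # (n, d)
    attn = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)

    k_dynamic = attn * v          # per-channel weighting of the values
    return k_static + k_dynamic   # fuse static and dynamic context
```

In the full 2-D block, the static and dynamic context maps are fused by an attention-based aggregation rather than a plain sum; the sum here only keeps the sketch short.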

Enhancing Attention with Contextual Information for Computer Vision Transformers

