ResNet50 with CBAM: Enhanced Feature Extraction and Downsampling

Here, MS is the spatial attention module, F represents the operation of the convolutional layer, and F' denotes the features obtained after passing through the spatial attention module. The CBAM module is integrated into every residual block of ResNet50. In the classical ResNet model, downsampling is performed by a convolution with a 1x1 kernel and a stride of 2. Because such a convolution reads only one of every four spatial positions of its input, it inevitably loses information. This study therefore avoids 1x1 convolutions with a stride of 2. Figure 3 shows the new residual block structure, which is modified from the original structures (a) and (b). With the introduction of the CBAM module, the downsampling 1x1 convolution is replaced by a 3x3 convolution. When downsampling is necessary, the mapping component enlarges its convolution kernel from 1x1 to 2x2.
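
The information-loss argument can be checked with a short, framework-free sketch that counts which input positions a strided convolution actually reads. The function names, the 8x8 input size, and the padding of 1 for the 3x3 case are illustrative assumptions and are not taken from the study; only the kernel sizes and the stride of 2 come from the text.

```python
def covered_positions(size, kernel, stride, padding=0):
    """Return (output_size, set of input positions read by at least one window).

    Models a square convolution over a size x size input. Positions that fall
    in the zero padding are not counted, since they carry no input information.
    """
    out = (size + 2 * padding - kernel) // stride + 1
    covered = set()
    for i in range(out):
        for j in range(out):
            for di in range(kernel):
                for dj in range(kernel):
                    r = i * stride + di - padding
                    c = j * stride + dj - padding
                    if 0 <= r < size and 0 <= c < size:
                        covered.add((r, c))
    return out, covered


def coverage(size, kernel, stride, padding=0):
    """Return (output_size, fraction of input positions that are read)."""
    out, covered = covered_positions(size, kernel, stride, padding)
    return out, len(covered) / (size * size)


if __name__ == "__main__":
    size = 8  # assumed toy feature-map size
    # (label, kernel, stride, padding); padding=1 for 3x3 is an assumption
    configs = [("1x1, stride 2", 1, 2, 0),
               ("2x2, stride 2", 2, 2, 0),
               ("3x3, stride 2", 3, 2, 1)]
    for label, k, s, p in configs:
        out, frac = coverage(size, k, s, p)
        print(f"{label}: output {out}x{out}, input coverage {frac:.2f}")
```

Under these assumptions all three settings halve the spatial resolution, but the 1x1 kernel reads only a quarter of the input positions, whereas the 2x2 and 3x3 kernels read all of them. This is consistent with the motivation for enlarging the kernels in the downsampling path and in the mapping component.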