Enhancing Image Encoding with a Hybrid Convolutional Neural Network and Transformer Architecture

Convolution, feature mapping, and pooling operations are well established as effective means of extracting rich information from images. The Transformer, owing to its advantages in parallel computation, modeling of long-range dependencies, and interpretability, has likewise been widely applied in image processing. To improve the performance of the encoder, this paper combines a convolutional neural network (CNN) module with a Transformer module. By drawing on the strengths of both, the encoder captures the local as well as the global features of the input image and thereby obtains a more comprehensive representation of the image content.
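
To make the idea of such a hybrid encoder concrete, the following is a minimal sketch in plain Python, not the architecture used in this paper. It assumes a single-channel input, one convolution + ReLU + 2x2 max-pooling stage for local features, and one single-head self-attention layer with identity query/key/value projections and a residual connection for global context. All names (conv2d, max_pool2x2, self_attention, hybrid_encode), the kernel count, and the image size are hypothetical choices made only for illustration.

```python
import math
import random


def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Element-wise ReLU on a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (odd trailing rows/columns are dropped)."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
                 fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
             for j in range(w)]
            for i in range(h)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def hybrid_encode(image, kernels):
    """Local features via convolution and pooling, global mixing via attention."""
    # Local stage: one feature map per kernel.
    fmaps = [max_pool2x2(relu(conv2d(image, k))) for k in kernels]
    h, w = len(fmaps[0]), len(fmaps[0][0])
    # Each spatial position becomes one token whose entries are its channel values.
    tokens = [[fm[i][j] for fm in fmaps] for i in range(h) for j in range(w)]
    # Global stage: every token attends to every other token.
    attended = self_attention(tokens)
    # Residual connection keeps the local features alongside the global context.
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]
encoded = hybrid_encode(image, kernels)
print(len(encoded), len(encoded[0]))  # prints "9 4": 3x3 positions, 4 channels
```

In this sketch each output token still describes one local image region, because it originates from a convolutional receptive field, but its values also depend on every other region through the attention weights. A practical encoder would additionally use learned projections, multiple heads, positional encodings, and normalization, none of which are shown here.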