SuperYOLO: Enhancing Object Detection with Super-Resolution for Small Targets in Remote Sensing Images

As mentioned in Section III, the feature maps retained for multiscale detection in the backbone are far smaller than the original input image. Most existing methods apply upsampling operations to recover the feature size. Unfortunately, this approach has had limited success because texture and pattern information is lost, which makes it unsuitable for detecting small targets in RSI that require HR information to be preserved.

To address this issue, as shown in Fig. 2, we introduce an auxiliary SR branch that must meet two requirements. First, the branch should facilitate the extraction of HR information in the backbone and achieve satisfactory detection performance. Second, the branch should not add computation that reduces the inference speed, so that a trade-off between accuracy and computation time is realized during the inference stage. Inspired by the study of Wang et al. [38], in which the proposed super-resolution branch facilitated segmentation tasks without additional requirements, we introduce a simple and effective branch named SR into the framework. Our proposal improves detection accuracy without computation or memory overload, especially for LR input.

Specifically, the SR structure can be regarded as a simple Encoder-Decoder model. We select low-level and high-level features from the backbone to fuse local textures and patterns with semantic information. As depicted in Fig. 4, we take the outputs of the fourth and ninth modules as the low-level and high-level features, respectively. The Encoder integrates these two features. As illustrated in Fig. 5, the Encoder first applies a CR module, which consists of a convolution followed by a ReLU, to the low-level feature. The high-level feature is passed through an Upsampling operation to match the spatial size of the low-level feature, and the two features are then merged by a concatenation operation followed by two CR modules. In the Decoder, the LR feature is upscaled to the HR space, so that the output of the SR module is twice the size of the input image. As illustrated in Fig. 5, the Decoder is implemented using three deconvolutional layers. The SR branch guides the learning of spatial information and transfers it to the main branch, thereby improving the performance of object detection. In addition, we introduce EDSR [43] as our Encoder structure to explore the SR performance and its influence on detection performance.
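To make the size bookkeeping of the Encoder and Decoder described above concrete, the following is a minimal sketch in plain Python that only tracks feature-map shapes; it is not the authors' implementation. Several values are assumptions made for illustration and are not stated in the text: the backbone strides (4 for the low-level feature, 16 for the high-level feature), all channel counts, the assumption that a CR module preserves spatial size, and the deconvolution hyperparameters (kernel 4, stride 2, padding 1, chosen so that each layer doubles the spatial size). The names `FeatureShape`, `cr`, `upsample_to`, `concat`, `deconv`, and `sr_branch_shape` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureShape:
    """Shape of a feature map as (channels, height, width)."""
    channels: int
    height: int
    width: int


def cr(x: FeatureShape, out_channels: int) -> FeatureShape:
    # CR module = convolution + ReLU; assumed here to keep the spatial size.
    return FeatureShape(out_channels, x.height, x.width)


def upsample_to(x: FeatureShape, ref: FeatureShape) -> FeatureShape:
    # Upsampling changes only the spatial size, matching it to `ref`.
    return FeatureShape(x.channels, ref.height, ref.width)


def concat(a: FeatureShape, b: FeatureShape) -> FeatureShape:
    # Channel-wise concatenation requires identical spatial sizes.
    if (a.height, a.width) != (b.height, b.width):
        raise ValueError("spatial sizes must match before concatenation")
    return FeatureShape(a.channels + b.channels, a.height, a.width)


def deconv(x: FeatureShape, out_channels: int,
           kernel: int = 4, stride: int = 2, padding: int = 1) -> FeatureShape:
    # Standard transposed-convolution output size (no dilation or output padding):
    #   out = (in - 1) * stride - 2 * padding + kernel
    # With the assumed defaults this is exactly 2 * in.
    h = (x.height - 1) * stride - 2 * padding + kernel
    w = (x.width - 1) * stride - 2 * padding + kernel
    return FeatureShape(out_channels, h, w)


def sr_branch_shape(low: FeatureShape, high: FeatureShape,
                    mid_channels: int = 64, out_channels: int = 3) -> FeatureShape:
    # Encoder: one CR module on the low-level feature, upsample the
    # high-level feature, concatenate, then two CR modules.
    low = cr(low, mid_channels)
    high = upsample_to(high, low)
    fused = cr(cr(concat(low, high), mid_channels), mid_channels)
    # Decoder: three deconvolutional layers, each doubling the spatial size.
    x = fused
    for ch in (mid_channels, mid_channels, out_channels):
        x = deconv(x, ch)
    return x


# Assumed example: a 512 x 512 input image, low-level feature at stride 4,
# high-level feature at stride 16.
low_feat = FeatureShape(channels=64, height=128, width=128)
high_feat = FeatureShape(channels=256, height=32, width=32)
sr_out = sr_branch_shape(low_feat, high_feat)
print(sr_out)  # FeatureShape(channels=3, height=1024, width=1024)
```

Under these assumed strides, three doubling layers upscale the stride-4 feature by a factor of 8, which yields an output twice the size of the input image, consistent with the description of the Decoder. With a different low-level stride, the number of layers or their scale factors would have to change to reach the same output size.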

To provide a more visually interpretable description, we visualize the backbone features of YOLOv5s, YOLOv5x, and SuperYOLO in Fig. 6. The features are upsampled to the same scale as the input image for comparison. Comparing images (c), (f), and (i); (d), (g), and (j); and (e), (h), and (k) in Fig. 6 shows that, with the assistance of the SR branch, the features of SuperYOLO contain clearer object structures at higher resolution. As a result, we obtain a high-quality HR representation with the SR branch and use the Head of YOLOv5 to detect small objects.
