SuperYOLO: An Enhanced YOLO Framework with Multimodal Fusion and Assisted Super-Resolution

Fig. 2. The overview of the proposed SuperYOLO framework.

Our proposed SuperYOLO framework introduces three key contributions:

1. Removal of the Focus Module: We eliminate the Focus module to preserve high-resolution information, which is crucial for accurate object detection.
2. Multimodal Fusion: SuperYOLO effectively fuses data from multiple modalities, enriching feature representations and improving detection accuracy.
3. Assisted Super-Resolution (SR) Branch: An auxiliary SR branch guides the backbone network to learn better spatial representations during training. This branch is removed during inference for optimized speed.
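To make the first contribution concrete, the sketch below shows the kind of slicing a Focus-style layer performs. The text above does not describe the module's internals, so this is an assumption based on how the Focus layer is commonly implemented in YOLOv5: every 2x2 pixel neighbourhood is spread across four channels, so later layers see a grid with half the height and half the width. The function name `focus_slice` and the toy 4x4 image are hypothetical and for illustration only.

```python
def focus_slice(image):
    """Rearrange one H x W channel into four (H/2) x (W/2) channels.

    Assumption: this mirrors the space-to-depth slicing commonly used in
    YOLOv5's Focus layer; it is not taken from the SuperYOLO text itself.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    channels = []
    # Offsets select (even rows, even cols), (odd rows, even cols),
    # (even rows, odd cols), (odd rows, odd cols), in that order.
    for row_offset, col_offset in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        channel = [row[col_offset::2] for row in image[row_offset::2]]
        channels.append(channel)
    return channels


# Hypothetical 4 x 4 single-channel image.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

sliced = focus_slice(image)
# Four channels, each 2 x 2: the spatial grid is halved in both dimensions.
print(len(sliced), len(sliced[0]), len(sliced[0][0]))
```

The slicing itself keeps every pixel value, but all subsequent layers operate on the halved grid. Dropping the module means the first layers work at the full input resolution, at the price of more computation per layer.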

The architecture is jointly optimized using a mean squared error (MSE) loss for the SR branch and a task-specific loss for object detection. This encourages the backbone to preserve high-resolution information, since the SR branch must reconstruct a high-resolution image from the backbone's features. During testing, the SR branch is discarded, so inference speed remains comparable to that of the baseline YOLO model.
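The joint objective can be sketched as a weighted sum of the two losses. This is a minimal illustration, not the authors' implementation: the text does not state how the two terms are weighted, so `sr_weight` and its default value are assumptions, and the function names and toy inputs are hypothetical.

```python
def mse_loss(predicted, target):
    """Mean squared error between two equally shaped 2-D images (lists of rows)."""
    squared_errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared_errors) / len(squared_errors)


def joint_training_loss(detection_loss, sr_output, hr_target, sr_weight=0.1):
    """Combine the detection loss with the SR branch's MSE loss.

    Assumption: a simple weighted sum; sr_weight is a hypothetical
    hyperparameter, not a value given in the text.
    """
    return detection_loss + sr_weight * mse_loss(sr_output, hr_target)


# Hypothetical values for one training step.
detection_loss = 1.5                  # task-specific loss from the detection head
sr_output = [[0.0, 1.0], [1.0, 0.0]]  # image reconstructed by the SR branch
hr_target = [[0.0, 0.5], [1.0, 1.0]]  # high-resolution reference image

total = joint_training_loss(detection_loss, sr_output, hr_target)
print(total)
```

Only the training objective includes the SR term. At test time the SR branch is not evaluated at all, so it adds no cost to the detection path.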

Key Advantages:

* Improved Accuracy: SuperYOLO's innovative design leads to significant improvements in object detection accuracy.
* Fast Inference: Despite incorporating an SR branch during training, SuperYOLO maintains fast inference speeds comparable to baseline models during testing.
* Efficient Architecture: By removing the Focus module and utilizing an assisted SR branch, SuperYOLO achieves a balance between performance and efficiency.
