Abstract:
In this study, we propose an enhanced lightweight detection model, VSA-YOLOv11n, to address the challenges of detecting small objects in infrared imagery collected by uncrewed aerial vehicles (UAVs), such as weak thermal signatures, complex background interference, and significant scale variation. The proposed model is based on the YOLOv11n architecture and integrates VanillaNet, a streamlined and efficient backbone network, to improve feature extraction capability while significantly reducing inference latency. A structured multi-scale convolutional module is introduced in the feature fusion stage to enhance the model's sensitivity to targets of varying sizes in cluttered backgrounds. Furthermore, an adaptive spatial feature fusion (ASFF) head is incorporated to enable fine-grained semantic aggregation and selective enhancement across scales, improving detection accuracy and robustness for small infrared targets. Extensive experiments on the HIT-UAV infrared small object dataset demonstrate that the proposed model achieves improvements in accuracy, inference speed, and parameter efficiency. Specifically, the model attains an mAP50 of 81.3% with an inference latency of only 1.79 ms, outperforming existing mainstream lightweight detectors. These results highlight the model's practical applicability and deployment potential, particularly in low-altitude infrared scenarios with strict real-time requirements.
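To make the idea behind the ASFF head concrete, the following is a minimal sketch of adaptive spatial feature fusion in its general form: feature maps from several scales are brought to a common resolution, and each pixel is then a weighted sum of the levels, with weights normalized by a softmax so that they sum to 1. This is an illustration under stated assumptions, not the implementation used in VSA-YOLOv11n. The function name `asff_fuse`, the single-channel nested-list representation, and the `logits` inputs (stand-ins for the outputs of learned weight-prediction layers) are hypothetical, and resizing is assumed to have been done beforehand.

```python
import math


def asff_fuse(levels, logits):
    """Fuse same-sized feature maps with per-pixel softmax weights.

    levels: list of L feature maps, each an H x W nested list of floats
            (assumed already resized to a common resolution).
    logits: list of L weight maps of the same H x W shape; hypothetical
            stand-ins for the outputs of learned weight-prediction layers.
    """
    height, width = len(levels[0]), len(levels[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Softmax over the level axis: weights at each pixel sum to 1.
            peak = max(level_logits[y][x] for level_logits in logits)
            exps = [math.exp(level_logits[y][x] - peak) for level_logits in logits]
            total = sum(exps)
            # Weighted sum of the levels at this spatial position.
            fused[y][x] = sum(
                (e / total) * feature[y][x] for e, feature in zip(exps, levels)
            )
    return fused
```

With equal logits the fusion reduces to a plain average of the levels. When one level's logit dominates at a given position, the output there follows that level, which is how a scale that responds strongly to a small target can be emphasized locally.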