Abstract:
Small object detection in UAV-based visible and infrared imagery remains challenging due to scale variation, weak thermal signals, and complex background interference. This paper proposes a dual-modality detection model, built on the YOLOv11 architecture, that integrates receptive field enhancement with global cross-scale semantic fusion. A reparameterized receptive field attention convolution (RFAConv) module uses a dual-branch structure to expand the receptive fields of shallow layers, improving spatial sensitivity and modality adaptability. A Transformer-guided global fusion mechanism aligns semantics across scales non-locally, and a mixed local channel attention module strengthens focus on small-object regions while suppressing noise. Experiments on the VisDrone2021 and HIT-UAV datasets show that the proposed method achieves higher accuracy, better structural efficiency, and greater robustness than existing lightweight and Transformer-based detectors.
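
The abstract does not specify how the dual-branch RFAConv module is reparameterized. As an illustration only, the sketch below assumes the common structural reparameterization scheme in which a parallel 3x3 branch and 1x1 branch used during training are folded into a single 3x3 kernel for inference. It covers one channel only and omits batch normalization and the attention weighting; the function names (`conv2d_same`, `merge_branches`, `dual_branch`) are hypothetical and not taken from the paper.

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' 2D cross-correlation, single channel, odd square kernel."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    # Positions outside the image count as zero padding.
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def dual_branch(image, kernel_3x3, kernel_1x1):
    """Training-time form (assumed): sum of a 3x3 branch and a 1x1 branch."""
    a = conv2d_same(image, kernel_3x3)
    b = conv2d_same(image, kernel_1x1)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def merge_branches(kernel_3x3, kernel_1x1):
    """Inference-time form: fold the 1x1 weight into the centre of the 3x3 kernel."""
    merged = [row[:] for row in kernel_3x3]
    merged[1][1] += kernel_1x1[0][0]
    return merged


if __name__ == "__main__":
    image = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 0, 1, 1], [1, 1, 0, 2]]
    k3 = [[0.5, -1.0, 0.25], [1.0, 2.0, -0.5], [0.0, 0.5, 1.0]]
    k1 = [[1.5]]
    two_branch = dual_branch(image, k3, k1)
    single = conv2d_same(image, merge_branches(k3, k1))
    # Largest element-wise gap between the two forms; expected to be negligible.
    print(max(abs(a - b) for ra, rb in zip(two_branch, single) for a, b in zip(ra, rb)))
```

Because convolution is linear, the merged kernel reproduces the two-branch output while needing only one convolution at inference time, which is one plausible reading of the structural efficiency claimed above.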