Infrared-Visible Object Detection Integrating Global and Local Information
-
Abstract
To improve multimodal object detection in complex scenes, an infrared-visible object detection model that integrates global and local features is proposed. The model adopts a multi-source heterogeneous network architecture. For visible-light images, a convolutional neural network is built from group convolutions based on feature importance; its sliding-window mechanism extracts detailed texture features of objects in local spatial neighborhoods. For infrared images, a vision Transformer is built on an efficient multi-head self-attention mechanism, capturing salient object features through global context modeling. To promote interaction between infrared and visible features, a dynamic-weight adaptive fusion strategy is designed that achieves cross-modal feature complementarity through key-feature enhancement and modal weight allocation. In addition, to alleviate information conflicts when fusing features of different scales, a spatial-channel multi-scale collaborative optimization module is proposed; it applies attention along the spatial and channel dimensions to highlight effective features and suppress noise, thereby reducing mutual interference between features. Experiments on multiple standard datasets show that the proposed method significantly improves multimodal feature extraction, cross-modal feature interaction, and multi-scale feature fusion. Compared with existing networks of the same type, it also achieves better detection performance, enabling accurate and efficient object detection in complex environments.
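As an illustration of what "modal weight allocation" could look like, the following minimal sketch fuses an infrared and a visible feature map with per-channel weights. It is not the proposed model's implementation: the abstract does not state how the weights are computed, so the global-average pooling followed by a softmax over the two modalities is an assumption, and the function names `global_average`, `modal_weights`, and `fuse` are hypothetical. Feature maps are plain nested lists of shape [C][H][W] to keep the example dependency-free.

```python
import math


def global_average(feature):
    """Mean activation of each channel in a [C][H][W] nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def modal_weights(ir, vis):
    """Per-channel softmax over the two modalities' pooled descriptors.

    Assumption: the modality with the stronger mean response in a channel
    receives the larger weight; the two weights always sum to 1.
    """
    weights = []
    for a, b in zip(global_average(ir), global_average(vis)):
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        weights.append((ea / (ea + eb), eb / (ea + eb)))
    return weights


def fuse(ir, vis):
    """Channel-by-channel weighted sum of the two feature maps."""
    fused = []
    for (w_ir, w_vis), ch_ir, ch_vis in zip(modal_weights(ir, vis), ir, vis):
        fused.append([
            [w_ir * x + w_vis * y for x, y in zip(row_ir, row_vis)]
            for row_ir, row_vis in zip(ch_ir, ch_vis)
        ])
    return fused
```

Because the weights are recomputed from each input pair, a channel in which the infrared response dominates leans toward the infrared features, and vice versa. A learned variant would typically replace the fixed softmax with a small trainable layer.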