Abstract:
Existing infrared and visible image fusion approaches cannot fully integrate local and global feature representations, which leads to biased and over-smoothed fused images. To address this, we propose a fusion approach that jointly learns local and global features, termed JLFuse. First, a convolutional transformer is introduced on top of conventional convolutional sampling to strengthen the modeling of global features. Second, a fusion strategy (JLFN) based on spatially separable self-attention is designed, in which transformer modules alternate between locally grouped self-attention and global sub-sampled attention to achieve joint learning of local and global fusion features. Finally, a pyramid design is adopted to extract multiscale features and enhance local feature propagation. Experimental results on the TNO and RoadScene datasets show that the proposed approach outperforms six advanced fusion approaches on multiple objective evaluation metrics; subjectively, its fused images align better with human visual preferences.
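To make the alternating attention pattern described above concrete, the following is a minimal sketch of spatially separable self-attention, i.e., a locally grouped self-attention step followed by a global sub-sampled attention step. It assumes PyTorch; the module names, channel width, window size, and sub-sampling ratio are illustrative assumptions, not the authors' implementation, and the pyramid/multiscale stages are omitted.

```python
# Illustrative sketch only: alternating local (windowed) and global (sub-sampled)
# attention, in the spirit of spatially separable self-attention.
import torch
import torch.nn as nn


class LocallyGroupedSelfAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows (local branch)."""
    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into w x w windows and attend within each window.
        x = x.reshape(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(-1, w * w, C)                  # (B * num_windows, w*w, C)
        out, _ = self.attn(x, x, x)
        out = out.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)


class GlobalSubsampledAttention(nn.Module):
    """Attention whose keys/values come from a spatially sub-sampled map (global branch)."""
    def __init__(self, dim, num_heads=4, sr_ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)             # queries from the full-resolution map
        kv = self.sr(x).flatten(2).transpose(1, 2)   # keys/values from the sub-sampled map
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W)


class AlternatingFusionBlock(nn.Module):
    """One local-then-global attention pair, mirroring the alternating design."""
    def __init__(self, dim):
        super().__init__()
        self.local_attn = LocallyGroupedSelfAttention(dim)
        self.global_attn = GlobalSubsampledAttention(dim)

    def forward(self, x):
        x = x + self.local_attn(x)                   # residual local refinement
        x = x + self.global_attn(x)                  # residual global context
        return x


if __name__ == "__main__":
    feat = torch.randn(1, 32, 56, 56)                # e.g., a fused IR+visible feature map
    print(AlternatingFusionBlock(32)(feat).shape)    # -> torch.Size([1, 32, 56, 56])
```

The local step models fine detail within each window, while the global step attends over a coarsened copy of the whole map, so stacking the two gives each spatial position both neighborhood and image-level context at modest cost.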