Abstract:
Fusing infrared and visible light images can produce fused images that highlight salient targets and provide rich texture details. Traditional infrared and visible image fusion methods depend on hand-crafted design, and current mainstream fusion methods based on convolutional neural networks (CNNs) cannot effectively extract global context information. To address these limitations, a collaborative image fusion network based on the Swin Transformer and CNN is proposed. A Swin Transformer Block module, built from Swin Transformer layers, extracts the global features of an image; these features are fed into a multilevel fusion module embedded with a CNN for local feature extraction, so that the extracted features retain both local and global context information. In addition, an interactive fusion module is introduced to realize cross-modal feature complementarity. The final fused features are passed through an image reconstruction module to generate the fused image. Comparative experiments were conducted on the TNO and RoadScene datasets against eight classic methods. In terms of objective fusion metrics, the proposed method achieved significant improvements over existing models in information entropy, standard deviation, spatial frequency, and multiscale structural similarity. Subjectively, in terms of visual quality, the proposed method effectively retains the thermal radiation information of the infrared image and the detailed texture information of the visible light image, yielding a better fusion result.
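To make the described pipeline concrete, the sketch below shows one way the global/local feature extraction, cross-modal interactive fusion, and reconstruction stages could be wired together in PyTorch. It is a minimal illustration under assumed details, not the paper's implementation: a plain transformer encoder over patches stands in for the windowed Swin Transformer block, the interactive fusion is simplified to concatenation plus convolution, and all module names, channel widths, and layer counts (GlobalBranch, LocalBranch, FusionNet, dim=64) are hypothetical.

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Stand-in for the Swin Transformer Block: a plain transformer
    encoder over non-overlapping patches (windowed attention omitted)."""
    def __init__(self, in_ch=1, dim=64, patch=8):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=dim * 2,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.unembed = nn.ConvTranspose2d(dim, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        tokens = self.embed(x)                   # B x dim x H/p x W/p
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))  # B x N x dim
        tokens = seq.transpose(1, 2).reshape(b, c, h, w)
        return self.unembed(tokens)              # back to full resolution

class LocalBranch(nn.Module):
    """CNN branch extracting local texture features."""
    def __init__(self, in_ch=1, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.net(x)

class FusionNet(nn.Module):
    """End-to-end sketch: per-modality global + local features,
    cross-modal fusion, then image reconstruction."""
    def __init__(self, dim=64):
        super().__init__()
        self.global_ir, self.global_vis = GlobalBranch(dim=dim), GlobalBranch(dim=dim)
        self.local_ir, self.local_vis = LocalBranch(dim=dim), LocalBranch(dim=dim)
        # Simplified interactive fusion: concatenate both modalities' features and mix.
        self.fuse = nn.Sequential(
            nn.Conv2d(dim * 4, dim, 3, padding=1), nn.ReLU(inplace=True))
        self.reconstruct = nn.Sequential(nn.Conv2d(dim, 1, 3, padding=1), nn.Tanh())

    def forward(self, ir, vis):
        feats = torch.cat([self.global_ir(ir), self.local_ir(ir),
                           self.global_vis(vis), self.local_vis(vis)], dim=1)
        return self.reconstruct(self.fuse(feats))

ir = torch.rand(1, 1, 128, 128)    # infrared image
vis = torch.rand(1, 1, 128, 128)   # visible image
fused = FusionNet()(ir, vis)
print(fused.shape)                 # torch.Size([1, 1, 128, 128])
```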