Abstract:
Fusing infrared and visible light images can produce fused images that highlight salient targets and provide rich texture details. Traditional infrared and visible image fusion methods depend on hand-crafted design, and current mainstream fusion methods based on convolutional neural networks (CNNs) cannot effectively extract global context information. To address these limitations, a collaborative image fusion network based on the Swin Transformer and CNN is proposed. A Swin Transformer Block module, built from Swin Transformer layers, extracts the global features of an image; these features are fed into a multilevel fusion module embedded with a CNN for local feature extraction, so that the extracted features retain both local and global context information. In addition, an interactive fusion module is introduced to realize cross-modal feature complementarity. The final fused features are passed through an image reconstruction module to generate the fused image. Comparative experiments were conducted on the TNO and RoadScene datasets against eight classic methods. In terms of objective fusion metrics, the proposed method achieved significant improvements over existing models in information entropy, standard deviation, spatial frequency, and multiscale structural similarity. Subjectively, in terms of visual quality, the proposed method effectively retains the thermal radiation information of the infrared image and the detailed texture information of the visible light image, yielding a better fusion result.
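To make the described pipeline concrete, the sketch below shows one way the global/local feature extraction, cross-modal interactive fusion, and reconstruction stages could be wired together in PyTorch. It is a minimal illustration under assumed details, not the paper's implementation: a plain transformer encoder over patches stands in for the windowed Swin Transformer block, the interactive fusion is simplified to concatenation plus convolution, and all module names, channel widths, and layer counts (GlobalBranch, LocalBranch, FusionNet, dim=64) are hypothetical.

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Stand-in for the Swin Transformer Block: a plain transformer
    encoder over non-overlapping patches (windowed attention omitted)."""
    def __init__(self, in_ch=1, dim=64, patch=8):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=dim * 2,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.unembed = nn.ConvTranspose2d(dim, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        tokens = self.embed(x)                   # B x dim x H/p x W/p
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))  # B x N x dim
        tokens = seq.transpose(1, 2).reshape(b, c, h, w)
        return self.unembed(tokens)              # back to full resolution

class LocalBranch(nn.Module):
    """CNN branch extracting local texture features."""
    def __init__(self, in_ch=1, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.net(x)

class FusionNet(nn.Module):
    """End-to-end sketch: per-modality global + local features,
    cross-modal fusion, then image reconstruction."""
    def __init__(self, dim=64):
        super().__init__()
        self.global_ir, self.global_vis = GlobalBranch(dim=dim), GlobalBranch(dim=dim)
        self.local_ir, self.local_vis = LocalBranch(dim=dim), LocalBranch(dim=dim)
        # Simplified interactive fusion: concatenate both modalities' features and mix.
        self.fuse = nn.Sequential(
            nn.Conv2d(dim * 4, dim, 3, padding=1), nn.ReLU(inplace=True))
        self.reconstruct = nn.Sequential(nn.Conv2d(dim, 1, 3, padding=1), nn.Tanh())

    def forward(self, ir, vis):
        feats = torch.cat([self.global_ir(ir), self.local_ir(ir),
                           self.global_vis(vis), self.local_vis(vis)], dim=1)
        return self.reconstruct(self.fuse(feats))

ir = torch.rand(1, 1, 128, 128)    # infrared image
vis = torch.rand(1, 1, 128, 128)   # visible image
fused = FusionNet()(ir, vis)
print(fused.shape)                 # torch.Size([1, 1, 128, 128])
```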