Infrared and Visible Image Fusion Method Based on the Interactions of Self-Attention Features

Abstract: Existing multi-sensor image fusion methods suffer from insufficient fusion of hierarchical features and difficulty in discriminating decoupled complementary features. To address these issues, this paper proposes an infrared and visible image fusion method based on the interaction of self-attention features. The method constrains the similarity of deep features across the Siamese branches, encouraging multi-level, multi-scale complementary features to be exchanged and fused appropriately through interaction modules. Specifically, the interaction module employs a cross-modal attention mechanism to compute local feature dissimilarities between the multi-modal images, which are then used as interaction coefficients to realize the exchange of features between the upper and lower branches. However, the dissimilarity measure is susceptible to noise, artifacts, and other spurious information, which can be mistaken for complementary features. Because such information is relatively isolated within its own modality, the proposed method identifies it by computing global self-attention coefficients. The final interaction coefficient, composed of a cross-modal attention coefficient and a global self-attention coefficient, enables the effective extraction of complementary features. Furthermore, to ensure the completeness and consistency of the fused features, a feature cycle-consistency loss is introduced to constrain the fusion process, encouraging the fused image to retain richer information from the source images. To accommodate diverse fusion scenarios, a fusion loss function based on masking and pooling is also proposed. The effectiveness and superiority of the proposed method are demonstrated through subjective and objective comparisons with state-of-the-art methods on benchmark datasets such as TNO and RoadScene.
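To make the interaction-coefficient idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it gates a local cross-modal dissimilarity map with a per-modality global self-attention score and uses the product as the coefficient for exchanging features between the two branches. The module name, the 1x1 projection layers, and the exact gating and normalization choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InteractionModule(nn.Module):
    """Illustrative sketch of the interaction-coefficient idea.

    A local cross-modal dissimilarity map marks locations where the
    infrared and visible features disagree, and a per-modality global
    self-attention score down-weights isolated responses (e.g. noise or
    artifacts).  Their product gates how much of each branch's features
    is exchanged with the other branch.  Names and weighting choices are
    assumptions, not the paper's actual architecture.
    """

    def __init__(self, channels: int, embed_dim: int = 64):
        super().__init__()
        # 1x1 projections used to compare the two modalities locally.
        self.proj_ir = nn.Conv2d(channels, embed_dim, kernel_size=1)
        self.proj_vis = nn.Conv2d(channels, embed_dim, kernel_size=1)
        # Query/key projections for the per-modality global self-attention.
        self.query = nn.Conv2d(channels, embed_dim, kernel_size=1)
        self.key = nn.Conv2d(channels, embed_dim, kernel_size=1)

    def _global_consistency(self, feat: torch.Tensor) -> torch.Tensor:
        """Average self-attention received by each location, shape (B, 1, H, W).

        Isolated responses interact weakly with the rest of their own
        feature map, so they receive a low score.
        """
        b, _, h, w = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)            # (B, HW, D)
        k = self.key(feat).flatten(2)                              # (B, D, HW)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)   # (B, HW, HW)
        score = attn.mean(dim=1)                                   # attention received per location
        score = score / (score.amax(dim=1, keepdim=True) + 1e-6)   # normalize to [0, 1]
        return score.view(b, 1, h, w)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor):
        # Local cross-modal dissimilarity in [0, 1]: 0 = identical, 1 = opposite.
        sim = F.cosine_similarity(self.proj_ir(feat_ir),
                                  self.proj_vis(feat_vis), dim=1)
        dissim = (0.5 * (1.0 - sim)).unsqueeze(1)                  # (B, 1, H, W)

        # Interaction coefficients: cross-modal dissimilarity gated by the
        # global self-attention score of the modality providing the feature.
        coeff_ir = dissim * self._global_consistency(feat_ir)
        coeff_vis = dissim * self._global_consistency(feat_vis)

        # Exchange the selected complementary features between branches.
        out_ir = feat_ir + coeff_vis * feat_vis
        out_vis = feat_vis + coeff_ir * feat_ir
        return out_ir, out_vis


if __name__ == "__main__":
    module = InteractionModule(channels=32)
    feat_ir = torch.randn(1, 32, 64, 64)
    feat_vis = torch.randn(1, 32, 64, 64)
    new_ir, new_vis = module(feat_ir, feat_vis)
    print(new_ir.shape, new_vis.shape)  # torch.Size([1, 32, 64, 64]) twice
```

In this sketch, a location contributes to the other branch only when the two modalities disagree there and the response is globally consistent within its own feature map, mirroring the abstract's goal of passing complementary information while suppressing isolated noise and artifacts.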

     
