Abstract:
Current multi-sensor image fusion methods face challenges such as inadequate integration of hierarchical features and difficulty in accurately identifying decoupled complementary features. To address these issues, this paper introduces a novel image fusion approach that leverages the interaction between local cross-modal features and global self-attention. The proposed method constrains the similarity of deep features across twin branches, enabling the effective exchange and fusion of multi-level, multi-scale complementary features through interactive modules. Specifically, the interaction module employs a cross-modal attention mechanism to compute local feature dissimilarities between the multi-modal images, which are then used as interaction coefficients to fuse the upper- and lower-branch features. This dissimilarity metric, however, is susceptible to noise, artifacts, and other irrelevant information, which is often misinterpreted as complementary features. Because such information is isolated within a single modality, the proposed method distinguishes it by computing a global self-attention coefficient. The final interaction coefficient, comprising both cross-modal attention and global self-attention components, enables the efficient extraction of complementary features. Furthermore, to ensure the completeness and consistency of the fused features, a feature cyclic consistency loss is introduced to constrain the fusion process, thereby preserving richer source-image information. To accommodate a wide range of fusion scenarios, this study also proposes a fusion loss function based on masking and pooling operations. The effectiveness and superiority of the proposed approach are demonstrated through comprehensive subjective and objective comparisons with state-of-the-art methods on the TNO and RoadScene benchmark datasets.
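
To make the role of the interaction coefficient concrete, the sketch below illustrates one plausible way to gate a local cross-modal dissimilarity map with a global self-attention weight before exchanging features between two branches. It is a minimal PyTorch-style sketch under stated assumptions, not the paper's definitive implementation; the module and tensor names (CrossModalInteraction, feat_a, feat_b, ir, vis) and the exact combination of the two coefficients are hypothetical.

```python
# Minimal sketch of the interaction idea described in the abstract (assumed design).
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Exchanges complementary features between two branches.

    The interaction coefficient combines (1) a local cross-modal dissimilarity
    map and (2) a global self-attention weight intended to down-weight regions
    that are salient only because of modality-specific noise or artifacts.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv + sigmoid turns the raw dissimilarity into a local attention map.
        self.local_attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Lightweight projections for global self-attention over spatial positions.
        self.query = nn.Conv2d(channels, channels // 4, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 4, kernel_size=1)

    def global_self_attention(self, x: torch.Tensor) -> torch.Tensor:
        """Returns a (B, 1, H, W) map of how much global support each position has."""
        b, c, h, w = x.shape
        q = self.query(x).flatten(2)  # (B, C', HW)
        k = self.key(x).flatten(2)    # (B, C', HW)
        attn = torch.softmax(q.transpose(1, 2) @ k / (q.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        # Average attention received by each position: isolated (noisy) regions
        # attract little attention from the rest of the image.
        support = attn.mean(dim=1).view(b, 1, h, w)
        return support / (support.amax(dim=(2, 3), keepdim=True) + 1e-6)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # Local cross-modal dissimilarity: large where one modality holds
        # information the other lacks (candidate complementary features).
        dissim = torch.abs(feat_a - feat_b)
        local_coeff = self.local_attn(dissim)          # (B, C, H, W)

        # Global self-attention of the source branch suppresses modality-isolated noise.
        global_a = self.global_self_attention(feat_a)  # (B, 1, H, W)
        global_b = self.global_self_attention(feat_b)

        # Final interaction coefficients: local dissimilarity gated by the
        # global support of the branch that provides the feature.
        coeff_b_to_a = local_coeff * global_b
        coeff_a_to_b = local_coeff * global_a

        # Exchange complementary features between the upper and lower branches.
        fused_a = feat_a + coeff_b_to_a * feat_b
        fused_b = feat_b + coeff_a_to_b * feat_a
        return fused_a, fused_b


if __name__ == "__main__":
    ir, vis = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    fused_ir, fused_vis = CrossModalInteraction(64)(ir, vis)
    print(fused_ir.shape, fused_vis.shape)  # torch.Size([2, 64, 32, 32]) each
```

In this sketch the gating choice (multiplying the local dissimilarity by the opposite branch's global support) is one reasonable reading of the abstract; the paper's actual formulation of the combined coefficient and the cyclic consistency and masking/pooling losses is given in the method section.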