Cross-modal Attention Guided RGB-T Fusion Segmentation Network

董莹新; 王茂宁; 钟羽中

Abstract: Deep learning has driven significant progress in computer vision in recent years. However, single-modal RGB images are limited in urban scenes: they are easily degraded by lighting and adverse weather, so models that rely on them are not robust. Image semantic segmentation needs both fine-grained detail information and highly discriminative semantic information, yet repeated downsampling during feature extraction discards detail. To address these problems, this paper proposes CMASeg, a cross-modal attention-guided RGB-T fusion segmentation network. The network uses ResNet-152 as the encoder, enhances features with a channel-spatial attention module, and uses a cross-modal feature refinement module to exploit the complementary information in RGB-T images, achieving effective fusion and feature extraction across the two modalities. Experimental results show that, on the public MFNet dataset, CMASeg reaches 71.7% mean accuracy and 55.9% mean intersection over union, outperforming existing algorithms. The method performs well in urban scenes and offers a new solution for image semantic segmentation.
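The two reported metrics, mean accuracy and mean intersection over union, can be illustrated with a short sketch. The abstract does not define them, so the code below assumes the definitions commonly used in semantic segmentation: per-class accuracy is the fraction of a class's ground-truth pixels that are predicted correctly, per-class IoU is the overlap divided by the union of predicted and ground-truth pixels, and both are averaged over classes. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_acc_and_miou(cm):
    """Return (mean accuracy, mean IoU) under the assumed common definitions.

    Classes with no ground-truth pixels are skipped for accuracy, and classes
    with an empty union are skipped for IoU (an assumption; conventions vary).
    """
    n = len(cm)
    accs, ious = [], []
    for c in range(n):
        tp = cm[c][c]                            # correctly labelled pixels of class c
        gt = sum(cm[c])                          # pixels whose true label is c
        pred = sum(cm[r][c] for r in range(n))   # pixels predicted as c
        union = gt + pred - tp
        if gt > 0:
            accs.append(tp / gt)
        if union > 0:
            ious.append(tp / union)
    macc = sum(accs) / len(accs) if accs else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return macc, miou


# Toy example with three classes and six flattened "pixels" (made-up labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macc, miou = mean_acc_and_miou(confusion_matrix(y_true, y_pred, 3))
print(f"mAcc={macc:.3f}, mIoU={miou:.3f}")
```

Because IoU also penalizes false positives through the union term, a class's IoU never exceeds its accuracy under these definitions, which is consistent with the reported mean IoU being lower than the reported mean accuracy.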
