Abstract:
Deep learning has driven significant advances in computer vision in recent years. However, single-modal RGB images have limitations in urban scenes: they are easily degraded by poor lighting and adverse weather, which lowers robustness. Semantic segmentation also requires detailed, highly discriminative semantic information, yet repeated downsampling during feature extraction can discard fine detail. To address these issues, this study proposes CMASeg, a cross-modal attention-guided RGB-T (RGB-thermal) fusion segmentation network. The network uses ResNet-152 as the encoder and enhances the extracted features with a channel-spatial attention module. A cross-modal feature refinement module then exploits the complementary information in RGB-T image pairs to fuse the multimodal features effectively. Experimental results show that CMASeg achieves 71.7% mean accuracy (mAcc) and 55.9% mean intersection over union (mIoU) on the publicly available MFNet dataset, outperforming the existing algorithms it is compared with. The proposed method performs well in urban scenes and offers a new solution for semantic image segmentation.
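
For readers unfamiliar with the two reported metrics, the following minimal sketch shows how mean accuracy and mean intersection over union are conventionally computed from a confusion matrix. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names `confusion_matrix` and `mean_acc_and_miou` and the toy labels are hypothetical, and the abstract does not specify details such as how unlabeled pixels are handled.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_acc_and_miou(cm):
    """Return (mean per-class accuracy, mean IoU) from a square confusion matrix.

    Assumption: classes that never occur are skipped rather than counted as 0.
    """
    n = len(cm)
    accs, ious = [], []
    for c in range(n):
        tp = cm[c][c]                            # pixels correctly labeled c
        gt = sum(cm[c])                          # pixels whose true label is c
        pred = sum(cm[r][c] for r in range(n))   # pixels predicted as c
        union = gt + pred - tp
        if gt > 0:
            accs.append(tp / gt)
        if union > 0:
            ious.append(tp / union)
    macc = sum(accs) / len(accs) if accs else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return macc, miou


# Toy example: six flattened pixels, three classes (hypothetical data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
macc, miou = mean_acc_and_miou(cm)
print(f"mAcc={macc:.3f}, mIoU={miou:.3f}")  # prints "mAcc=0.667, mIoU=0.500"
```

Mean accuracy averages per-class recall, so it does not penalize false positives, whereas mean intersection over union does; this is why a method's mIoU is typically lower than its mean accuracy, as in the figures reported above.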