RGB-T Salient Object Detection: A Survey
Abstract: In addition to RGB images, thermal infrared (IR) images can be used to extract salient information that is crucial for salient object detection. With the development and popularization of IR sensing equipment, thermal IR images have become readily available, and RGB-T salient object detection has become a popular research topic. However, a comprehensive survey of existing methods is still lacking. First, we briefly introduce machine learning-based RGB-T salient object detection methods, and then focus on two types of deep learning-based methods: those based on convolutional neural networks (CNNs) and those based on vision transformers (ViTs). Subsequently, the relevant datasets and evaluation metrics are introduced, and qualitative and quantitative comparative analyses of representative methods are conducted on these datasets. Finally, the challenges and future development directions of RGB-T salient object detection are summarized and discussed.

Keywords: salient object detection; thermal infrared image; RGB-T salient object detection; deep learning
Table 1 The RGB-T salient object detection datasets

Name   | Year | Scale | Camera equipment        | Disadvantages
VT821  | 2018 | 821   | FLIR A310, SONY TD-2073 | 1. Simple scenes that lack complexity and variety. 2. The camera uses different parameters when capturing RGB and thermal images. 3. Additional whitespace is introduced when aligning the images.
VT1000 | 2019 | 1000  | FLIR SC620              | 1. Potential errors exist because the images are aligned manually. 2. Limited scene complexity and diversity.
VT5000 | 2020 | 5000  | FLIR T640, FLIR T610    | 1. Images are affected by thermal crossover, which makes detection challenging.
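Each of these benchmarks provides spatially aligned RGB/thermal image pairs with pixel-level ground-truth masks, so evaluation code only needs to match files across the three modalities by name. The Python sketch below illustrates this pairing; the folder names ("RGB", "T", "GT"), the .png mask extension, and the load_pairs helper are assumptions made for illustration, not the datasets' documented layout, so adjust them to the copy you download.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def load_pairs(root: str):
    """Yield (name, rgb, thermal, gt) arrays for one dataset split.

    Hypothetical layout: <root>/RGB, <root>/T, and <root>/GT hold
    files that share a stem across modalities.
    """
    root = Path(root)
    for rgb_path in sorted((root / "RGB").iterdir()):
        name = rgb_path.stem
        rgb = np.asarray(Image.open(rgb_path).convert("RGB"))          # H x W x 3
        thermal = np.asarray(Image.open(root / "T" / rgb_path.name).convert("L"))      # H x W
        gt = np.asarray(Image.open(root / "GT" / f"{name}.png").convert("L")) / 255.0  # mask in [0, 1]
        yield name, rgb, thermal, gt
```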
Table 2 Quantitative comparison of machine learning-based RGB-T salient object detection methods

            |          VT821          |          VT1000         |          VT5000
Algorithms  | S↑    F↑    E↑    MAE↓  | S↑    F↑    E↑    MAE↓  | S↑    F↑    E↑    MAE↓
MTMR[7]     | 0.725 0.662 0.815 0.108 | 0.706 0.715 0.836 0.119 | 0.680 0.595 0.795 0.114
M3S-NIR[10] | 0.723 0.734 0.859 0.140 | 0.726 0.717 0.827 0.145 | 0.652 0.575 0.780 0.168
LTCR[12]    | 0.762 0.737 0.854 0.088 | 0.799 0.794 0.872 0.084 | -     -     -     -
MGFL[11]    | 0.782 0.725 0.841 0.071 | 0.820 0.801 0.882 0.066 | 0.751 0.661 0.817 0.085

Note: ↑ indicates that larger values are better and ↓ indicates that smaller values are better; "-" marks results that were not reported. Bold and underline indicate the best and second-best results, respectively.
Table 3 Quantitative comparison of deep learning-based RGB-T salient object detection methods

               |                  |          VT821          |          VT1000         |          VT5000
Algorithms     | Backbone         | S↑    F↑    E↑    MAE↓  | S↑    F↑    E↑    MAE↓  | S↑    F↑    E↑    MAE↓

CNN-based:
FMCF[18]       | VGG16            | 0.760 0.640 0.796 0.080 | 0.873 0.823 0.921 0.037 | 0.814 0.734 0.864 0.055
SGDL[15]       | VGG19            | 0.765 0.730 0.847 0.085 | 0.787 0.764 0.856 0.090 | 0.750 0.672 0.824 0.089
ADFNet[21]     | VGG16            | 0.810 0.716 0.842 0.077 | 0.910 0.847 0.921 0.034 | 0.863 0.778 0.891 0.048
MIDD[22]       | VGG16            | 0.871 0.804 0.895 0.045 | 0.915 0.882 0.933 0.027 | 0.867 0.801 0.897 0.043
CGFNet[23]     | VGG16            | 0.881 0.845 0.912 0.038 | 0.923 0.906 0.944 0.023 | 0.883 0.851 0.922 0.035
CGMDRNet[25]   | Res2Net-50       | 0.894 0.840 0.920 0.035 | 0.931 0.893 0.940 0.020 | 0.896 0.846 0.928 0.032
TNet[27]       | ResNet-50        | 0.898 0.841 0.919 0.030 | 0.928 0.889 0.937 0.021 | 0.894 0.847 0.927 0.033
MIA-DPD[28]    | ResNet-50        | 0.844 -     0.850 0.070 | 0.924 -     0.926 0.025 | 0.879 -     0.893 0.040
MMNet[29]      | ResNet-50        | 0.875 0.798 0.893 0.040 | 0.917 0.863 0.924 0.027 | 0.864 0.785 0.890 0.043
CAVER[30]      | ResNet-50        | 0.891 0.839 0.919 0.033 | 0.935 0.903 0.945 0.018 | 0.891 0.842 0.930 0.032
CSRNet[31]     | ESPNetv2         | 0.885 0.830 0.908 0.038 | 0.918 0.877 0.925 0.024 | 0.868 0.810 0.905 0.042

ViT-based:
SwinNet[35]    | Swin Transformer | 0.904 0.847 0.926 0.030 | 0.938 0.896 0.947 0.018 | 0.912 0.865 0.942 0.026
HRTransNet[37] | HRFormer         | 0.906 0.853 0.929 0.026 | 0.938 0.900 0.945 0.017 | 0.912 0.871 0.945 0.025
MITF-Net[36]   | PVTv2            | 0.905 0.853 0.927 0.027 | 0.938 0.906 0.949 0.016 | 0.910 0.870 0.943 0.025

Note: ↑ indicates that larger values are better and ↓ indicates that smaller values are better; "-" marks results that were not reported. Bold and underline indicate the best and second-best results, respectively.
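For reference, the two simplest metrics reported in Tables 2 and 3 can be computed as below. This is a minimal sketch assuming predictions and masks are float arrays in [0, 1]; the F-measure uses the common adaptive-threshold convention (β² = 0.3), although some papers instead report the maximum or mean F-measure over all thresholds, and the S-measure [39] and E-measure [40] involve structural and alignment terms that are not reproduced here.

```python
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and the ground truth,
    both given as float arrays in [0, 1]; lower is better."""
    return float(np.mean(np.abs(pred - gt)))


def adaptive_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-measure with the widely used adaptive threshold (twice the mean
    saliency value, clipped to 1) and beta^2 = 0.3; higher is better."""
    binary = pred >= min(2.0 * float(pred.mean()), 1.0)  # binarize the prediction
    gt_mask = gt > 0.5                                   # binarize the ground truth
    tp = float(np.logical_and(binary, gt_mask).sum())    # true positives
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt_mask.sum() + 1e-8)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```

Published numbers can therefore differ slightly between papers depending on which F-measure variant is reported and on how the saliency maps are resized and normalized before evaluation.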
References

[1] XU H, ZHANG H, MA J Y. Classification saliency-based rule for visible and infrared image fusion[J]. IEEE Transactions on Computational Imaging, 2021, 7: 824-836. DOI: 10.1109/TCI.2021.3100986
[2] LI G Y, WANG Y K, LIU Z, et al. RGB-T semantic segmentation with location, activation, and sharpening [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1223-1235. DOI: 10.1109/TCSVT.2022.3208833
[3] HOU Y W, LI L H, WANG Y. Intelligent equipment object recognition based on improved YOLO network guided by infrared saliency detection[J]. Infrared Technology, 2020, 42(7): 644-650. http://hwjs.nvir.cn/article/id/hwjs202007007
[4] ITTI L, KOCH C, NIEBUR E. A model of saliency-based visual attention for rapid scene analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(11): 1254-1259. DOI: 10.1109/34.730558
[5] LI C L, CHENG H, HU S Y, et al. Learning collaborative sparse representation for grayscale-thermal tracking[J]. IEEE Transactions on Image Processing, 2016, 25(12): 5743-5756. DOI: 10.1109/TIP.2016.2614135
[6] ZHANG J, ZHANG P, ZHANG Z, et al. Similar HED-Net for salient human detection in thermal infrared images[J]. Infrared Technology, 2023, 45(6): 649-657. http://hwjs.nvir.cn/article/id/bc2b522e-24dc-4229-8ed3-0b973874e0f4
[7] WANG G Z, LI C L, MA Y P, et al. RGB-T saliency detection benchmark: dataset, baselines, analysis and a novel approach[C]//IGTA 2018: The 13th Academic Conference on Image Graphics Technology and Application, 2018: 359-369.
[8] MA Y, SUN D, MENG Q, et al. Learning multiscale deep features and SVM regressors for adaptive RGB-T saliency detection[C]//ISCID 2017: 2017 10th International Symposium on Computational Intelligence and Design, 2017: 389-392.
[9] ZHOU D Y, WESTON J, GRETTON A, et al. Ranking on data manifolds[C]//NIPS 2003: Advances in Neural Information Processing Systems, 2003: 169-176.
[10] TU Z Z, XIA T, LI C L, et al. M3S-NIR: multi-modal multi-scale noise-insensitive ranking for RGB-T saliency detection[C]// MIPR 2019: 2019 IEEE Conference on Multimedia Information Processing and Retrieval, 2019: 141-146.
[11] HUANG L M, SONG K C, WANG J, et al. Multi-graph fusion and learning for RGBT image saliency detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1366-1377. DOI: 10.1109/TCSVT.2021.3069812
[12] HUANG L M, SONG K C, GONG A J, et al. RGB-T saliency detection via low-rank tensor learning and unified collaborative ranking[J]. IEEE Signal Processing Letters, 2020, 27: 1585-1589. DOI: 10.1109/LSP.2020.3020735
[13] ZHANG D M, JIN G Q, DAI F, et al. Salient object detection based on deep fusion of hand-crafted features[J]. Chinese Journal of Computers, 2019, 42(9): 2076-2086.
[14] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//CVPR 2018: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 4510-4520.
[15] TU Z Z, XIA T, LI C L, et al. RGB-T image saliency detection via collaborative graph learning[J]. IEEE Transactions on Multimedia, 2020, 22(1): 160-173. DOI: 10.1109/TMM.2019.2924578
[16] PANG Y, WU H, WU C D. Cross-modal co-feedback cellular automata for RGB-T saliency detection[J]. Pattern Recognition, 2023, 135: 109138.
[17] LIU Z Y, HUANG X S, ZHANG G H, et al. Scribble-supervised RGB-T salient object detection[C]//ICME 2023: Proceedings of the IEEE International Conference on Multimedia and Expo, 2023: 2369-2374.
[18] ZHANG Q, HUANG N C, YAO L, et al. RGB-T salient object detection via fusing multi-level CNN features[J]. IEEE Transactions on Image Processing, 2020, 29: 3321-3335. DOI: 10.1109/TIP.2019.2959253
[19] ZHANG Q, HUANG N C, XIAO T, et al. Revisiting feature fusion for RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1804-1818.
[20] BI H B, WU R W, LIU Z Q, et al. PSNet: parallel symmetric network for RGB-T salient object detection[J]. Neurocomputing, 2022, 511: 410-425. DOI: 10.1016/j.neucom.2022.09.052
[21] TU Z Z, MA Y, LI Z, et al. RGBT salient object detection: a large-scale dataset and benchmark[J]. IEEE Transactions on Multimedia, 2022, 25: 4163-4176.
[22] TU Z Z, LI Z, LI C L, et al. Multi-interactive dual-decoder for RGB-thermal salient object detection[J]. IEEE Transactions on Image Processing, 2021, 30: 5678-5691. DOI: 10.1109/TIP.2021.3087412
[23] WANG J, SONG K C, BAO Y Q, et al. CGFNet: cross-guided fusion network for RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(5): 2949-2961. DOI: 10.1109/TCSVT.2021.3099120
[24] CHEN Q, LIU Z, ZHANG Y, et al. RGB-D salient object detection via 3D convolutional neural networks[C]//AAAI 2021: Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 1063-1071.
[25] CHEN G, SHAO F, CHAI X L, et al. CGMDRNet: cross-guided modality difference reduction network for RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(9): 6308-6323. DOI: 10.1109/TCSVT.2022.3166914
[26] LIAO G B, GAO W, LI G, et al. Cross-collaborative fusion-encoder network for robust rgb-thermal salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(11): 7646-7661. DOI: 10.1109/TCSVT.2022.3184840
[27] CONG R M, ZHANG K P, ZHANG C, et al. Does thermal really always matter for RGB-T salient object detection?[J]. IEEE Transactions on Multimedia, 2022, 25: 1-12.
[28] LIANG Y H, QIN G H, SUN M H, et al. Multi-modal interactive attention and dual progressive decoding network for RGB-D/T salient object detection[J]. Neurocomputing, 2022, 490: 132-145. DOI: 10.1016/j.neucom.2022.03.029
[29] GAO W, LIAO G B, MA S W, et al. Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(4): 2091-2106. DOI: 10.1109/TCSVT.2021.3082939
[30] PANG Y W, ZHAO X Q, ZHANG L H, et al. CAVER: cross-modal view-mixed transformer for bi-modal salient object detection[J]. IEEE Transactions on Image Processing, 2023, 32: 892-904.
[31] ZHOU W J, GUO Q L, LEI J S, et al. ECFFNet: effective and consistent feature fusion network for RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1224-1235. DOI: 10.1109/TCSVT.2021.3077058
[32] ZHOU W J, ZHU Y, LEI J S, et al. LSNet: lightweight spatial boosting network for detecting salient objects in RGB-thermal images[J]. IEEE Transactions on Image Processing, 2023, 32: 1329-1340. DOI: 10.1109/TIP.2023.3242775
[33] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//NIPS 2017: Advances in Neural Information Processing Systems, 2017: 6000-6010.
[34] WANG W H, XIE E Z, LI X, et al. PVTv2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[35] LIU Z Y, TAN Y C, HE Q, et al. SwinNet: swin transformer drives edge-aware RGB-D and RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(7): 4486-4497. DOI: 10.1109/TCSVT.2021.3127149
[36] CHEN G, SHAO F, CHAI X L, et al. Modality-induced transfer-fusion network for RGB-D and RGB-T salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(4): 1787-1801.
[37] TANG B, LIU Z Y, TAN Y C, et al. HRTransNet: HRFormer-driven two-modality salient object detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(2): 728-742.
[38] YUAN Y H, FU R, HUANG L, et al. HRFormer: high-resolution vision transformer for dense prediction[C]//NIPS 2021: Advances in Neural Information Processing Systems, 2021: 7281-7293.
[39] FAN D P, CHENG M M, LIU Y, et al. Structure-measure: a new way to evaluate foreground maps[C]//ICCV 2017: Proceedings of the 2017 IEEE/CVF International Conference on Computer Vision, 2017: 4558-4567.
[40] FAN D P, GONG C, CAO Y, et al. Enhanced-alignment measure for binary foreground map evaluation[C]//IJCAI 2018: The 27th International Joint Conference on Artificial Intelligence, 2018: 698-704.
[41] YAN Q, XU L, SHI J P, et al. Hierarchical saliency detection[C]//CVPR 2013: Proceedings of the 2013 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2013: 1155-1162.
[42] LI Y, HOU X D, KOCH C, et al. The secrets of salient object segmentation[C]//CVPR 2014: Proceedings of the 2014 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014: 280-287.