Infrared Pedestrian Action Recognition Based on Improved Spatial-temporal Two-stream Convolution Network
-
Abstract: To improve pedestrian action recognition accuracy on infrared sequences with complex backgrounds, this study proposes an improved spatial-temporal two-stream network. First, a deep differential network replaces the temporal stream network, improving both the representational power and the extraction efficiency of spatio-temporal features. Second, the model is trained with an improved softmax loss function built on a decision-level feature fusion mechanism, which better preserves the spatio-temporal features of inter-frame images from the different networks and reflects the pedestrian's action category more faithfully. Simulation results show that the proposed network achieves 87% recognition accuracy on the self-built infrared video dataset (the Chinese-language abstract states 81%) and improves computational efficiency by 25%, which gives it high value for engineering applications.
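The abstract names two ingredients, a difference-based temporal input and decision-level fusion of the two streams, without giving their exact form. As a minimal sketch under assumptions, the Python below uses absolute frame differences as a stand-in for the temporal input and fuses the per-stream class scores by a weighted average of softmax probabilities. The function names, the fusion weight `w_spatial`, and the toy frames and scores are all hypothetical; the paper's deep differential network and improved softmax loss are not reproduced here.

```python
import math

def frame_difference(prev_frame, next_frame):
    """Absolute per-pixel difference between two equally sized grayscale frames."""
    return [[abs(b - a) for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]

def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_decisions(spatial_scores, temporal_scores, w_spatial=0.5):
    """Weighted average of the two streams' class probabilities.

    This late-fusion rule is an assumption for illustration only.
    """
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [w_spatial * a + (1.0 - w_spatial) * b
            for a, b in zip(p_spatial, p_temporal)]

# Toy 2x2 infrared frames (hypothetical intensity values).
frame_a = [[10, 10], [10, 10]]
frame_b = [[10, 14], [7, 10]]
diff = frame_difference(frame_a, frame_b)

# Hypothetical raw scores for three action classes from each stream.
spatial_scores = [2.0, 1.0, 0.1]
temporal_scores = [0.5, 2.5, 0.2]
fused = fuse_decisions(spatial_scores, temporal_scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Fusing at the decision level keeps each stream's classifier independent, so either stream can be swapped out or reweighted without retraining the other.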
-
Table 1. Classes and sample counts of the dataset

| No. | Category | Total |
|---|---|---|
| 1 | Walk | 152 |
| 2 | Stand | 203 |
| 3 | Climb | 186 |
| 4 | Jog | 265 |
| 5 | Jump | 174 |
| 6 | Punch | 128 |
| 7 | Lying | 295 |
| 8 | Wave1 | 168 |
| 9 | Wave2 | 177 |
| 10 | Crouch | 312 |
| 11 | Sitting | 268 |
| 12 | Handclapping | 208 |
| 13 | Push | 158 |
| 14 | Fight | 119 |
| 15 | Handshake | 134 |
| 16 | Hug | 168 |
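The per-class counts in Table 1 can be summarized with a few lines of arithmetic. The sketch below copies the counts from the table and derives the dataset size, the largest and smallest classes, and their ratio; the variable names are illustrative only.

```python
# Per-class sample counts copied from Table 1.
class_counts = {
    "Walk": 152, "Stand": 203, "Climb": 186, "Jog": 265,
    "Jump": 174, "Punch": 128, "Lying": 295, "Wave1": 168,
    "Wave2": 177, "Crouch": 312, "Sitting": 268, "Handclapping": 208,
    "Push": 158, "Fight": 119, "Handshake": 134, "Hug": 168,
}

total_samples = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)

# Ratio between the most and least populated classes.
imbalance_ratio = class_counts[largest] / class_counts[smallest]

# Share of the dataset taken by each class, in percent.
shares = {name: 100.0 * n / total_samples for name, n in class_counts.items()}

print(total_samples)               # 3115
print(largest, smallest)           # Crouch Fight
print(round(imbalance_ratio, 2))   # 2.62
```

The largest class holds roughly 2.6 times as many samples as the smallest, which is worth keeping in mind when reading per-class results.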
Table 2. Performance analysis of different modules (DDN, IS, DF)

| Modules enabled (of DDN, IS, DF) | Pr/% | FPS |
|---|---|---|
| 0 | 77.12 | 13.9 |
| 1 | 77.83 | 18.1 |
| 1 | 79.91 | 13.8 |
| 1 | 79.78 | 12.7 |
| 2 | 81.79 | 17.8 |
| 2 | 82.09 | 18.5 |
| 2 | 81.83 | 11.6 |
| 3 | 83.01 | 17.7 |
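The first row of Table 2 (no modules enabled) and the last row (all three enabled) give the overall effect of the modifications. The sketch below computes the accuracy gain in percentage points and the relative frame-rate gain; treating relative FPS change as the measure of "computational efficiency" is an assumption, since the source does not state how its 25% figure was derived.

```python
# First and last rows of Table 2.
baseline = {"pr": 77.12, "fps": 13.9}    # no modules enabled
full_model = {"pr": 83.01, "fps": 17.7}  # DDN, IS and DF all enabled

# Accuracy gain in percentage points.
pr_gain_points = full_model["pr"] - baseline["pr"]

# Relative frame-rate gain in percent (assumed efficiency measure).
fps_gain_pct = 100.0 * (full_model["fps"] - baseline["fps"]) / baseline["fps"]

print(round(pr_gain_points, 2))  # 5.89
print(round(fps_gain_pct, 1))    # 27.3
```

Under this reading the frame rate rises by about 27%, the same order as the 25% quoted in the abstract.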
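Table 3. Performance analysis of different comparison models

| Category | IDT Pr | IDT Mr | IDT Rr | C3D Pr | C3D Mr | C3D Rr | SCNN-3G Pr | SCNN-3G Mr | SCNN-3G Rr | L-LSTM Pr | L-LSTM Mr | L-LSTM Rr | Ts-3D Pr | Ts-3D Mr | Ts-3D Rr | OFGF Pr | OFGF Mr | OFGF Rr | Ours Pr | Ours Mr | Ours Rr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Walk | 64 | 27 | 70 | 66 | 21 | 72 | 68 | 23 | 72 | 74 | 19 | 77 | 76 | 27 | 74 | 79 | 16 | 80 | 78 | 10 | 80 |
| Stand | 72 | 20 | 75 | 76 | 19 | 77 | 76 | 19 | 74 | 82 | 19 | 87 | 84 | 20 | 75 | 84 | 16 | 85 | 85 | 20 | 86 |
| Climb | 50 | 36 | 61 | 53 | 31 | 63 | 61 | 34 | 66 | 66 | 25 | 67 | 71 | 36 | 61 | 76 | 24 | 81 | 78 | 16 | 81 |
| Jog | 66 | 28 | 70 | 68 | 23 | 75 | 70 | 23 | 70 | 67 | 28 | 76 | 71 | 28 | 70 | 76 | 19 | 78 | 86 | 8 | 90 |
| Jump | 60 | 32 | 65 | 61 | 31 | 68 | 67 | 34 | 67 | 60 | 32 | 74 | 72 | 32 | 65 | 72 | 22 | 77 | 71 | 16 | 80 |
| Punch | 41 | 50 | 44 | 41 | 40 | 43 | 46 | 51 | 48 | 51 | 40 | 58 | 60 | 50 | 64 | 61 | 30 | 64 | 67 | 22 | 69 |
| Lying | 56 | 36 | 60 | 57 | 31 | 66 | 59 | 33 | 65 | 56 | 36 | 67 | 70 | 30 | 67 | 66 | 22 | 69 | 67 | 16 | 70 |
| Wave1 | 65 | 31 | 65 | 68 | 29 | 68 | 68 | 30 | 68 | 65 | 31 | 76 | 72 | 23 | 75 | 75 | 11 | 80 | 82 | 11 | 85 |
| Wave2 | 68 | 28 | 69 | 70 | 30 | 71 | 71 | 23 | 76 | 68 | 28 | 87 | 78 | 28 | 79 | 81 | 17 | 86 | 88 | 8 | 88 |
| Crouch | 41 | 29 | 41 | 43 | 34 | 45 | 44 | 23 | 46 | 41 | 29 | 58 | 53 | 20 | 50 | 60 | 22 | 61 | 68 | 26 | 71 |
| Sitting | 70 | 24 | 78 | 73 | 28 | 80 | 72 | 28 | 79 | 71 | 24 | 81 | 78 | 19 | 81 | 80 | 15 | 88 | 82 | 14 | 87 |
| Handclap | 37 | 33 | 38 | 38 | 34 | 42 | 38 | 30 | 33 | 37 | 33 | 50 | 45 | 23 | 58 | 67 | 22 | 68 | 72 | 23 | 76 |
| Push | 41 | 46 | 44 | 44 | 47 | 46 | 42 | 42 | 47 | 41 | 46 | 57 | 66 | 30 | 64 | 71 | 23 | 74 | 71 | 16 | 79 |
| Fight | 53 | 35 | 57 | 58 | 30 | 58 | 56 | 31 | 58 | 53 | 35 | 67 | 67 | 29 | 67 | 63 | 15 | 77 | 80 | 13 | 80 |
| Handshake | 62 | 29 | 67 | 65 | 31 | 70 | 66 | 26 | 70 | 62 | 29 | 76 | 71 | 20 | 77 | 75 | 19 | 87 | 76 | 22 | 81 |
| Hug | 67 | 26 | 69 | 66 | 27 | 72 | 61 | 28 | 74 | 76 | 28 | 74 | 74 | 26 | 78 | 78 | 25 | 79 | 81 | 14 | 85 |
| Mixed dataset | 57 | 31 | 60 | 59 | 30 | 63 | 60 | 29 | 63 | 60 | 30 | 70 | 69 | 27 | 69 | 72 | 18 | 77 | 77 | 15 | 80 |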
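One way to read the "Mixed dataset" row of Table 3 is as an aggregate over the sixteen classes. The source does not say how that row was computed, so the sketch below only checks one hypothesis: an unweighted (macro) average of the per-class Pr values in the last method column. The variable names are illustrative.

```python
# Per-class Pr values of the proposed method, copied from Table 3.
ours_pr = {
    "Walk": 78, "Stand": 85, "Climb": 78, "Jog": 86,
    "Jump": 71, "Punch": 67, "Lying": 67, "Wave1": 82,
    "Wave2": 88, "Crouch": 68, "Sitting": 82, "Handclap": 72,
    "Push": 71, "Fight": 80, "Handshake": 76, "Hug": 81,
}

# "Mixed dataset" Pr values from Table 3.
reported_mixed_pr_ours = 77
reported_mixed_pr_ofgf = 72

# Unweighted mean over classes (every class counts equally).
macro_pr = sum(ours_pr.values()) / len(ours_pr)

# Margin over OFGF on the mixed dataset, in percentage points.
margin_over_ofgf = reported_mixed_pr_ours - reported_mixed_pr_ofgf

print(macro_pr)          # 77.0
print(margin_over_ofgf)  # 5
```

The macro average matches the reported value of 77 for this column, which is consistent with, though not proof of, an unweighted aggregate.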