Video Action Recognition in Complex Background Based on Deep Learning

doi:10.3969/j.issn.1006-2475.2020.07.019

Abstract

Recognizing human actions in videos has broad application prospects and considerable potential economic value. However, recognition accuracy is degraded by factors such as swaying, background changes, camera shake, and moving shadows. To reduce the influence of such complex backgrounds, we propose the non-local temporal segment network (NLTSNet). NLTSNet builds on the temporal segment network and adds non-local modules on top of ResNet to capture the non-local spatial and temporal information contained in video clips. To further improve robustness against stationary cluttered backgrounds, we integrate optical flow into the non-local module. Finally, we adopt a learnable ensemble network to fuse the predictions from the appearance and temporal modalities. Extensive experiments on the TDAP dataset show that the proposed method recognizes human actions in complex backgrounds more accurately than several state-of-the-art methods, without increasing time complexity.

Key words: action recognition, non-local module, temporal segment network, complex background, self-attention

CLC Number: TP391

PAN Chen-ting, TAN Xiao-yang. Video Action Recognition in Complex Background Based on Deep Learning[J]. Computer and Modernization, 2020, 0(07): 97-103.
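
Illustrative note: the abstract describes non-local modules that let every space-time position in a clip attend to every other position. The sketch below shows one common form of that operation, the Gaussian (softmax-normalized dot-product) variant with a residual connection. It is not the authors' NLTSNet implementation: the function name `non_local` is hypothetical, the learned embeddings and output projection of a real non-local block are assumed to be identity maps, and the optical-flow integration is not modeled.

```python
import math


def non_local(x):
    """Apply a simplified non-local operation.

    x is a list of N feature vectors (lists of floats), one per
    space-time position. Assumption: all learned transforms are identity.
    """
    out = []
    for xi in x:
        # Similarity of position i to every position j: dot product x_i . x_j
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        # Softmax over j (subtract the max for numerical stability)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # y_i: similarity-weighted average of the features at all positions
        yi = [
            sum(w * xj[d] for w, xj in zip(weights, x)) / total
            for d in range(len(xi))
        ]
        # Residual connection: z_i = x_i + y_i
        out.append([a + b for a, b in zip(xi, yi)])
    return out


# Usage: three positions with two-dimensional features
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
response = non_local(features)
```

Because each output position aggregates over all input positions, the response at one location can draw on evidence from distant frames and regions, which is the property the abstract relies on for handling cluttered backgrounds.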
