Video Action Recognition in Complex Background Based on Deep Learning

doi:10.3969/j.issn.1006-2475.2020.07.019

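Abstract

Abstract (Chinese version, translated): Recognizing human actions in video quickly and effectively has broad application prospects and potential economic value. However, current video action recognition methods are easily affected by background factors such as swaying of the moving person, background changes, camera shake, and shadows cast by the moving person. To address these problems, this paper proposes a non-local temporal segment network method. Building on the two-stream network, the method adds non-local computation so that the network can attend to information over a larger spatio-temporal range, and further incorporates optical flow information so that the network focuses its attention more precisely on the action region, thereby improving robustness to complex static backgrounds in video. In addition, to fuse the multiple discrimination results of the two-stream segment network, this paper uses a learnable weighted average in place of a simple average to fuse the multimodal information. Experiments on the TDAP dataset show that the model recognizes human actions fairly accurately in complex backgrounds and, compared with the original model, improves recognition performance with almost no increase in time complexity.

Abstract: Recognizing human actions in videos has broad application prospects and considerable potential economic value. However, the accuracy of video action recognition is affected by a number of factors such as body swaying, background changes, camera shake, and moving shadows. To reduce the influence of such complex backgrounds, we propose non-local temporal segment networks (NLTSNet). NLTSNet is based on the temporal segment network and is enhanced with non-local modules built on ResNet so as to capture the non-local spatial and temporal information contained in video clips. To further improve the network's robustness against stationary cluttered backgrounds, we integrate optical flow into the non-local module. Finally, we adopt a learnable ensemble network to fuse the prediction results from the appearance and temporal modalities. Experimental results on the TDAP dataset show that our method recognizes human actions in complex backgrounds more accurately than several state-of-the-art methods, with almost no increase in time complexity.

Key words: action recognition, non-local module, temporal segment network, complex background, self-attention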
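The two ideas named in the abstract, non-local computation and a learnable weighted average over the two streams, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`non_local`, `fuse`, `learn_fusion_logits`) are hypothetical, the non-local step uses plain dot-product attention with a residual connection and omits both the learned projections and the optical-flow guidance described in the abstract, and the fusion weights are fitted with a simple finite-difference gradient descent purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def non_local(features):
    """Simplified non-local operation over N positions.

    features: list of N feature vectors, one per (flattened) space-time position.
    Each output is x_i plus an attention-weighted sum over ALL positions x_j,
    so every position can draw on information from the whole clip.
    Assumption: no learned projections; similarity is a raw dot product.
    """
    out = []
    for xi in features:
        sims = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        attn = softmax(sims)
        agg = [sum(w * xj[d] for w, xj in zip(attn, features))
               for d in range(len(xi))]
        out.append([a + g for a, g in zip(xi, agg)])  # residual connection
    return out


def fuse(rgb_scores, flow_scores, logits):
    """Weighted average of two streams' class scores.

    logits: two free parameters; softmax keeps the weights positive and
    summing to 1. logits == [0, 0] reproduces the simple average.
    """
    w_rgb, w_flow = softmax(logits)
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]


def cross_entropy(scores, label):
    """Negative log-likelihood of the true label under softmax(scores)."""
    return -math.log(softmax(scores)[label])


def learn_fusion_logits(samples, steps=200, lr=0.5, eps=1e-4):
    """Fit the fusion logits by gradient descent (finite-difference gradients).

    samples: list of (rgb_scores, flow_scores, label) tuples.
    """
    logits = [0.0, 0.0]  # start from the simple average

    def loss(lg):
        return sum(cross_entropy(fuse(r, f, lg), y)
                   for r, f, y in samples) / len(samples)

    for _ in range(steps):
        grads = []
        for k in range(2):
            up = list(logits)
            up[k] += eps
            down = list(logits)
            down[k] -= eps
            grads.append((loss(up) - loss(down)) / (2 * eps))
        logits = [l - lr * g for l, g in zip(logits, grads)]
    return logits
```

In this toy setting, if one stream is consistently more reliable on the training samples, the learned weights shift toward it, which is the behavior a simple average cannot provide.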

CLC number: TP391

PAN Chen-ting (潘陈听), TAN Xiao-yang (谭晓阳). Video Action Recognition in Complex Background Based on Deep Learning[J]. Computer and Modernization (计算机与现代化), 2020(07): 97-103.

