Computer and Modernization

• Image Processing •

Multi-modal Action Recognition Based on Deep Learning Framework

  HAN Min-jie

  1. (南京理工大学计算机科学与工程学院,江苏南京210094)
  • 收稿日期:2016-11-14 出版日期:2017-07-20 发布日期:2017-07-20
  • 作者简介:韩敏捷(1990-),男,江苏南京人,南京理工大学计算机科学与工程学院硕士研究生,研究方向:计算机视觉,动作行为识别。
  • 基金资助:
    国家自然科学基金资助项目(61672285)

Abstract: This paper proposes a multi-modal action recognition approach based on deep neural networks. According to the characteristics of each modality, a different deep network is applied to each kind of video information, and the networks are combined to exploit multi-modal features for action recognition. We mainly consider the static and dynamic modalities of human action: with a Microsoft Kinect multi-sensor camera, the conventional RGB video and the corresponding depth skeleton data can be captured simultaneously. A Convolutional Neural Network is used for the static RGB information, and a Recurrent Neural Network for the dynamic skeleton information. Finally, the features extracted by the two networks are fused to recognize and classify actions. Experimental results on the MSR 3D action dataset show that the proposed method achieves good classification performance.
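
A minimal PyTorch sketch of the two-stream design described in the abstract (not code from the paper): a small CNN encodes a static RGB frame, an LSTM is used as the recurrent network over flattened 3-D joint coordinates, and the two feature vectors are concatenated before a linear classifier. The class name, layer sizes, input resolution, and joint count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalActionNet(nn.Module):
    """Sketch of CNN (static RGB) + RNN (dynamic skeleton) feature fusion.
    All architectural choices here are assumptions, not the paper's exact model."""

    def __init__(self, num_joints=20, num_classes=20):
        super().__init__()
        # Static stream: small CNN over one RGB frame (3 x 112 x 112 assumed).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (N, 64)
        )
        # Dynamic stream: LSTM over a sequence of flattened 3-D joint coordinates.
        self.rnn = nn.LSTM(input_size=num_joints * 3, hidden_size=128,
                           batch_first=True)
        # Fusion: concatenate the two modality features, then classify.
        self.classifier = nn.Linear(64 + 128, num_classes)

    def forward(self, rgb, skeleton):
        # rgb: (N, 3, H, W); skeleton: (N, T, num_joints * 3)
        static_feat = self.cnn(rgb)
        _, (h_n, _) = self.rnn(skeleton)
        dynamic_feat = h_n[-1]                               # last hidden state
        return self.classifier(torch.cat([static_feat, dynamic_feat], dim=1))

# Example forward pass with random tensors shaped like one Kinect-style batch.
model = MultiModalActionNet()
logits = model(torch.randn(4, 3, 112, 112), torch.randn(4, 30, 60))
print(logits.shape)  # torch.Size([4, 20])
```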

Key words: deep learning, multi-modality, action recognition

CLC Number: