Computer and Modernization (计算机与现代化), 2020, Vol. 0, Issue (07): 97-103. doi: 10.3969/j.issn.1006-2475.2020.07.019

• Image Processing •

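  • About the authors: Pan Chenting (潘陈听, 1995-), male, from Wuyi, Zhejiang, master's student; research interests: computer vision and deep learning; E-mail: pct9978@outlook.com. Tan Xiaoyang (谭晓阳, 1971-), male, professor, Ph.D.; research interests: face recognition, computer vision, machine learning, and reinforcement learning; E-mail: x.tan@nuaa.edu.cn.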
  • Funding:
    National Natural Science Foundation of China (61976115, 61672280, 61732006); NUAA Artificial Intelligence+ Project (56XZA18009)

Video Action Recognition in Complex Background Based on Deep Learning

  (1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
    2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China;
    3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 211106, China)
  • Online: 2020-07-06 Published: 2020-07-15

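Abstract (translated from the Chinese): Recognizing human actions in video quickly and effectively has broad application prospects and potential economic value. Current video action recognition methods, however, are easily affected by background factors such as swaying of the moving person, background changes, camera shake, and shadows cast by the moving person. To address these problems, this paper proposes a non-local temporal segment network. Building on the two-stream network, the method adds non-local computation so that the network can attend to information over a larger spatio-temporal range, and further incorporates optical flow so that the network focuses its attention more precisely on the action region, which improves robustness to complex static backgrounds in video. In addition, to fuse the multiple prediction results of the two-stream segment network, the paper replaces simple averaging with a learnable weighted average for fusing multi-modal information. Experiments on the TDAP dataset show that the model recognizes human actions fairly accurately against complex backgrounds and, compared with the original model, improves recognition performance with almost no increase in time complexity.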


Abstract: Recognizing human actions in videos has broad application prospects and considerable potential economic value. However, the accuracy of video action recognition is degraded by factors such as swaying of the moving subject, background changes, camera shake, and moving shadows. To reduce the influence of such complex backgrounds, we propose the non-local temporal segment network (NLTSNet). NLTSNet builds on the temporal segment network and adds non-local modules to the ResNet to capture non-local spatial and temporal information in video clips. To further improve the network's robustness to stationary cluttered backgrounds, we integrate optical flow into the non-local module. Finally, we adopt a learnable ensemble network to fuse the predictions from the appearance and temporal modalities. Experimental results on the TDAP dataset show that our method recognizes human actions in complex backgrounds more accurately than several state-of-the-art methods, with almost no increase in time complexity.

Key words: action recognition, non-local module, temporal segment network, complex background, self-attention
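
The non-local computation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in NLTSNet, so the code below assumes the generic form of a non-local operation: every spatio-temporal position attends to every other position through softmax-normalized dot-product similarity, and the result is added back through a residual connection. Learned embedding weights and the optical-flow input are omitted, and the names `non_local`, `softmax`, and `dot` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def non_local(features):
    """Parameter-free non-local operation with a residual connection.

    features: list of N feature vectors, one per spatio-temporal position
    (all T*H*W positions flattened into a single axis). Assumption: no
    learned embeddings, so similarity is a plain dot product.
    """
    out = []
    for x_i in features:
        # Attention of position i over every position j, near or far.
        weights = softmax([dot(x_i, x_j) for x_j in features])
        # Weighted sum of the features at all positions.
        y_i = [sum(w * x_j[c] for w, x_j in zip(weights, features))
               for c in range(len(x_i))]
        # Residual connection keeps the original signal.
        out.append([a + b for a, b in zip(x_i, y_i)])
    return out


# Three positions; the first and third carry the same feature vector.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = non_local(feats)
```

Because the attention weights span all positions, the first and third positions reinforce each other even though they are not adjacent, which is the "larger spatio-temporal range" the abstract refers to.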

CLC number:
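
The abstract's idea of replacing simple averaging with a learnable weighted average can be sketched as follows. The abstract does not describe the fusion network's parameterization, so this example assumes one softmax-normalized weight per stream, trained by gradient descent on cross-entropy. The function names, the toy probabilities, and the hyperparameters `lr` and `epochs` are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(stream_probs, theta):
    """Weighted average of per-stream class probabilities."""
    weights = softmax(theta)
    n_classes = len(stream_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, stream_probs))
            for c in range(n_classes)]


def train_fusion(samples, n_streams, lr=0.5, epochs=50):
    """Learn fusion logits by gradient descent on cross-entropy."""
    theta = [0.0] * n_streams  # equal weights: starts as simple averaging
    for _ in range(epochs):
        for stream_probs, label in samples:
            weights = softmax(theta)
            fused_y = fuse(stream_probs, theta)[label]
            for m in range(n_streams):
                # Gradient of -log(fused_y) w.r.t. theta[m] when the
                # weights are softmax(theta).
                grad = weights[m] * (1.0 - stream_probs[m][label] / fused_y)
                theta[m] -= lr * grad
    return theta


# Toy data (made up): each sample is ([rgb_probs, flow_probs], true_label),
# and the flow stream is the more reliable one on every sample.
samples = [
    ([[0.6, 0.4], [0.9, 0.1]], 0),
    ([[0.5, 0.5], [0.2, 0.8]], 1),
    ([[0.7, 0.3], [0.3, 0.7]], 1),
]
theta = train_fusion(samples, n_streams=2)
w_rgb, w_flow = softmax(theta)
```

Starting from equal weights reproduces simple averaging, and training then shifts weight toward whichever stream predicts the labels better; on this toy data that is the flow stream.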