Group Activity Recognition Algorithm Based on Interaction Relationship Grouping Modeling Fusion

Abstract

Abstract: Modeling the interaction relationships among group members is the core technique of group activity recognition. In complex scenes, group relationships are intricate, and relational reasoning suffers from high complexity and information redundancy. To address these problems, we propose a model that reasons about interaction relationships by subgroup. First, a CNN and RoIAlign extract scene information and individual information from each video frame as initial features, and the people in the scene are divided into two subgroups by their spatial coordinates (for example, in the Volleyball dataset, participants are sorted by the X coordinates of their bounding boxes, each person is assigned an ordinal ID, and the 12 players are split into 2 groups from left to right). Second, a Graph Convolutional Network (GCN) performs interaction relationship reasoning separately on the two local subgroups and on the global scene group, and the key persons within each are identified. Then, taking the global relation features as the ground truth and the merged local relation features of the two subgroups as the prediction, a cross-entropy loss is constructed between them and fed back to optimize the preceding subgroup interaction GCNs, with the aim of ensuring that the key persons of the two subgroups match the global key persons. Next, guided by the key-person information from the global interaction relationship, matching is performed against the key persons of each of the two subgroups; the successfully matched key persons in the two subgroups are taken as target nodes to build an inter-group relation graph, from which GCN reasoning yields the inter-group relation features. Finally, the initial features are fused with the inter-group and the global interaction relation features respectively to obtain two group activity branches, and decision fusion over the two branches gives the final recognition result. Experiments show that the method achieves accuracies of 93.1% on the Volleyball dataset and 48.1% on the NBA dataset.

Key words: grouping interaction relationship fusion, key person matching, decision fusion, group behavior recognition
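
The two-subgroup split described in the abstract (sort participants by bounding-box X coordinate, assign ordinal IDs, divide left to right) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `split_into_two_groups`, the `(x_min, y_min, x_max, y_max)` box format, the use of the horizontal box center as "the X coordinate", and the even split are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def split_into_two_groups(
    boxes: List[Box],
) -> Tuple[List[int], List[int], Dict[int, int]]:
    """Sort people left to right and split them into two subgroups.

    Returns (left_group, right_group, ordinal_id). The groups hold indices
    into `boxes`; ordinal_id maps each index to its left-to-right rank.
    """
    # Assumption: the horizontal box center stands in for "the X coordinate".
    def center_x(i: int) -> float:
        return (boxes[i][0] + boxes[i][2]) / 2.0

    # sorted() is stable, so people with equal centers keep their input order.
    order = sorted(range(len(boxes)), key=center_x)
    ordinal_id = {person: rank for rank, person in enumerate(order)}

    # Assumption: an even split; with 12 players this gives 6 and 6.
    half = len(order) // 2
    return order[:half], order[half:], ordinal_id


if __name__ == "__main__":
    demo = [(50, 0, 60, 10), (10, 0, 20, 10), (90, 0, 100, 10), (30, 0, 40, 10)]
    left, right, ids = split_into_two_groups(demo)
    print(left, right, ids)
```

Under these assumptions, each returned index list would select the node features for one local relation graph, while the full set of people feeds the global graph.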

WANG Chuan-xu, LIU Ran. Group Activity Recognition Algorithm Based on Interaction Relationship Grouping Modeling Fusion[J]. Computer and Modernization, 2022, 0(01): 1-9.
