DNeStCount: A Data-dependent Encoder-decoder Architecture with Split-attention for Crowd Counting

Abstract

Abstract: Crowd count estimation is a key component of crowd management systems and is essential for preventing stampede accidents and guiding crowds. It has become an increasingly important task and a challenging research direction. This paper proposes a data-dependent encoder-decoder architecture with split-attention for crowd counting, called DNeStCount. To cope with the scale variation and perspective distortion found in video surveillance, denser atrous sampling rates are applied in the design of the dense atrous spatial pyramid pooling (DASPP) module. To improve the accuracy of density map estimation, the learnable, data-dependent upsampling method DUpsampling is applied in the design of the data-dependent feature aggregation (DFA) module. To offset the sensitivity to outliers and the training instability that the Euclidean loss may exhibit, the loss function is built on the Smooth L1 loss. Experiments and analyses on challenging datasets show that DNeStCount is competitive with other mainstream methods.

Key words: crowd counting, encoder-decoder architecture, split-attention mechanism, dense atrous spatial pyramid pooling, data-dependent upsampling, Smooth L1 loss
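The abstract's motivation for replacing the Euclidean loss with the Smooth L1 loss can be illustrated with a minimal sketch. It uses the standard Smooth L1 definition (quadratic below a threshold, linear above it). The threshold `beta=1.0`, the function names, and the toy values are assumptions made for illustration; the abstract does not state the paper's actual loss configuration.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over paired values.

    beta is the switch point between the quadratic and linear regimes
    (assumed to be 1.0 here; the paper's setting is not given in the abstract).
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta   # quadratic near zero: smooth gradient
        else:
            total += d - 0.5 * beta       # linear for large errors: bounded gradient
    return total / len(pred)


def mean_squared_error(pred, target):
    """Mean squared (Euclidean-style) loss, for comparison."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy density values: three small errors and one outlier (hypothetical data).
target = [0.0, 0.0, 0.0, 0.0]
pred = [0.1, -0.2, 0.1, 10.0]

print("Smooth L1:", smooth_l1(pred, target))
print("MSE:      ", mean_squared_error(pred, target))
```

In this toy case the single outlier contributes linearly to the Smooth L1 value but quadratically to the squared-error value, which is the outlier sensitivity the abstract refers to.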

MENG Xiao-long. DNeStCount: A Data-dependent Encoder-decoder Architecture with Split-attention for Crowd Counting[J]. Computer and Modernization, 2022, 0(09): 68-77.
