Computer and Modernization ›› 2022, Vol. 0 ›› Issue (04): 103-109.

• Pattern Recognition •

Audio-visual Eye Fixation Prediction Based on Audio-visual Consistency

1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
• Online: 2022-05-07  Published: 2022-05-07
• About the authors: YUAN Meng (born 1994), female, from Heze, Shandong, is a master's student; her research interest is computer vision, E-mail: yuanmeng_926@163.com. YU Xiaoyu (born 1995), female, from Jinan, Shandong, is a master's student; her research interest is image quality assessment, E-mail: yuxiaoyu_yu@163.com.

Abstract: Existing audio-visual eye fixation prediction algorithms use a two-stream structure to extract audio and visual features separately and then fuse the two to obtain the final prediction map. However, the audio and visual information in a dataset are not always correlated, so fusing the features directly when the two modalities are inconsistent allows the audio information to degrade the visual features. To address this problem, this paper proposes an eye fixation prediction network based on audio-visual consistency, the Audio-visual Consistency Network (AVCN). To verify its reliability, AVCN is added to an existing audio-visual eye fixation prediction model. AVCN makes a binary judgment on whether the extracted audio and visual features are consistent: when they are, the fused audio-visual features are output as the final prediction map; otherwise, the visually dominant features are output as the final result. Experiments on six open datasets show that adding AVCN improves the model's overall metrics.
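
As a concrete illustration of the gating described above, the following is a minimal PyTorch sketch. It is not the authors' implementation: the module name AudioVisualGate, the feature dimensions, the additive fusion, and the sigmoid consistency score with a 0.5 threshold are all assumptions made for this example.

    import torch
    import torch.nn as nn

    class AudioVisualGate(nn.Module):
        """Hypothetical sketch of AVCN-style consistency gating:
        fused features when audio and vision agree, visual-dominant
        features otherwise."""

        def __init__(self, audio_dim=128, visual_dim=512):
            super().__init__()
            # Small classifier scoring audio-visual consistency in [0, 1].
            self.consistency = nn.Sequential(
                nn.Linear(audio_dim + visual_dim, 256),
                nn.ReLU(inplace=True),
                nn.Linear(256, 1),
                nn.Sigmoid(),
            )
            # Project audio features into the visual feature space for fusion.
            self.audio_proj = nn.Linear(audio_dim, visual_dim)

        def forward(self, audio_feat, visual_feat):
            # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
            score = self.consistency(torch.cat([audio_feat, visual_feat], dim=1))
            fused = visual_feat + self.audio_proj(audio_feat)  # simple additive fusion
            consistent = (score > 0.5).float()  # hard binary consistency decision
            return consistent * fused + (1.0 - consistent) * visual_feat

    # Usage: gate = AudioVisualGate()
    #        out = gate(torch.randn(4, 128), torch.randn(4, 512))  # -> (4, 512)

A soft variant could weight the fusion by the consistency score instead of thresholding it; the hard 0/1 gate above mirrors the binary consistency judgment described in the abstract.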

Key words: computer vision, eye fixation prediction, audio-visual consistency
