Computer and Modernization ›› 2022, Vol. 0 ›› Issue (04): 103-109.


Audio-visual Eye Fixation Prediction Based on Audio-visual Consistency

  

  1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
  • Online: 2022-05-07    Published: 2022-05-07

Abstract: Existing audio-visual eye fixation prediction algorithms use a two-stream structure to extract audio and visual features separately and then fuse them to obtain the final prediction map. However, the audio and visual information in a dataset are not always correlated, so when the audio and visual features are inconsistent, fusing them directly has a negative impact on the visual features. To address this problem, this paper proposes an audio-visual consistency network (AVCN) for eye fixation prediction based on audio-visual consistency. To verify its reliability, the proposed consistency module is added to an existing audio-visual eye fixation detection model. AVCN performs a binary consistency judgment on the extracted audio and visual features: when the two are consistent, the fused audio-visual features are output as the final prediction map; otherwise, the visually dominated features are output as the final result. The method is evaluated on six publicly available datasets, and the results show that the proposed AVCN model achieves better performance.
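The gating idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of cosine similarity, and the fixed threshold are all assumptions for illustration; the actual AVCN presumably learns the binary consistency judgment from data.

```python
import numpy as np

def consistency_gate(audio_feat, visual_feat, fused_feat, threshold=0.5):
    """Binary audio-visual consistency judgment (illustrative sketch).

    If the audio and visual feature vectors are judged consistent
    (here: cosine similarity >= `threshold`, a hypothetical criterion),
    return the fused audio-visual features; otherwise fall back to the
    visual features alone, so that inconsistent audio cannot degrade
    the final prediction.
    """
    cos = np.dot(audio_feat, visual_feat) / (
        np.linalg.norm(audio_feat) * np.linalg.norm(visual_feat) + 1e-8
    )
    return fused_feat if cos >= threshold else visual_feat

# Consistent case: aligned audio/visual features -> fused output is used.
a = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])
f = np.array([2.0, 2.0])
print(consistency_gate(a, v, f))  # fused features

# Inconsistent case: orthogonal audio features -> visual-only fallback.
a_bad = np.array([0.0, 1.0])
print(consistency_gate(a_bad, v, f))  # visual features
```

In the paper the features would be deep feature maps rather than 2-D vectors, but the control flow, fuse when consistent, otherwise keep the visual branch, is the same.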

Key words: computer vision, eye fixation prediction, audio-visual consistency