[1] SHIMOJO S, SHAMS L. Sensory modalities are not separate modalities: Plasticity and interactions[J]. Current Opinion in Neurobiology, 2001,11(4):505-509.
[2] ZHAO J X, LIU J J, FAN D P, et al. EGNet: Edge guidance network for salient object detection[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. 2019:8778-8787.
[3] PORZI L, HOFINGER M, RUIZ I, et al. Learning multi-object tracking and segmentation from automatic annotations[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:6845-6854.
[4] PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition[C]// Proceedings of the 2019 Annual Conference of the International Speech Communication Association. 2019:2613-2617.
[5] TSIAMI A, KOUTRAS P, MARAGOS P. STAViS: Spatio-temporal audiovisual saliency network[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:4765-4775.
[6] JAIN S, YARLAGADDA P, SUBRAMANIAN R, et al. AViNet: Diving deep into audio-visual saliency prediction[J]. arXiv preprint arXiv:2012.06170, 2020.
[7] SIDATY N, LARABI M C, SAADANE A. Toward an audiovisual attention model for multimodal video content[J]. Neurocomputing, 2017,259:94-111.
[8] MIN X K, ZHAI G T, GU K, et al. Fixation prediction through multimodal analysis[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2017,13(1). DOI: 10.1145/2996463.
[9] MIN X K, ZHAI G T, ZHOU J T, et al. A multimodal saliency model for videos with high audio-visual correspondence[J]. IEEE Transactions on Image Processing, 2020,29:3805-3819.
[10] TSIAMI A, KOUTRAS P, KATSAMANIS A, et al. A behaviorally inspired fusion approach for computational audiovisual saliency modeling[J]. Signal Processing: Image Communication, 2019,76:186-200.
[11] TAVAKOLI H R, BORJI A, RAHTU E, et al. DAVE: A deep audio-visual embedding for dynamic saliency prediction[J]. arXiv preprint arXiv:1905.10693, 2019.
[12] AYTAR Y, VONDRICK C, TORRALBA A. SoundNet: Learning sound representations from unlabeled video[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016:892-900.
[13] MITAL P K, SMITH T J, HILL R L, et al. Clustering of gaze during dynamic scene viewing is predicted by motion[J]. Cognitive Computation, 2011,3(1):5-24.
[14] MIN X K, ZHAI G T, HU C J, et al. Fixation prediction through multimodal analysis[C]// Proceedings of the 2015 Visual Communications and Image Processing (VCIP). 2015. DOI: 10.1109/VCIP.2015.7457921.
[15] COUTROT A, GUYADER N. Toward the introduction of auditory information in dynamic visual attention models[C]// Proceedings of the 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). 2013. DOI: 10.1109/WIAMIS.2013.6616164.
[16] COUTROT A, GUYADER N. How saliency, faces, and sound influence gaze in dynamic social scenes[J]. Journal of Vision, 2014,14(8). DOI: 10.1167/14.8.5.
[17] GYGLI M, GRABNER H, RIEMENSCHNEIDER H, et al. Creating summaries from user videos[C]// Proceedings of the 2014 European Conference on Computer Vision. 2014:505-520.
[18] KOUTRAS P, MARAGOS P. A perceptually based spatio-temporal computational framework for visual saliency estimation[J]. Signal Processing: Image Communication, 2015,38:15-31.
[19] RAJASHEKAR U, BOVIK A C, CORMACK L K. Visual search in noise: Revealing the influence of structural cues by gaze-contingent classification image analysis[J]. Journal of Vision, 2006,6(4):379-386.
[20] LEE S M, XIN J H, WESTLAND S. Evaluation of image similarity by histogram intersection[J]. Color Research and Application, 2005,30(4):265-274.
[21] PETERS R J, IYER A, ITTI L, et al. Components of bottom-up gaze allocation in natural images[J]. Vision Research, 2005,45(18):2397-2416.
[22] ZHANG L Y, TONG M H, MARKS T K, et al. SUN: A Bayesian framework for saliency using natural statistics[J]. Journal of Vision, 2008,8(7). DOI: 10.1167/8.7.32.
[23] JUDD T, DURAND F, TORRALBA A. A benchmark of computational models of saliency to predict human fixations[R]. MIT-CSAIL-TR-2012-001, 2012.
[24] ZHANG D W, HAN J W, ZHANG Y. Supervision by fusion: Towards unsupervised learning of deep salient object detector[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. 2017:4068-4076.
[25] ZHANG J, YU X, LI A X, et al. Weakly-supervised salient object detection via scribble annotations[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:12543-12552.
[26] ZENG Y, ZHUGE Y Z, LU H C, et al. Multi-source weak supervision for saliency detection[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:6067-6076.
[27] JIANG L, XU M, LIU T, et al. DeepVS: A deep learning based video saliency prediction approach[C]// Proceedings of the 2018 European Conference on Computer Vision. 2018:625-642.
[28] WANG W G, SHEN J B, GUO F, et al. Revisiting video saliency: A large-scale benchmark and a new model[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:4894-4903.