Computer and Modernization ›› 2022, Vol. 0 ›› Issue (04): 103-109.

• Pattern Recognition •

Audio-visual Eye Fixation Prediction Based on Audio-visual Consistency

1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
• Online: 2022-05-07  Published: 2022-05-07
• About the authors: YUAN Meng (born 1994), female, from Heze, Shandong, is a master's student; her research interest is computer vision, E-mail: yuanmeng_926@163.com. YU Xiaoyu (born 1995), female, from Jinan, Shandong, is a master's student; her research interest is image quality assessment, E-mail: yuxiaoyu_yu@163.com.

Abstract: Existing audio-visual eye fixation prediction algorithms use a two-stream structure to extract audio and visual features separately and then fuse the two to obtain the final prediction map. However, the audio and visual information in a dataset are not always correlated, so fusing the features directly when the two modalities are inconsistent allows the audio information to degrade the visual features. To address this problem, this paper proposes an eye fixation prediction network based on audio-visual consistency, the Audio-visual Consistency Network (AVCN). To verify its reliability, AVCN is added to an existing audio-visual eye fixation prediction model. AVCN makes a binary judgment on whether the extracted audio and visual features are consistent: when they are, the fused audio-visual features are output as the final prediction map; otherwise, the visually dominant features are output as the final result. Experiments on six open datasets show that adding AVCN improves the model's overall metrics.
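
As a concrete illustration of the gating described above, the following is a minimal PyTorch sketch. It is not the authors' implementation: the module name AudioVisualGate, the feature dimensions, the additive fusion, and the sigmoid consistency score with a 0.5 threshold are all assumptions made for this example.

    import torch
    import torch.nn as nn

    class AudioVisualGate(nn.Module):
        """Hypothetical sketch of AVCN-style consistency gating:
        fused features when audio and vision agree, visual-dominant
        features otherwise."""

        def __init__(self, audio_dim=128, visual_dim=512):
            super().__init__()
            # Small classifier scoring audio-visual consistency in [0, 1].
            self.consistency = nn.Sequential(
                nn.Linear(audio_dim + visual_dim, 256),
                nn.ReLU(inplace=True),
                nn.Linear(256, 1),
                nn.Sigmoid(),
            )
            # Project audio features into the visual feature space for fusion.
            self.audio_proj = nn.Linear(audio_dim, visual_dim)

        def forward(self, audio_feat, visual_feat):
            # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
            score = self.consistency(torch.cat([audio_feat, visual_feat], dim=1))
            fused = visual_feat + self.audio_proj(audio_feat)  # simple additive fusion
            consistent = (score > 0.5).float()  # hard binary consistency decision
            return consistent * fused + (1.0 - consistent) * visual_feat

    # Usage: gate = AudioVisualGate()
    #        out = gate(torch.randn(4, 128), torch.randn(4, 512))  # -> (4, 512)

A soft variant could weight the fusion by the consistency score instead of thresholding it; the hard 0/1 gate above mirrors the binary consistency judgment described in the abstract.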

Key words: computer vision, eye fixation prediction, audio-visual consistency
