Computer and Modernization ›› 2025, Vol. 0 ›› Issue (02): 86-93. DOI: 10.3969/j.issn.1006-2475.2025.02.012

• Image Processing •

Twin Feature Fusion Network for Scene Text Image Super Resolution


  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710060, China)
  • Online: 2025-02-28  Published: 2025-02-28
  • Supported by:
    Shaanxi Province Youth Program (2022JQ-624); China University Industry Research Innovation Fund (2021ALA02002); China Textile Industry Association Higher Education Teaching Reform Research Project (2021BKJGLX004)

Abstract: Scene text image super-resolution (STISR) aims to enhance the resolution and legibility of text images, thereby improving the performance of downstream text recognition tasks. Previous studies have shown that introducing text-prior information can better guide the super-resolution reconstruction process. However, existing methods neither exploit the text prior effectively nor fuse it fully with image features, which limits super-resolution performance. In this paper, we propose a Twin Feature Fusion Network (TFFN) to address this problem. The method maximizes the use of text-prior information from a pre-trained text recognizer, focusing the network on recovering text-region content. Firstly, text-prior information is extracted with a text recognition network. Next, a twin feature fusion module is constructed: it employs a twin attention mechanism to enable bidirectional interaction between image features and the text prior, and a fusion module then further integrates the context-enhanced image features with the text prior. Finally, sequence features are extracted and the super-resolution image is reconstructed. Experiments on the TextZoom benchmark show that, across its difficulty levels, TFFN improves the recognition accuracy of the ASTER, MORAN, and CRNN text recognizers by 0.22~0.5, 0.6~1.1, and 0.33~1.1 percentage points, respectively.

Key words: super-resolution reconstruction, text images, feature fusion, self-attention mechanism, cross-attention mechanism
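
To make the described interaction concrete, below is a minimal PyTorch sketch of a twin-attention fusion block in the spirit of the abstract: image features and text-prior features attend to each other bidirectionally, and the two context-enhanced streams are then fused. The class name TwinFeatureFusion, the feature dimensions, the mean-pooled broadcast of the text context, and the concat-then-project fusion step are illustrative assumptions, not the published TFFN implementation.

# Illustrative sketch only; names, dimensions, and the fusion rule are
# assumptions made for this example, not the authors' implementation.
import torch
import torch.nn as nn

class TwinFeatureFusion(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Twin attention: image features attend to the text prior, and
        # the text prior attends to the image features (bidirectional).
        self.img_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fuse the two context-enhanced streams (assumed: concat + project).
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, img_feat: torch.Tensor, text_prior: torch.Tensor):
        # img_feat:   (B, H*W, C) flattened image features
        # text_prior: (B, L, C)   per-step prior from a text recognizer
        img_ctx, _ = self.img_to_text(img_feat, text_prior, text_prior)
        text_ctx, _ = self.text_to_img(text_prior, img_feat, img_feat)
        # Broadcast a pooled text context onto every spatial position,
        # then fuse it with the (residually enhanced) image features.
        pooled = text_ctx.mean(dim=1, keepdim=True).expand_as(img_ctx)
        return self.fuse(torch.cat([img_ctx + img_feat, pooled], dim=-1))

# Shape check with toy tensors (B=2, 16x64 image grid, 26-step prior, C=64).
if __name__ == "__main__":
    block = TwinFeatureFusion()
    out = block(torch.randn(2, 1024, 64), torch.randn(2, 26, 64))
    print(out.shape)  # torch.Size([2, 1024, 64])

In the pipeline the abstract describes, the fused features would then feed the sequence-feature extraction stage and the final super-resolution reconstruction head.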
