Computer and Modernization, 2022, Vol. 0, Issue (05): 28-32.

• Algorithm Design and Analysis •

Tibetan-Chinese Bidirectional Machine Translation Based on VOLT

  (1. School of Information Science and Technology, Tibet University, Lhasa 850000, China;
    2. State Key Laboratory of Artificial Intelligence for Tibetan Information Technology in Tibet Autonomous Region, Lhasa 850000, China;
    3. Ministry of Education Engineering Research Center for Tibetan Information Technology, Lhasa 850000, China)
  • Online: 2022-06-08 Published: 2022-06-08
  • About the authors: Sun Yidong (b. 1997), male, from Ma'anshan, Anhui, master's student, research interests: natural language processing, machine translation, E-mail: syd@utibet.edu.cn; corresponding author: Yongcuo (b. 1974), female (Tibetan), professor, Ph.D., research interests: Tibetan natural language processing, artificial intelligence, E-mail: yc@utibet.edu.cn; Yang Dan (b. 1997), female, from Lüliang, Shanxi, master's student, research interests: natural language processing, machine translation, E-mail: yangdan019@utibet.edu.cn.
  • Funding:
    National Key Research and Development Program of China (2017YFB1402200); National Natural Science Foundation of China (61262086); Tibet University High-Level Talent Training Program (2019-GSP-S117); Tibet Autonomous Region Department of Education project "Construction of the National-Level Team and Key Laboratory for Computer and Tibetan Information Technology" (Zang Jiao Cai Zhi [2018] No. 81)

Abstract: Generating the Tibetan-Chinese vocabulary is the first step of Tibetan-Chinese bidirectional machine translation, and it also affects translation quality. This paper improves downstream Tibetan-Chinese bidirectional translation by improving how the vocabulary is generated. First, starting from vocabulary splicing, high-frequency words are taken from an ordinary word-level vocabulary and low-frequency words from a byte pair encoding (BPE) vocabulary, and the best word-frequency threshold is found through repeated training. Second, the Tibetan-Chinese vocabulary is learned with vocabulary learning via optimal transport (VOLT), adapted to the linguistic characteristics of Tibetan, and applied to Tibetan-Chinese bidirectional translation. Experimental results show that the proposed method, which combines BPE with optimal-transport vocabulary learning tailored to Tibetan, performs best, reaching a BLEU score of 37.35 on the Tibetan-to-Chinese task and 27.60 on the Chinese-to-Tibetan task.

Key words: Tibetan vocabulary, byte pair encoding, Tibetan-Chinese bidirectional translation, VOLT
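
As a rough illustration of the vocabulary-splicing idea in the abstract, the Python sketch below keeps words at or above a frequency threshold as whole tokens and segments the remaining words with a toy BPE learner. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`apply_merge`, `learn_bpe_merges`, `build_hybrid_vocab`, `segment`), the `>=` comparison, learning merges only on the low-frequency words, and the invented Latin-script toy counts. The paper's actual Tibetan preprocessing and threshold search are not described in this abstract.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    splits = {w: tuple(w) for w in word_freqs}  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for w, symbols in splits.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += word_freqs[w]
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        splits = {w: apply_merge(s, best) for w, s in splits.items()}
    return merges


def build_hybrid_vocab(word_freqs, threshold, num_merges):
    """Assumed splicing rule: frequent words stay whole, rare words go to BPE."""
    frequent = {w for w, c in word_freqs.items() if c >= threshold}
    rare = {w: c for w, c in word_freqs.items() if c < threshold}
    return frequent, learn_bpe_merges(rare, num_merges)


def segment(word, frequent, merges):
    """Return one token for a frequent word, otherwise its BPE pieces."""
    if word in frequent:
        return [word]
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy counts, not data from the paper.
toy_freqs = {"lower": 10, "low": 8, "newest": 2, "widest": 1}
frequent_words, bpe_merges = build_hybrid_vocab(toy_freqs, threshold=5, num_merges=3)
for w in toy_freqs:
    print(w, segment(w, frequent_words, bpe_merges))
```

In this sketch the threshold is a plain argument. Finding the best value "through repeated training", as the abstract puts it, would mean rebuilding the vocabulary for several candidate thresholds and comparing the BLEU scores of the resulting translation models.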