Computer and Modernization ›› 2024, Vol. 0 ›› Issue (06): 19-24. doi: 10.3969/j.issn.1006-2475.2024.06.004


Data Filtering Strategies for Tibetan-Chinese Neural Machine Translation

  


  1. College of Information Science and Technology, Tibet University, Lhasa 850000, China;
  2. Key Laboratory of Tibetan Information Technology and Artificial Intelligence of Tibet Autonomous Region, Lhasa 850000, China;
  3. Engineering Research Center of Tibetan Information Technology, Ministry of Education, Lhasa 850000, China
  • Online: 2024-06-30 Published: 2024-07-17

Abstract: Traditional data augmentation methods introduce syntactic and semantic losses in Tibetan-Chinese machine translation. To address this issue, this paper proposes a pseudo-data filtering method that combines sentence perplexity with semantic similarity, built on top of traditional data augmentation methods. This strategy addresses the inadequate quality and scarcity of parallel data, particularly in low-resource settings. The results show that the pseudo-data filtering approach significantly improves bidirectional translation on both Tibetan-Chinese and English-Chinese tasks. The method reduces grammatical and semantic defects in the translation model, thereby improving the performance of the translation system and the generalization ability of the model.
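The filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two scores are computed or combined, so everything concrete here is an assumption: per-token log-probabilities are taken as given (from some language model scoring the synthetic sentence), sentence embeddings are taken as given (from some cross-lingual encoder), and the two scores are combined by simple thresholds. The function names, field names, and the threshold values `max_ppl` and `min_sim` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Assumes the log-probabilities were produced elsewhere by a language model.
    """
    if not token_logprobs:
        return float("inf")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pseudo_pairs(pairs, max_ppl=50.0, min_sim=0.7):
    """Keep pseudo-parallel pairs that are both fluent and semantically close.

    Each pair is a dict with hypothetical keys:
      "logprobs": token log-probabilities of the synthetic sentence
      "src_vec", "tgt_vec": sentence embeddings of the two sides
    The thresholds are illustrative, not values from the paper.
    """
    kept = []
    for pair in pairs:
        ppl = perplexity(pair["logprobs"])
        sim = cosine_similarity(pair["src_vec"], pair["tgt_vec"])
        if ppl <= max_ppl and sim >= min_sim:
            kept.append(pair)
    return kept
```

In this sketch a pair is discarded if either score fails its threshold; a weighted combination of the two scores would be an equally plausible reading of "combining", and the thresholds would need tuning on held-out data.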

Key words: back translation, data selection, Tibetan-Chinese neural machine translation, perplexity, semantic similarity
