Computer and Modernization (计算机与现代化)

• Artificial Intelligence

Learning Semantically Related Words in Software Through Word Embedding

  1. School of Software, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2016-12-01 Online: 2017-09-20 Published: 2017-09-19
  • About the author: Hu Wangsheng (胡望胜, born 1993), male, from Miluo, Hunan; master's student at the School of Software, Shanghai Jiao Tong University; research interests: program analysis and testing.
  • Funding:
    National Natural Science Foundation of China (61572312, 61572313); Science and Technology Commission of Shanghai Municipality research project (15DZ1100305)

Abstract: Code search is a frequent task in software development and maintenance. Keyword-based code search suffers from the same problem as traditional information retrieval: the keywords in a user's query often do not match the words used in the code. To improve search accuracy, semantically related words in software need to be mined and used for query expansion. This paper designs a Word Embedding-based method for mining semantically related words in the software engineering domain and, using documents from the IT question-and-answer site Stack Overflow as the training corpus, obtains a table of semantically related words covering 19332 words. Comparative experiments against prior work show that the semantically related words mined by this method effectively improve code search accuracy.

Key words: code search, query expansion, semantically related words
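
The query-expansion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the three-dimensional vectors below are invented toy values standing in for embeddings trained on a real corpus, and the names `TOY_VECTORS`, `related_words`, `expand_query`, and the `top_k` and `min_sim` parameters are hypothetical. The sketch assumes that related words are chosen by cosine similarity between word vectors, which is a common choice but is not confirmed by the abstract.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings would be learned from a corpus such as Stack Overflow posts.
TOY_VECTORS = {
    "delete": [0.9, 0.1, 0.0],
    "remove": [0.85, 0.15, 0.05],
    "erase": [0.8, 0.2, 0.1],
    "file": [0.1, 0.9, 0.1],
    "document": [0.15, 0.85, 0.2],
    "thread": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_words(word, vectors, top_k=2, min_sim=0.9):
    """Return up to top_k words whose similarity to `word` is at least min_sim."""
    if word not in vectors:
        return []  # out-of-vocabulary words have no related words
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored = [(w, s) for w, s in scored if s >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:top_k]]


def expand_query(query, vectors, top_k=2, min_sim=0.9):
    """Append related words after each query term, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + related_words(term, vectors, top_k, min_sim):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("delete file", TOY_VECTORS))
# → ['delete', 'remove', 'erase', 'file', 'document']
```

With these toy vectors, a search for "delete file" would also match code that uses "remove", "erase", or "document", which is the vocabulary-mismatch problem that query expansion is meant to reduce.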

CLC number: