Computer and Modernization ›› 2017, Vol. 0 ›› Issue (10): 72-75. doi: 10.3969/j.issn.1006-2475.2017.10.015


Web Text Classification and Prediction Based on Thesaurus Match

  

  1. School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Received: 2017-02-23 Online: 2017-10-30 Published: 2017-10-31

Abstract: To classify Chinese text accurately, a classification method based on thesaurus matching is proposed. The vector space model is used to represent the features of the test set, and principal component analysis based on term frequency-inverse document frequency (TF-IDF) is used to weight the feature terms in the corpus, from which index thesauri for 47 industries are screened and built. The industry category of each text is then determined by cosine similarity, and an autoregressive integrated moving average (ARIMA) model is established to forecast the development trend over the next 10 days. Experimental results show that the average classification performance F of the index thesauri is 85.6% and the average relative error of the prediction model is 3.41%, which indicates that the classification method is effective.

Key words: text classification, vector space model, principal component analysis, cosine similarity, ARIMA
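
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-industry thesauri, the sample tokens, the smoothed IDF formula, and the function names `idf_table`, `weight`, `cosine`, and `classify` are all assumptions made for illustration. The principal component analysis and ARIMA steps are omitted, and the input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def weight(tokens, idf):
    """TF-IDF vector as a {term: weight} dict; terms unknown to idf are dropped."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(tokens, thesauri, idf):
    """Return the best-matching industry label and all similarity scores."""
    doc_vec = weight(tokens, idf)
    scores = {
        label: cosine(doc_vec, weight(terms, idf))
        for label, terms in thesauri.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical two-industry thesauri; the paper builds 47 of them.
thesauri = {
    "automotive": ["engine", "vehicle", "battery"],
    "finance": ["bank", "loan", "interest"],
}
idf = idf_table(list(thesauri.values()))

# Hypothetical, already-segmented web text.
label, scores = classify(["bank", "raises", "loan", "interest", "rate"], thesauri, idf)
print(label, scores)
```

Storing vectors as dictionaries keeps the sketch dependency-free and sparse; with 47 thesauri and a large vocabulary, a matrix-based representation would likely be a better fit.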