Computer Science & Electrical

Text Classification using KNN with different Feature Selection Methods

Pages: 8, Volume: 9, Issue: 1, July 2018
Received: 02 Aug 2018, Published: 07 August 2018

Authors

# Author Name
1 Rajshree Jodha
2 Gaur Sanjay B.C
3 K.R Chowdhary

Abstract

This paper presents a fast and efficient approach to text classification using the k-nearest neighbors (KNN) classifier with different feature selection methods. Specifically, the approach evaluates system performance against the minimum number of features required to classify text documents. The 20 Newsgroups dataset, collected by Ken Lang, is used to evaluate the KNN classifier. The dataset is split into two parts: a training set (60%) and a test set (40%).
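The 60/40 split and the KNN classification step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`split_60_40`, `vectorize`, `cosine`, `knn_predict`), the cosine similarity measure, the raw term-frequency weighting, and the value k = 3 are all assumptions, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def split_60_40(docs, seed=0):
    """Shuffle (text, label) pairs and split them 60% train / 40% test."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed weighting scheme)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Label `text` by majority vote among its k most similar training docs."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda d: cosine(query, vectorize(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With a labeled corpus `docs` of `(text, label)` pairs, `train, test = split_60_40(docs)` yields the two partitions, and `knn_predict(train, text)` classifies each test document. A fixed seed keeps the split reproducible across runs.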

The KNN classifier is evaluated with varying numbers of stemmed and unstemmed features selected by CHI (chi-squared statistic), IG (information gain), and MI (mutual information). Accuracy, precision, recall, and F1-score are used to measure system performance.
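The three feature selection scores and the evaluation measures named above can be sketched from document counts. This is a hedged illustration under stated assumptions: the abstract gives no formulas, so the code uses the common textbook definitions over a 2x2 term/class contingency table, where `a` is the number of documents in the class that contain the term, `b` the number outside the class that contain it, `c` the number in the class without it, and `d` the number outside the class without it. The function names and the binary (class versus rest) form of information gain are assumptions.

```python
import math


def chi_squared(a, b, c, d):
    """CHI: N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """MI (pointwise): log(A*N / ((A+C)(A+B))); -inf if term never in class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


def information_gain(a, b, c, d):
    """IG for a class-vs-rest split: sum of P(t,c) * log(P(t,c) / (P(t)P(c)))."""
    n = a + b + c + d
    total = 0.0
    for joint, term_total, class_total in (
        (a, a + b, a + c),  # term present, in class
        (b, a + b, b + d),  # term present, not in class
        (c, c + d, a + c),  # term absent, in class
        (d, c + d, b + d),  # term absent, not in class
    ):
        if joint:  # a zero cell contributes nothing to the sum
            total += (joint / n) * math.log(joint * n / (term_total * class_total))
    return total


def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Ranking all terms by one of these scores and keeping the top-scoring ones gives a reduced feature set of a chosen size, which is the quantity the evaluation varies.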

Keywords

  • KNN
  • Text Classification
  • Feature Extraction
  • Stemmed Data