News Classification using Apache Lucene
Keywords:
binary classification, learning algorithms, Apache Lucene
Abstract
In this paper, we describe the binary classification of text messages using Apache Lucene. We first describe the “news classification” problem and the method used to gather the dataset samples for training and testing the models. We then focus on creating and training classifiers based on different indexes and on the metrics used to evaluate the models. We choose three main attributes that affect Apache Lucene indexes and, as a consequence, the accuracy of the classifiers. The first attribute is the unit of text on which the index is built: instead of single words, we use N-grams, with N varying from one to five. The second attribute is the text preprocessing method, such as stemming and stop-word filtering. The third attribute is the classification method; in this research we choose the k-nearest neighbors and naive Bayes methods. Varying these attributes produces various index types. As a result, we provide an analysis of the correlation between index type and classifier accuracy.
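To make the three attributes concrete, the following is a minimal sketch in plain Python, not the Lucene-based implementation used in the paper. It assumes word-level N-grams (the abstract does not say whether N-grams are built over words or characters), and the stop-word list, the four preprocessing options, and the names `word_ngrams` and `index_configurations` are hypothetical, chosen only for illustration.

```python
from itertools import product

# Illustrative stop-word list only; the paper's actual list is not given.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "is"})


def word_ngrams(text, n, stop_words=frozenset()):
    """Return the word-level n-grams of `text` after optional stop-word filtering."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    # Slide a window of n tokens over the filtered token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_configurations():
    """Enumerate combinations of the three attributes described in the abstract."""
    ngram_sizes = range(1, 6)  # N varies from one to five
    # Assumed preprocessing variants; the paper names stemming and stop words.
    preprocessing = ("none", "stemming", "stop_words", "stemming+stop_words")
    classifiers = ("knn", "naive_bayes")
    return list(product(ngram_sizes, preprocessing, classifiers))


if __name__ == "__main__":
    print(word_ngrams("the price of oil is rising", 2, STOP_WORDS))
    print(len(index_configurations()))
```

Under these assumptions, each tuple returned by `index_configurations` corresponds to one candidate index type whose classifier accuracy could then be compared against the others.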