News Classification using Apache Lucene

Authors

  • S. S. Aubakirov, Al-Farabi Kazakh National University
  • D. Zh. Akhmed-Zaki, Al-Farabi Kazakh National University
  • P. S. Trigo, Instituto Superior de Engenharia de Lisboa; Biosystems and Integrative Sciences Institute, Agent and Systems Modeling

Keywords:

binary classification, learning algorithms, Apache Lucene

Abstract

In this paper, we describe the binary classification of text messages using Apache Lucene. We first describe the news classification problem and the method used to gather the dataset samples for training and testing the models. We then focus on building and training classifiers on top of different indexes, and on the metrics used to evaluate the models. We choose three main attributes that affect Apache Lucene indexes and, as a consequence, the accuracy of the classifiers. The first attribute is the set of terms on which the index is built: we use N-grams instead of words, with N varying from one to five. The second attribute is the text preprocessing method, such as stemming and stop-word filtering. The third attribute is the classification method; in this research we choose the k-nearest neighbors and naive Bayes methods. Varying these attributes produces different index types. As a result, we provide an analysis of the correlation between index type and classifier accuracy.
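
To make the notion of an index type concrete, the following minimal Python sketch shows how terms could be generated for N from one to five and how the three attributes combine into configurations. It is an illustration under stated assumptions, not the authors' implementation: it assumes word-level N-grams (the abstract does not say whether N-grams are built over words or characters), uses a small hypothetical stop-word list, omits stemming itself, and assumes four preprocessing variants. The names `word_ngrams`, `index_configurations`, and `STOP_WORDS` are introduced here for illustration only and are not part of Apache Lucene.

```python
from itertools import product

# Hypothetical stop-word list for illustration; the paper's list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}


def word_ngrams(text, n, remove_stop_words=False):
    """Return the word-level n-grams of `text` as space-joined strings.

    Assumption: n-grams are built over whitespace-separated, lowercased words.
    """
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # Slide a window of n tokens over the sequence; empty if too few tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_configurations():
    """Enumerate (n, preprocessing, classifier) combinations.

    Assumption: the preprocessing variants listed here are illustrative;
    the abstract names stemming and stop-word filtering only as examples.
    """
    ns = range(1, 6)  # N varies from one to five
    preprocessing = ["none", "stemming", "stop_words", "stemming+stop_words"]
    classifiers = ["knn", "naive_bayes"]
    return list(product(ns, preprocessing, classifiers))


if __name__ == "__main__":
    sample = "The price of oil rose sharply in the morning"
    for n in (1, 2, 3):
        print(n, word_ngrams(sample, n, remove_stop_words=True))
    print(len(index_configurations()), "configurations in this sketch")
```

Each tuple returned by `index_configurations` corresponds to one index-and-classifier pairing whose accuracy could be measured and compared; the number of combinations in this sketch follows from the assumed preprocessing variants and is not a figure reported by the paper.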

Published

2017-11-20