Named entity recognition for the Kazakh language

Authors

DOI:

https://doi.org/10.26577/JMMCS.2020.v107.i3.06

Keywords:

named entity recognition, conditional random field, long short-term memory, word embeddings

Abstract

Named entity recognition (NER) is one of the core tasks of natural language processing (NLP). It identifies mentions of real-world objects in a sentence, such as geographical locations, personal names, and organizations. Existing approaches to the problem rely on manually created grammar rules, on statistical models such as machine learning, or on hybrid methods. The aim of this work is to experiment with statistical and machine learning methods and to examine how they handle Kazakh, an agglutinative language. This paper presents named entity recognition based on a conditional random field (CRF), a statistical machine learning method. We also use a hybrid approach that combines a bidirectional long short-term memory (LSTM) neural network with a CRF model, a modern approach to named entity recognition. The CRF model tuned with cross-validated randomized search achieves an F1 score of 0.95. The hybrid LSTM-CRF model achieves an F1 score of 0.88. This result is competitive, and unlike the CRF model, the hybrid model does not require task-specific feature design.
For the experiments, a corpus (kazNER) was created for the NER task, with labels for personal names, locations, organizations, and other entities. The corpus consists of 29,629 sentences, each of which contains at least one proper noun and carries only part-of-speech tags.
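To make the contrast between the two models concrete, the following is a minimal sketch of the kind of hand-designed token features a CRF tagger typically relies on. It is not the feature set used in the paper: the function names, the feature names, the suffix and prefix lengths, and the three-word example sentence with its labels are all assumptions chosen for illustration. Suffix features are included because, in an agglutinative language, case endings attached to a proper noun can carry useful signal.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i.

    All feature names here are hypothetical, not taken from the paper.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),   # capitalization often marks proper nouns
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "suffix2": word[-2:],         # short suffixes approximate case endings
        "suffix3": word[-3:],
        "prefix4": word[:4],          # a crude stand-in for the word stem
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
        feats["prev_is_title"] = tokens[i - 1].istitle()
    else:
        feats["BOS"] = True           # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
        feats["next_is_title"] = tokens[i + 1].istitle()
    else:
        feats["EOS"] = True           # end of sentence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Extract features for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical example sentence with assumed BIO labels (not from kazNER).
tokens = ["Абай", "Семейде", "туған"]
labels = ["B-PER", "B-LOC", "O"]
features = sentence_features(tokens)
```

Per-token dictionaries of this shape, paired with one label per token, are the input format accepted by common CRF toolkits such as sklearn-crfsuite; the abstract does not say which toolkit was used. A bidirectional LSTM-CRF instead learns its token representations from word embeddings, which is why it needs no feature function of this kind.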

Published

2020-09-30