NAMED ENTITY RECOGNITION FOR THE KAZAKH LANGUAGE

Abstract

Named Entity Recognition (NER) is one of the important tasks of natural language processing (NLP). It identifies mentions of real-world objects in a sentence, such as geographical locations, personal names, and organizations. Approaches to the problem include manually created grammar rules, statistical models such as machine learning, and hybrid methods. The aim of this work is to experiment with statistical and machine learning methods and to check how they cope with the agglutinative Kazakh language. This paper presents named entity recognition based on a conditional random field (CRF), a statistical machine learning method. We also use a hybrid approach that combines a bidirectional long short-term memory (LSTM) neural network with a CRF model, which is a modern approach to named entity recognition. The CRF model selected by cross-validated randomized search achieves an F1 score of 0.95. The hybrid LSTM-CRF model achieves an F1 score of 0.88; this result is good, and unlike the CRF model it does not require any task-specific feature design. For the experiments, a corpus (kazNER) was created for the NER task with labels for person names, locations, organizations, and others. The corpus consists of 29,629 sentences that contain at least one proper noun and carry only part-of-speech tags.
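To make the contrast between the two models more concrete, the following is a minimal sketch of the kind of hand-designed token features a CRF tagger typically consumes. It is not the feature set used in the paper: the function names, the feature names, the POS tag labels, and the example sentence are all illustrative assumptions. The word-final character n-grams are included only as a cheap proxy for the suffixes that an agglutinative language attaches to stems.

```python
from typing import Dict, List, Tuple

# (word, POS tag). The tag names used below are illustrative, not the
# tag set of the kazNER corpus.
Token = Tuple[str, str]


def token_features(sentence: List[Token], i: int) -> Dict[str, object]:
    """Build a feature dict for the i-th token of a POS-tagged sentence."""
    word, pos = sentence[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "pos": pos,
        # Word-final character n-grams: a rough stand-in for case and
        # possessive suffixes stacked on the stem.
        "suffix2": word[-2:].lower(),
        "suffix3": word[-3:].lower(),
        "suffix4": word[-4:].lower(),
    }
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features["prev_lower"] = prev_word.lower()
        features["prev_is_title"] = prev_word.istitle()
        features["prev_pos"] = prev_pos
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features["next_lower"] = next_word.lower()
        features["next_is_title"] = next_word.istitle()
        features["next_pos"] = next_pos
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(sentence: List[Token]) -> List[Dict[str, object]]:
    """Feature dicts for every token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical example sentence, not taken from the corpus.
example: List[Token] = [("Абай", "NOUN"), ("Алматыда", "NOUN"), ("тұрады", "VERB")]
feats = sentence_features(example)
```

Lists of feature dicts in this shape are the usual input to CRF toolkits, and the hyperparameters of such a model could then be tuned with a cross-validated randomized search. A BiLSTM-CRF model, by contrast, learns its token representations from the data, which is the sense in which it needs no comparable feature design.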
Published
2020-09-30
How to Cite
KOZHIRBAYEV, Z. M.; YESSENBAYEV, Z. A. NAMED ENTITY RECOGNITION FOR THE KAZAKH LANGUAGE. Journal of Mathematics, Mechanics and Computer Science, [S.l.], v. 107, n. 3, p. 57-66, sep. 2020. ISSN 2617-4871. Available at: <https://bm.kaznu.kz/index.php/kaznu/article/view/784>. Date accessed: 22 oct. 2020.