Creating the dataset of keywords for detecting an extremist orientation in web-resources in the Kazakh language


  • M. A. Bolatbek, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
  • Sh. Zh. Mussiraliyeva, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
  • U. A. Tukeyev, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan

This paper is part of ongoing research on building semantic analysis models that detect an extremist
orientation in the text of web resources. To solve this task, a model was created that consists of five
stages: identifying websites of extremist groups, preparing for data extraction, data extraction,
data analysis, and classification. This work presents the results of the data analysis stage of the
above-mentioned model. The purpose of this study is to identify keywords often used by extremists,
which will later be used to classify texts into "extremist" and "neutral" categories using machine
learning methods. No such database exists for the Kazakh language. As a result of this study,
an experimental corpus and a list of keywords in the Kazakh language were created. The keywords were
added to the database together with their morphological variants. A program was also built that checks
for the presence of extremist keywords in a given input text and displays the words found.
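
The keyword check described above can be illustrated with a minimal sketch. It is not the authors' program: the dictionary `KEYWORD_VARIANTS`, the function `find_keywords`, and the regular-expression tokenizer are all assumptions made for illustration, and the entries are neutral placeholder words ("кітап", book; "қала", city) standing in for the real keyword database. The sketch only shows the general idea of storing each keyword with its morphological variants and reporting every variant found in an input text.

```python
import re

# Hypothetical stand-in for the keyword database: each base form maps to
# its stored morphological variants. Neutral placeholder words are used.
KEYWORD_VARIANTS = {
    "кітап": {"кітап", "кітаптар", "кітапты", "кітапқа"},
    "қала": {"қала", "қалалар", "қаланы", "қалаға"},
}

# Reverse index: surface form -> base form, for constant-time lookup.
VARIANT_TO_BASE = {
    variant: base
    for base, variants in KEYWORD_VARIANTS.items()
    for variant in variants
}

# \w matches Unicode letters in Python 3, including Kazakh Cyrillic.
TOKEN_RE = re.compile(r"\w+")


def find_keywords(text):
    """Return (surface form, base form) pairs for each keyword variant in text."""
    found = []
    for token in TOKEN_RE.findall(text.lower()):
        base = VARIANT_TO_BASE.get(token)
        if base is not None:
            found.append((token, base))
    return found


if __name__ == "__main__":
    sample = "Мен қалаға бардым, кітаптар сатып алдым."
    for surface, base in find_keywords(sample):
        print(f"{surface} -> {base}")
```

Listing the variants explicitly mirrors the statement that keywords are stored with their morphological variants; a morphological analyzer or stemmer could replace the explicit lists, but that would be a different design from the one the abstract describes.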


