Creating the dataset of keywords for detecting an extremist orientation in web-resources in the Kazakh language

Authors

  • M. A. Bolatbek, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
  • Sh. Zh. Mussiraliyeva, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
  • U. A. Tukeyev, al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan

DOI:

https://doi.org/10.26577/jmmcs-2018-1-492

Keywords:

emotional evaluations

Abstract

This paper is part of a research effort to build semantic analysis models that detect an extremist
orientation in the text of web resources. To solve this task, a model was created that consists of
five stages: identifying websites of extremist groups, preparing for data extraction, data extraction,
data analysis, and classification. This work presents the results of the data analysis stage of that
model. The purpose of the study is to identify keywords often used by extremists, which will later
be used to classify texts into "extremist" and "neutral" categories using machine learning methods.
No such database exists for the Kazakh language. As a result of this study, an experimental corpus
and a list of keywords in the Kazakh language were created. The keywords were added to the
database together with their morphological variants. A program was built that checks a given input
text for the presence of extremist keywords and displays the words found.
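
The keyword check described above can be illustrated with a minimal sketch. It is not the authors' program: the table name keyword_forms, the helper functions, and the two sample entries are hypothetical and chosen only to show how storing each morphological variant as its own row lets a simple token lookup find inflected forms. The sample words are illustrative placeholders, not items from the paper's keyword list.

```python
import re
import sqlite3


def build_keyword_db(entries):
    """Create an in-memory SQLite table of (lemma, form) pairs.

    The schema is an assumption made for illustration: every
    morphological variant is stored as a separate row.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keyword_forms (lemma TEXT, form TEXT)")
    conn.executemany("INSERT INTO keyword_forms VALUES (?, ?)", entries)
    return conn


def find_keywords(conn, text):
    """Return (form, lemma) pairs for stored forms that occur in text.

    Matching is exact on lowercased word tokens; results keep the
    order of first appearance and contain no duplicates.
    """
    rows = conn.execute("SELECT lemma, form FROM keyword_forms")
    lemma_by_form = {form: lemma for lemma, form in rows}
    found = []
    for token in re.findall(r"\w+", text.lower()):
        pair = (token, lemma_by_form.get(token))
        if pair[1] is not None and pair not in found:
            found.append(pair)
    return found


# Hypothetical entries: a lemma followed by a few of its case forms.
SAMPLE_ENTRIES = [
    ("соғыс", "соғыс"), ("соғыс", "соғысқа"), ("соғыс", "соғыста"),
    ("қару", "қару"), ("қару", "қаруды"), ("қару", "қаруға"),
]

if __name__ == "__main__":
    db = build_keyword_db(SAMPLE_ENTRIES)
    for form, lemma in find_keywords(db, "Олар соғысқа қаруды алып барды."):
        print(f"{form} (lemma: {lemma})")
```

Exact lookup against pre-generated variants avoids the need for a Kazakh stemmer at query time, at the cost of a larger table; whether the original program works this way is not stated in the abstract.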

How to Cite

Bolatbek, M. A., Mussiraliyeva, S. Z., & Tukeyev, U. A. (2018). Creating the dataset of keywords for detecting an extremist orientation in web-resources in the Kazakh language. Journal of Mathematics, Mechanics and Computer Science, 97(1), 134–142. https://doi.org/10.26577/jmmcs-2018-1-492