Development of an error correction algorithm for the Kazakh language
DOI: https://doi.org/10.26577/JMMCS2024-v123-i3-8

Keywords: Kazakh language, dataset, monolingual datasets, Leipzig Corpora Collection, Levenshtein distance, symspellpy, multi-domain bilingual Kazakh dataset

Abstract
This article presents a method for correcting spelling errors in the Kazakh language that combines the advantages of morphological analysis with a noisy-channel model. To this end, current problems in the automatic processing of Kazakh textual information were analyzed, existing linguistic resources and processing systems for the Kazakh language were systematized, the basic requirements for a machine-learning-based system for analyzing Kazakh text were determined, and models and algorithms for extracting facts from unstructured and weakly structured text collections were developed.

The work employs an enhanced spelling-correction lookup function that recommends the correct spelling of the input text. Its behavior is easily adjusted: the maximum edit distance, whether to return the original word when no close match is found, case handling, and the exclusion of tokens matched by a regular expression are all configurable. This adaptability lets the algorithm serve a wide range of tasks, from simple spell checking in user interfaces to complex natural language processing pipelines. By design, the lookup function finds candidate corrections and verifies the context of words while honoring user preferences such as the verbosity level and ignore markers.

Most modern multilingual natural language processing systems use only the graphemic stage of text processing, whereas semantic analysis, that is, analysis of the meaning of natural language, remains an open problem in the theory of artificial intelligence and computational linguistics. Processing the grammar and semantics of multilingual information, however, requires pre-built semantic and grammatical corpora for each natural language. To address this problem, several tasks were formulated and solved.
These tasks included analyzing research on machine learning methods for processing textual information, examining the existing problems of formalizing and modeling the Kazakh language, and developing and implementing models, methods, and algorithms for the morphological and semantic analysis of Kazakh-language texts.
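The adjustable lookup described in the abstract (maximum edit distance, returning the unknown word when no close match exists, candidates ranked by frequency) can be sketched in plain Python. This is a minimal illustration of the idea, not the paper's implementation: the symspellpy library achieves the same result far faster via precomputed deletes, and the toy Kazakh dictionary entries and frequency counts below are invented for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lookup(word, dictionary, max_edit_distance=2,
           include_unknown=False, ignore_case=True):
    """Return corrections within max_edit_distance, ranked by
    (edit distance, then descending corpus frequency)."""
    term = word.lower() if ignore_case else word
    candidates = []
    for entry, freq in dictionary.items():
        dist = levenshtein(term, entry)
        if dist <= max_edit_distance:
            candidates.append((dist, -freq, entry))
    if not candidates:
        # Mirror symspellpy's include_unknown switch: echo the input back
        # when no near match is found, instead of returning nothing.
        return [word] if include_unknown else []
    candidates.sort()
    return [entry for _, _, entry in candidates]

# Toy frequency dictionary with illustrative Kazakh words and counts.
freq_dict = {"мектеп": 120, "мектептер": 40, "кітап": 95}
print(lookup("мектап", freq_dict))  # → ['мектеп']
```

A real system would load the frequency dictionary from a corpus (such as the Leipzig Corpora Collection mentioned in the keywords) and add the regular-expression exclusion and verbosity options; this sketch only shows the ranking logic of the noisy-channel lookup.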