Implementation of A Parallel Algorithm to Extract N-gram from Text in a Functional Language

DOI:

https://doi.org/10.26577/JMMCS.2020.v107.i3.05

Keywords:

parallel algorithm, functional language, LuNA, N-gram, fragmented programming

Abstract

This paper discusses the implementation of a parallel algorithm for extracting N-grams from semi-structured text in the functional language of the LuNA fragmented programming system. N-gram extraction is a common task in natural language processing (NLP). Other implementations of the parallel algorithm, based on MPJ Express, Apache Spark, and Apache Hadoop, were analyzed. Based on this analysis, the LuNA system is proposed because it can automatically tune the algorithm to a specific computer system: the algorithm is modeled as a set of sequential, information-dependent tasks that are dynamically distributed among processors and processor cores. The paper describes how the algorithm is divided into data fragments and computation fragments, and presents the resulting implementation scheme based on fragmented programming technology. Testing was conducted on varying numbers of processors, extracting word-level N-grams. During token extraction, all stop words from a predefined list kept in a separate text store were removed. The tests showed good efficiency of the proposed approach to implementing algorithms with the LuNA system.
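The fragment-based scheme outlined in the abstract can be illustrated with a minimal sketch. It is written in Python rather than LuNA and is not the authors' implementation: the stop-word list, the function names, the fragment size, and the use of a thread pool as a stand-in for dynamic task distribution are all assumptions made for illustration. Tokens are split into data fragments that overlap by n-1 tokens, so each word-level N-gram is counted in exactly one computation fragment, and the partial counts are then merged.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stop-word list; the paper keeps its list in a separate text store.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stop_words]


def count_ngrams(tokens, n):
    """Count the word-level N-grams in one sequence of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def split_into_fragments(tokens, n, fragment_size):
    """Split tokens into data fragments that overlap by n - 1 tokens.

    Every N-gram start position then belongs to exactly one fragment,
    so no N-gram is lost or counted twice at a fragment boundary.
    """
    if fragment_size < 1:
        raise ValueError("fragment_size must be at least 1")
    return [
        tokens[start:start + fragment_size + n - 1]
        for start in range(0, len(tokens), fragment_size)
    ]


def extract_ngrams_parallel(text, n=2, fragment_size=1000, workers=4):
    """Count N-grams per fragment in a worker pool and merge the results."""
    tokens = tokenize(text)
    fragments = split_into_fragments(tokens, n, fragment_size)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each fragment is an independent task (assumed analogue of a
        # computation fragment); the pool hands tasks to idle workers.
        for partial in pool.map(lambda frag: count_ngrams(frag, n), fragments):
            total.update(partial)
    return total
```

For example, `extract_ngrams_parallel("The cat sat on the mat", n=2, fragment_size=2)` tokenizes the text to `cat sat mat` and counts the bigrams `(cat, sat)` and `(sat, mat)` once each. In this sketch tokenization runs sequentially and only the counting step is distributed; a real system could also fragment the input text itself.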

Published

2020-09-30