Dictionary extraction based on statistical data

Authors

  • A. Mussina, Al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan
  • S. Aubakirov, Al-Farabi Kazakh National University, Almaty, Republic of Kazakhstan

Keywords:

automatic extraction, keywords, N-gram

Abstract

Automatic text summarization is a pressing problem when working with large amounts of information. Most statistical algorithms build a summary by computing the similarity between text units and the importance of each unit. A text unit can be a word, a sentence, or a paragraph; in our case the unit is a sentence. Similarity is measured by the presence of keywords in the sentences, where keywords are words that indicate the topic of the text. In this work we describe the automatic extraction of a keyword dictionary in which keywords are N-grams with N from 1 to 5. Two algorithms were implemented: one extracts words that occur in only one of two different corpora, and the other extracts words with high importance. The importance of an N-gram denotes how strongly it belongs to the topic of the text. The texts used are in Russian and Kazakh. Both algorithms produce meaningful results, and both are useful for constructing a complete keyword dictionary.
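
The two extraction steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: whitespace tokenization, lowercasing, the function names, and in particular the importance score (a ratio of add-one smoothed relative frequencies between a topic corpus and a reference corpus) are all assumptions made for illustration, since the abstract does not define how importance is computed.

```python
from collections import Counter


def ngrams(tokens, max_n=5):
    """Yield every N-gram (as a tuple of tokens) for N from 1 to max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def ngram_counts(sentences, max_n=5):
    """Count N-grams over sentences (assumed: lowercase, split on whitespace)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), max_n))
    return counts


def exclusive_ngrams(target, reference):
    """Keep N-grams that occur in the target corpus but never in the reference."""
    return {gram: count for gram, count in target.items() if gram not in reference}


def importance(target, reference):
    """Hypothetical score: smoothed relative frequency in target vs. reference."""
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    scores = {}
    for gram, count in target.items():
        target_freq = (count + 1) / (target_total + 1)
        reference_freq = (reference.get(gram, 0) + 1) / (reference_total + 1)
        scores[gram] = target_freq / reference_freq
    return scores
```

Calling `ngram_counts` on a topic corpus and on a reference corpus, then passing both results to `exclusive_ngrams` or `importance`, gives two candidate keyword lists that could be merged into one dictionary. Real Russian or Kazakh text would likely need a proper tokenizer and morphological normalization, which this sketch omits.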

Published

2018-07-16