Automatic document summarization based on statistical information

Authors

  • A. Mussina, al-Farabi Kazakh National University
  • S. Aubakirov, al-Farabi Kazakh National University
  • D. Ahmed-Zaki, al-Farabi Kazakh National University
  • P. Trigo, Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal

Keywords:

summarization, automatic extraction, keywords, N-gram, TextRank

Abstract

A pressing problem today is how to efficiently process the large amount of data we encounter
every day. The object of study of this paper is automatic summarization algorithms.
The main goal is to implement and compare different summarization techniques
on corpora of news articles parsed from the web. The paper describes three summarization
techniques based on the TextRank algorithm: general TextRank, BM25, and
LongestCommonSubstring. The corpora are in Russian and Kazakh. The results of
summarization and their comparison are provided. The algorithms used are well known,
but the way they are evaluated on the given corpora differs from the methods usually
used in summary evaluation. The proposed evaluation method uses a special dictionary of
keywords extracted on the topic of the corpora. As the title implies, the article relies on
statistical information; the semantic and syntactic features of the text are not examined.
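
To illustrate the kind of graph-based ranking the abstract refers to, here is a minimal sketch of extractive summarization with TextRank. It is not the authors' implementation: the function names (tokenize, overlap_similarity, textrank, summarize), the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration. The similarity used here is the word-overlap measure from the original TextRank formulation; the BM25 and longest-common-substring variants named above would presumably replace only the similarity function.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", sentence.lower())


def overlap_similarity(a, b):
    """Word-overlap similarity: shared words divided by the sum of log lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, similarity=overlap_similarity, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the sentence similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    # Symmetric weight matrix; a sentence is not linked to itself.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2, similarity=overlap_similarity):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences, similarity=similarity)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Passing a different function as the similarity argument is enough to try another variant, since the ranking step itself depends only on the resulting weight matrix.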

Published

2019-01-14