QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

A. B. Nugumanova; K. S. Apayev; Y. M. Baiburin; M. Mansurova; A. G. Ospan

doi:10.26577/JMMCS.2022.v114.i2.08

Authors

A. B. Nugumanova Sarsen Amanzholov East Kazakhstan State University http://orcid.org/0000-0001-5522-4421
K. S. Apayev D. Serikbayev East Kazakhstan Technical University http://orcid.org/0000-0001-9292-4785
Y. M. Baiburin
M. Mansurova Al-Farabi Kazakh National University http://orcid.org/0000-0002-9680-2758
A. G. Ospan Al-Farabi Kazakh National University http://orcid.org/0000-0002-1860-6997

Keywords:

web-tables, table extraction, table recognition, table understanding, knowledge base population

Abstract

In this paper, we propose a pipeline for automatically extracting tables from heterogeneous
Web sources such as HTML pages, PDF files and images. Table extraction is an actively
developing area of Information Extraction, for which many applications, libraries and frameworks
are currently being developed. Nevertheless, most of these tools focus on a single specific
task, for example, only on recognizing tables presented as images. We propose combining these
tasks into a single pipeline that supports the full cycle of table processing, starting with
the stages of search, recognition and extraction and ending with the stages of semantic
analysis and interpretation, i.e. table understanding. The ultimate goal of our design is to
understand tables and to populate knowledge bases (knowledge graphs) with the meaningful
information these tables contain. The first part of the work presents methods for detecting
tables in web pages and PDF documents, as well as for automatically detecting the attributes
and values of objects. The second part presents the architecture and structure of the Qurma
tool. The results demonstrate an implementation of the parser for the Almaty-Ust-Kamenogorsk
flight search topic.
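
To make the HTML stage of such a pipeline concrete, the following is a minimal sketch of table detection followed by attribute and value detection. It is not the Qurma implementation: the names TableExtractor, rows_to_records, extract_records and SAMPLE_HTML are hypothetical, the sample rows are placeholder data, and the sketch assumes flat (non-nested) tables whose first row holds the attribute names. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> on a page as a list of rows (lists of cell strings).

    Assumption: tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # rows of the table currently being read
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def rows_to_records(table):
    """Treat the first row as attribute names and each later row as values."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def extract_records(html):
    """Return one list of attribute-value records per table found in the HTML."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [rows_to_records(table) for table in parser.tables]


# Placeholder data for illustration only; not taken from the paper.
SAMPLE_HTML = """
<table>
  <tr><th>Flight</th><th>Departure</th><th>Price</th></tr>
  <tr><td>XX 101</td><td>08:00</td><td>100</td></tr>
  <tr><td>XX 202</td><td>17:30</td><td>120</td></tr>
</table>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE_HTML)[0]:
        print(record)
```

Records of this shape (attribute name mapped to value) are one plausible input for a later knowledge base population step; real pages would additionally need handling for merged cells, nested tables and tables without header rows.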