QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

A. B. Nugumanova; K. S. Apayev; Y. M. Baiburin; M. Mansurova; A. G. Ospan

doi:10.26577/JMMCS.2022.v114.i2.08

Authors

A. B. Nugumanova Sarsen Amanzholov East Kazakhstan University http://orcid.org/0000-0001-5522-4421
K. S. Apayev D. Serikbayev East Kazakhstan Technical University http://orcid.org/0000-0001-9292-4785
Y. M. Baiburin
M. Mansurova Al-Farabi Kazakh National University http://orcid.org/0000-0002-9680-2758
A. G. Ospan Al-Farabi Kazakh National University http://orcid.org/0000-0002-1860-6997

DOI:

https://doi.org/10.26577/JMMCS.2022.v114.i2.08

Keywords:

web-tables, table extraction, table recognition, table understanding, knowledge base population

Abstract

In this paper, we propose a pipeline for automatically extracting tables from heterogeneous
Web sources, such as HTML pages, PDF files and images. Table extraction is an actively
developing area of Information Extraction, for which many applications, libraries and frameworks
are currently being developed. Nevertheless, most of these tools focus on a single
task, for example, only recognizing tables presented as images. We
propose combining these tasks into a single pipeline that supports the full cycle of table
processing, from search, recognition and extraction through to semantic analysis and
interpretation (table understanding).
Understanding tables and populating knowledge bases (knowledge graphs) with
the meaningful information they contain is the ultimate goal of our design. The first
part of the work presents methods for detecting tables in web pages and PDF documents, as well as
for automatically detecting the attributes and values of objects. The second part presents the
architecture and structure of the Qurma tool. The results demonstrate an implementation of the
parser for the Almaty-Ust-Kamenogorsk flight search domain.
