QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

DOI:

https://doi.org/10.26577/JMMCS.2022.v114.i2.08

Keywords:

web-tables, table extraction, table recognition, table understanding, knowledge base population

Abstract

In this paper, we propose a pipeline for automatically extracting tables from heterogeneous
Web sources, such as HTML pages, PDF files and images. Table extraction is an actively
developing area of Information Extraction, and many applications, libraries and frameworks
are being built for it. Nevertheless, most of these tools address a single task, for example,
only recognizing tables presented as images. We propose combining these tasks into a single
pipeline that supports the full cycle of table processing: it starts with table search,
recognition and extraction and ends with semantic analysis and interpretation, i.e. table
understanding. The ultimate goal of our design is to understand tables and to populate
knowledge bases (knowledge graphs) with the meaningful information these tables contain.
The first part of the work presents methods for detecting tables in web pages and PDF
documents, as well as for automatically detecting the attributes and values of objects. The
second part presents the architecture and structure of the Qurma tool. As a result, we present
an implementation of the parser for the Almaty-Ust-Kamenogorsk flight search domain.
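
The HTML stage of such a pipeline can be illustrated with a minimal sketch. It is not the Qurma implementation: the names TableParser, extract_tables and to_records are hypothetical, the sample markup and its flight values are invented placeholders, and the sketch assumes flat (non-nested) tables whose first row holds the attribute names. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalise internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def to_records(table):
    """Assume the first row names the attributes; map each later row to a dict."""
    if not table:
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


# Invented placeholder markup, not data from the paper.
SAMPLE_HTML = """
<table>
  <tr><th>Flight</th><th>From</th><th>To</th></tr>
  <tr><td>XX 101</td><td>Almaty</td><td>Ust-Kamenogorsk</td></tr>
  <tr><td>XX 102</td><td>Ust-Kamenogorsk</td><td>Almaty</td></tr>
</table>
"""

records = [rec for table in extract_tables(SAMPLE_HTML) for rec in to_records(table)]
print(records)
```

Each resulting dictionary pairs attribute names with the values of one object, which is the shape a later table understanding step could match against a knowledge base. PDF and image sources would need separate detection and recognition steps that this sketch does not cover.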


Published

2022-06-24