X-CUT++: A LAYOUT-AWARE OCR PIPELINE FOR KAZAKH-LANGUAGE NEWSPAPERS

DOI:

10.26577/JMMCS1302202613

Keywords:

layout analysis, projection profiles, adaptive thresholding, HSV color segmentation, Hough transform, OCR, low-resource languages, Kazakh, Tesseract, document structure recovery

Abstract

We address the structural decomposition of complex multi-column Kazakh-language newspaper pages prior to optical character recognition. We propose X-Cut++, a hybrid, fully interpretable layout-aware pipeline that combines adaptive binarization, smoothed horizontal and vertical projection profiles, morphological dilation, colour-aware region detection in HSV space, a probabilistic Hough fallback for separator lines, and a rule-based post-OCR structural parser that reconstructs the canonical title/abstract/author/body structure of each article. The method is formulated as a cascade of one-dimensional projection cuts with recursive vertical and horizontal subdivision, constrained by geometric and area-based thresholds, which makes the segmentation deterministic and reproducible. Experiments on a multi-issue dataset of the newspaper Egemen Qazaqstan (5 issues, Jan–Feb 2024, 72 editorial pages, 300 DPI) show that X-Cut++ consistently decomposes full pages into coherent article-level fragments. The system produces 230 fragments in total (3.19 per page on average). On a manually verified subset of 15 fragments, the structural parser extracts every title and abstract correctly and identifies every author line that is present, which supports the reliability of the post-OCR structural reconstruction.
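
The recursive projection-cut idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`profile`, `widest_gap`, `xy_cut`) and the parameters `min_gap` and `min_area` are hypothetical stand-ins for the geometric and area-based thresholds, and the sketch assumes an already binarized page (1 = ink, 0 = background). Profile smoothing, dilation, HSV colour detection, and the Hough fallback are left out.

```python
from typing import List, Optional, Tuple

# A box is (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def profile(img: List[List[int]], box: Box, axis: int) -> List[int]:
    """Ink count per row (axis=0) or per column (axis=1) inside the box."""
    top, left, bottom, right = box
    if axis == 0:
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def widest_gap(values: List[int], min_gap: int) -> Optional[Tuple[int, int]]:
    """Widest interior run of zeros of length >= min_gap, as (start, end)."""
    best = None
    start = None
    # The trailing sentinel closes a run that reaches the end of the profile.
    for i, v in enumerate(values + [1]):
        if v == 0:
            if start is None:
                start = i
        elif start is not None:
            # Runs touching either edge are margins, not separators.
            interior = start > 0 and i < len(values)
            if interior and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best


def area(box: Box) -> int:
    top, left, bottom, right = box
    return (bottom - top) * (right - left)


def xy_cut(img: List[List[int]], box: Optional[Box] = None,
           min_gap: int = 2, min_area: int = 4) -> List[Box]:
    """Recursively split a box at whitespace gaps; columns first, then rows."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box
    for axis in (1, 0):  # 1: vertical cut between columns, 0: horizontal cut
        gap = widest_gap(profile(img, box, axis), min_gap)
        if gap is None:
            continue
        start, end = gap
        if axis == 1:
            first = (top, left, bottom, left + start)
            second = (top, left + end, bottom, right)
        else:
            first = (top, left, top + start, right)
            second = (top + end, left, bottom, right)
        # Area threshold: reject cuts that would create tiny fragments.
        if min(area(first), area(second)) < min_area:
            continue
        return (xy_cut(img, first, min_gap, min_area)
                + xy_cut(img, second, min_gap, min_area))
    return [box]  # no admissible cut: this box is a leaf fragment


# Toy page: two columns separated by a 2-pixel gutter; the left column
# holds two blocks separated by two blank rows.
page = [[1] * 4 + [0] * 2 + [1] * 4 for _ in range(6)]
for r in (2, 3):
    page[r][0:4] = [0] * 4

fragments = xy_cut(page)
print(fragments)
```

Because every decision depends only on the projection profiles and fixed thresholds, running the sketch twice on the same input yields the same fragments, which is the sense in which such a cascade is deterministic.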

Author Biographies

  • Assel Ospan, Farabi University, Almaty, Kazakhstan

    Assel Ospan – Doctor of Philosophy (PhD) in System Engineering, Senior Lecturer at the Department of Artificial Intelligence and Big Data, Al-Farabi Kazakh National University (Almaty, Kazakhstan, e-mail: assel.ospan@kaznu.edu.kz)

  • Madina Mansurova, Farabi University, Almaty, Kazakhstan

    Madina Mansurova – Candidate of Physical and Mathematical Sciences, Professor, Dean of the Faculty of Mechanics and Mathematics, Al-Farabi Kazakh National University (Almaty, Kazakhstan, e-mail: madina.mansurova@kaznu.edu.kz)

  • Aisha Sailau, Farabi University, Almaty, Kazakhstan

    Aisha Sailau – Researcher at Al-Farabi Kazakh National University (Almaty, Kazakhstan, e-mail: aishasailau3@gmail.com)

  • Akniyet Kalzhan, Farabi University, Almaty, Kazakhstan

    Talshyn Sarsembayeva – Doctor of Philosophy (PhD) in AI in medicine, Senior Lecturer and Deputy Head for Research and Innovation at the Department of Artificial Intelligence and Big Data, Al-Farabi Kazakh National University (Almaty, Kazakhstan, e-mail: talshyn.sagdatbek@kaznu.edu.kz)

Published

2026-06-20

How to Cite

X-CUT++: A LAYOUT-AWARE OCR PIPELINE FOR KAZAKH-LANGUAGE NEWSPAPERS. (2026). Journal of Mathematics, Mechanics and Computer Science, 130(2), 181-198. https://doi.org/10.26577/JMMCS1302202613