X-CUT++: A LAYOUT-AWARE OCR PIPELINE FOR KAZAKH-LANGUAGE NEWSPAPERS
DOI: 10.26577/JMMCS1302202613
Keywords: layout analysis, projection profiles, adaptive thresholding, HSV color segmentation, Hough transform, OCR, low-resource languages, Kazakh, Tesseract, document structure recovery
Abstract
We address the structural decomposition of complex multi-column Kazakh-language newspaper pages prior to optical character recognition. We propose X-Cut++, a hybrid, fully interpretable layout-aware pipeline that combines adaptive binarization, smoothed horizontal and vertical projection profiles, morphological dilation, colour-aware region detection in HSV space, a probabilistic Hough fallback for separator lines, and a rule-based post-OCR structural parser that reconstructs the canonical title/abstract/author/body article structure. The method is formulated as a cascade of one-dimensional projection cuts with recursive vertical and horizontal subdivision, constrained by geometric and area-based thresholds, which makes the segmentation deterministic and reproducible. Experiments on a multi-issue dataset of the newspaper Egemen Qazaqstan (5 issues, Jan–Feb 2024, 72 editorial pages, 300 DPI) show that X-Cut++ consistently decomposes full pages into coherent article-level fragments. The system produces 230 fragments in total (3.19 per page on average). On a manually verified subset of 15 fragments, the structural parser extracts every title and abstract correctly and identifies every author line that is present, which supports the reliability of the post-OCR structural reconstruction on this subset.
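The cascade of projection cuts described in the abstract can be illustrated with a minimal sketch of a classic recursive XY-cut over a binary ink mask. This is not the authors' implementation: the function names (`widest_gap`, `xy_cut`) and the parameters `min_gap` and `min_area` are hypothetical stand-ins for the geometric and area-based thresholds mentioned above, and the sketch uses raw (unsmoothed) profiles, omitting binarization, dilation, HSV region detection, and the Hough fallback.

```python
import numpy as np


def widest_gap(profile, min_gap):
    """Return (start, end) of the widest run of empty bins, or None.

    The caller trims empty margins first, so every run found here is interior.
    """
    best, start = None, None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            length = i - start
            if length >= min_gap and (best is None or length > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(ink, top=0, left=0, min_gap=3, min_area=16):
    """Recursively split a binary mask (1 = ink) into (top, left, bottom, right) boxes."""
    rows = np.flatnonzero(ink.sum(axis=1))
    cols = np.flatnonzero(ink.sum(axis=0))
    if rows.size == 0:
        return []  # nothing but background
    # Trim empty margins so the box is tight around the ink.
    r0, r1 = int(rows[0]), int(rows[-1]) + 1
    c0, c1 = int(cols[0]), int(cols[-1]) + 1
    ink = ink[r0:r1, c0:c1]
    top, left = top + r0, left + c0
    height, width = ink.shape
    if height * width >= min_area:
        # axis=0 gives the column profile (vertical cut); axis=1 the row profile.
        for axis in (0, 1):
            gap = widest_gap(ink.sum(axis=axis), min_gap)
            if gap is None:
                continue
            a, b = gap
            if axis == 0:
                first = xy_cut(ink[:, :a], top, left, min_gap, min_area)
                second = xy_cut(ink[:, b:], top, left + b, min_gap, min_area)
            else:
                first = xy_cut(ink[:a, :], top, left, min_gap, min_area)
                second = xy_cut(ink[b:, :], top + b, left, min_gap, min_area)
            return first + second
    return [(top, left, top + height, left + width)]


if __name__ == "__main__":
    # Toy page: two stacked blocks in the left column, one tall block on the right.
    page = np.zeros((20, 30), dtype=int)
    page[2:8, 2:10] = 1
    page[12:18, 2:10] = 1
    page[2:18, 16:28] = 1
    for box in xy_cut(page):
        print(box)
```

Because each split depends only on the profiles and the fixed thresholds, the same input always yields the same boxes, which is the sense in which such a cascade is deterministic. Trying the vertical cut before the horizontal one is a design choice of this sketch that favours column separation on multi-column pages.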










