Project task: Build a framework for recognizing the structure and content of tables in scanned documents.
Project description: Recognizing table structure is an important problem in text document processing. Tables with missing borders are particularly difficult, because most algorithmic approaches rely on detecting straight lines and cannot be applied to them. Merged table cells are another complex but common case. Our task was to create a neural network framework that solves the problem and produces correct results even in these complex cases. In addition, training and validating the models required collecting a dataset containing the structure and content of tables on document scans.
Dataset: The dataset consisted of a synthetic part and a real part. The synthetic part was built from the markup of the PubTabNet dataset: to generate table images, the HTML markup of its tables was rendered with random styles (different fonts, colors, border thickness). The real part consisted of document scans, which were then annotated using Yandex.Toloka.
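The random styling step could look roughly like the sketch below. It only wraps a table's HTML in a document with a randomly chosen stylesheet; rendering the result to an image (for example with a headless browser) is left out. The font list, value ranges, and the names `random_table_css` and `styled_document` are illustrative assumptions, not the values or code used in the project.

```python
import random

# Hypothetical font pool; the fonts actually used in the project are not listed.
FONTS = ["Arial", "Times New Roman", "Courier New", "Georgia", "Verdana"]


def random_color(rng, low, high):
    """Return a random hex color with each channel drawn from [low, high]."""
    return "#{:02x}{:02x}{:02x}".format(*(rng.randint(low, high) for _ in range(3)))


def random_table_css(rng):
    """Build a small stylesheet with a random font, colors, and border thickness."""
    font = rng.choice(FONTS)
    font_size = rng.randint(10, 18)      # px, assumed range
    border_px = rng.choice((1, 2, 3))    # assumed thickness options
    padding_px = rng.randint(2, 8)       # assumed range
    text_color = random_color(rng, 0, 96)      # dark text
    border_color = random_color(rng, 0, 160)   # dark to mid-gray borders
    return (
        f"table {{ border-collapse: collapse; font-family: '{font}'; "
        f"font-size: {font_size}px; color: {text_color}; }}\n"
        f"td, th {{ border: {border_px}px solid {border_color}; "
        f"padding: {padding_px}px; }}"
    )


def styled_document(table_html, rng):
    """Wrap table markup in a full HTML page that carries a random style."""
    css = random_table_css(rng)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><style>\n'
        f"{css}\n"
        "</style></head>\n"
        f"<body>{table_html}</body></html>"
    )
```

Passing a seeded `random.Random` instance makes each generated sample reproducible, which helps when a rendered image has to be regenerated later.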
Our solution: We considered two approaches to this problem:
End-to-end model
Two-stage model
The EDD model (introduced with PubTabNet) was chosen for the first approach. It consists of a visual encoder and two linked text decoders. The first decoder extracts the HTML markup from the image; the second decoder uses the attention map of the first to solve the OCR problem inside each predicted block.
The second approach solves structure recognition and OCR separately. Tesseract was used as the OCR engine. To extract table cells, a segmentation model highlights the table boundaries, which are then converted into cell boxes. An FPN was used as the segmentation model. To improve quality, a stack of three models was applied sequentially, and self-distillation was used to speed up the final solution. To obtain sharper boundaries, the model was also trained in the Pix2pix manner with a patch discriminator.
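The conversion from a boundary mask to cell boxes can be sketched as follows. This is a minimal illustration, not the project's post-processing: it assumes the segmentation output has already been thresholded into a binary mask (1 = boundary pixel) and treats every 4-connected region of non-boundary pixels as one cell. The function name `cells_from_boundary_mask` and the toy mask are hypothetical.

```python
from collections import deque

# Toy mask: 1 = boundary pixel, 0 = cell interior.
# The right-hand cell spans two rows, i.e. it is a merged cell.
EXAMPLE_MASK = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]


def cells_from_boundary_mask(mask):
    """Return bounding boxes (x0, y0, x1, y1), inclusive, of each interior region."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over 4-connected non-boundary pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and not mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Because a merged cell is a single connected region, it comes out as one box without special handling. In a real image the area outside the table would also form a region and would need to be filtered out, and each resulting box could then be cropped and passed to the OCR engine.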
Results and examples of work: The two-stage approach reached an mAP of about 0.981 on table cell detection.
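For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a single class of boxes; mAP is then the mean of such values over classes or IoU thresholds. The exact evaluation protocol behind the reported figure is not stated here, so the IoU threshold of 0.5, the greedy matching rule, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of boxes (x0, y0, x1, y1) with exclusive max corner."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """All-point interpolated AP.

    preds: list of (score, box); gts: list of ground-truth boxes.
    Each prediction is greedily matched to the best still-unmatched ground truth.
    """
    if not gts:
        return 0.0
    matched = set()
    hits = []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1)
        else:
            hits.append(0)

    # Precision and recall after each prediction, in descending score order.
    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(gts))

    # Make precision non-increasing from right to left (interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

With two ground-truth cells and three predictions, of which the first and third match, this yields an AP of 5/6: the false positive in between lowers precision for the second half of the recall range.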