Project task:
To build a framework for recognizing the structure and content of tables in document scans.
Project description:
Recognizing table structure is an important problem in text document processing. Particularly difficult cases are tables with missing borders, which rule out most algorithmic approaches based on straight-line detection. Merged table cells are another common but challenging case. Our task was to create a neural network framework that solves this problem and produces correct results even in the complex cases mentioned above. In addition, to train and validate the models, we had to collect a dataset containing the structure and content of tables in document scans.
Dataset:
The dataset for this problem consisted of a synthetic part and a real part. The synthetic part was built from the markup of the PubTabNet dataset: table images were generated by rendering the HTML markup of the tables with random styles (different fonts, colors, and border thicknesses). The real part consisted of document scans, which were subsequently annotated using Yandex.Toloka.
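As an illustration, a synthetic table image of this kind might be produced as in the sketch below. The use of imgkit as the HTML renderer and the specific style choices are assumptions made for the example, not the project's actual generation code.

```python
import random
import imgkit  # assumption: any HTML-to-image renderer could be used instead

def random_style() -> str:
    """Pick a random table style: font, color, and border thickness."""
    font = random.choice(["Arial", "Times New Roman", "Courier New"])
    color = random.choice(["#000000", "#222244", "#333333"])
    border = random.choice([0, 1, 2])  # 0 px simulates tables with missing borders
    return (
        f"table {{ font-family: '{font}'; color: {color}; border-collapse: collapse; }} "
        f"td, th {{ border: {border}px solid {color}; padding: 4px; }}"
    )

def render_table(table_html: str, out_path: str) -> None:
    """Wrap PubTabNet table markup in a randomly styled page and rasterize it."""
    page = f"<html><head><style>{random_style()}</style></head><body>{table_html}</body></html>"
    imgkit.from_string(page, out_path)

# table_html would come from a PubTabNet annotation record.
render_table("<table><tr><td>cell 1</td><td>cell 2</td></tr></table>", "table_0.png")
```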
Our solution:
To solve this problem, two approaches were considered:
- End-to-end model
- Two-stage model (see the pipeline sketch below)
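A minimal sketch of what a two-stage pipeline of this kind could look like: stage one detects cell bounding boxes with a generic object detector, stage two orders the detected cells toward a row/column grid. The choice of torchvision's Faster R-CNN and the cells_to_grid helper are illustrative assumptions, not the framework's actual components.

```python
import torch
import torchvision

# Stage 1 (assumed detector): Faster R-CNN fine-tuned to predict table cell boxes
# (two classes: background and "cell").
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.eval()

@torch.no_grad()
def detect_cells(image: torch.Tensor, score_threshold: float = 0.5) -> torch.Tensor:
    """Return bounding boxes of table cells found in a CHW float image tensor."""
    prediction = model([image])[0]
    keep = prediction["scores"] > score_threshold
    return prediction["boxes"][keep]

def cells_to_grid(boxes: torch.Tensor) -> torch.Tensor:
    """Stage 2 (hypothetical helper): sort detected cells top-to-bottom as a first
    step toward reconstructing rows, columns, and merged cells."""
    order = torch.argsort(boxes[:, 1])  # sort by top edge (y1)
    return boxes[order]

cells = detect_cells(torch.rand(3, 600, 800))  # random image, just to show the call
grid = cells_to_grid(cells)
```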
Results and examples of work:
The two-stage approach achieved a table cell detection quality of mAP ≈ 0.981.
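For reference, a cell detection mAP of this kind can be computed as sketched below; torchmetrics is an assumed choice of evaluation library, shown here on toy boxes only.

```python
from torch import tensor
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()

# Predicted and ground-truth cell boxes for one table image (toy values).
preds = [{"boxes": tensor([[10.0, 10.0, 120.0, 40.0]]),
          "scores": tensor([0.97]),
          "labels": tensor([1])}]
targets = [{"boxes": tensor([[11.0, 9.0, 118.0, 41.0]]),
            "labels": tensor([1])}]

metric.update(preds, targets)
print(metric.compute()["map"])  # mAP averaged over IoU thresholds 0.50:0.95
```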
Customer: ISP RAS
Technology stack: Python, PyTorch, OpenCV