Task:
On the part of the customer, there was a need for very accurate detection of words in the document image, selection of key blocks of text (paragraphs, headings, columns), and selection of baselines.
Problem:
Word boxes must cover words accurately and be reproducible across different scans of the same document. The solution should be lightweight and run on machines without a GPU.
Solution:
Result:
For Text Detection, the F1-score metric (IOU with a high threshold for classifying a box as a TP) and a visual quality assessment were used. For Block Detection, the F1-score was assessed based on the correct assignment of word boxes to blocks.
Technologies:
Numpy, pandas, sklearn, torch, Ya. Toloka. The main model architectures that were used in this project were based on Unet and Stacked Unet. PSP-Net and HRNet architecture were also tested
On the part of the customer, there was a need for very accurate detection of words in the document image, selection of key blocks of text (paragraphs, headings, columns), and selection of baselines.
Problem:
Word boxes must cover words accurately and be reproducible across different scans of the same document. The solution should be lightweight and run on machines without a GPU.
Solution:
- Text detection. Similar to the previous project, we prepared and semi-automatically labeled a dataset agreed with the customer for testing and additional training of neural network models. The dataset was quite diverse and covered all the main cases. To train the models, an artificially generated DDI-100 dataset was used with additional additions of augmentations that corresponded to the customer’s cases.
- Block Detection. To identify blocks in image documents, we built a general heuristic algorithm based on the results of the Text Detection model, resistant to document tilt and documents of different structures, combining one- and two-column fragments, signatures, headings, etc. Inside each block Words were numbered line by line, with additional filtering in case of errors in the Text Detection model. To assess the quality of Block Detection, a specially prepared dataset was also used, including various images for each document to test the reproducibility of the approach.
- Baselines. For the baselines, a neural network approach was also used. The models were trained on the DDI-100 subset, with additional, automatically labeled word baseline information.
Result:
For Text Detection, the F1-score metric (IOU with a high threshold for classifying a box as a TP) and a visual quality assessment were used. For Block Detection, the F1-score was assessed based on the correct assignment of word boxes to blocks.
Technologies:
Numpy, pandas, sklearn, torch, Ya. Toloka. The main model architectures that were used in this project were based on Unet and Stacked Unet. PSP-Net and HRNet architecture were also tested