OCR Quality Comparison

Motivation for the customer to launch the project:
The customer needed to process large volumes of documents, which exposed two shortcomings: most of the available open solutions are too slow, and the scenarios in which a solution stops delivering acceptable text recognition quality on a document are not defined.

Description of the initial situation:

there is a set of open solutions for the OCR task;
a set of documents and presentations on which text recognition is required is provided.

A full comparison of end-to-end OCR systems requires document annotations that include bounding boxes as well as text content. Hence the difficulty: manually annotating hundreds of documents with both bounding boxes and text is time-consuming.

Project goals:
Create tools to determine the best solution and the limits of its applicability.

MIL Team solution:
A set of tools was created for testing TD+OCR (text detection plus OCR) solutions and for efficiently creating datasets of documents in a “natural” environment. Using these tools, a team of two people created a dataset of 1000 images within two weeks, marking the boxes of individual words on each page (from this figure, the man-hours needed for n pages can be estimated). The tools make it possible to flag images on which the solutions show low accuracy and to attribute specific algorithm errors to image parameters (sheet rotation, lighting, shadows, colored text and background).

To build the model we used:

Dataset of electronic documents provided by the customer in pdf format;
Solutions for the TD+OCR problem in the public domain (Tesseract, EasyOCR).

Simulation results:

Testing tools for TD+OCR solutions;
Five datasets of varying complexity from photographs and scans of documents and presentations.

The following metrics were implemented for the comparison: word accuracy, per-word Levenshtein distance, and an IoU-based F1-score for box matching. The main comparison was between Tesseract, EasyOCR, and our internal OCR model, idog. According to the results, Tesseract is most effective on highly readable, aligned documents; otherwise, EasyOCR and idog show better results both in word box detection and in the final quality of character recognition.
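
To make the three metrics concrete, here is a minimal sketch of how they could be computed for one page. It is not the project's actual tooling. The function names, the (x_min, y_min, x_max, y_max) box format, the greedy matching strategy, the 0.5 IoU threshold, and the exact definition of word accuracy (share of ground-truth words whose matched prediction is identical) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_boxes(pred: List[Box], truth: List[Box],
                threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching, best IoU first (an assumed strategy)."""
    candidates = sorted(
        ((iou(p, t), pi, ti)
         for pi, p in enumerate(pred)
         for ti, t in enumerate(truth)),
        reverse=True,
    )
    used_pred, used_truth, matches = set(), set(), []
    for score, pi, ti in candidates:
        if score < threshold:
            break  # remaining pairs overlap even less
        if pi in used_pred or ti in used_truth:
            continue
        used_pred.add(pi)
        used_truth.add(ti)
        matches.append((pi, ti))
    return matches


def evaluate(pred_boxes: List[Box], pred_words: List[str],
             true_boxes: List[Box], true_words: List[str],
             threshold: float = 0.5) -> Dict[str, float]:
    """Compute box-matching F1, word accuracy and mean per-word distance."""
    matches = match_boxes(pred_boxes, true_boxes, threshold)
    tp = len(matches)
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(true_boxes) if true_boxes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    exact = sum(pred_words[pi] == true_words[ti] for pi, ti in matches)
    distances = [levenshtein(pred_words[pi], true_words[ti])
                 for pi, ti in matches]
    return {
        "f1": f1,
        "word_accuracy": exact / len(true_words) if true_words else 0.0,
        "mean_levenshtein": (sum(distances) / len(distances)
                             if distances else 0.0),
    }
```

In this sketch, text metrics are computed only over matched box pairs, so a detection miss lowers word accuracy (the word counts as unrecognized) but does not enter the mean Levenshtein distance. A different convention, such as charging the full word length for each missed word, would be equally defensible.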

Customer: ISP RAS
Technology stack: Python, OpenCV, Labelme