Zero-shot image recognition

Description:
Neural networks currently achieve high quality on most pattern recognition tasks, but they require large amounts of labeled data for training. For most tasks, the annotation also fixes the set of entities the network can recognize. At the same time, the Internet contains large amounts of loosely structured information about images posted by users, in the form of pairs of images and text descriptions. Multimodal models, i.e. models that jointly process data from different modalities, make it possible to use such raw data to solve image recognition problems.
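To illustrate how a model trained on image-text pairs can recognize classes it has no labels for, the sketch below scores an image embedding against text embeddings of class prompts and picks the closest one. It is a minimal illustration, not the project's implementation: the function names, the prompt wording, the temperature value and the three-dimensional embeddings are all made up, and in practice the embeddings would come from the image and text encoders of a CLIP-like model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_classify(image_embedding, class_text_embeddings, temperature=0.01):
    """Return a probability per class prompt via softmax over scaled similarities.

    `temperature` is an assumed value chosen for the toy data below.
    """
    names = list(class_text_embeddings)
    logits = [cosine(image_embedding, class_text_embeddings[n]) / temperature
              for n in names]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

# Hypothetical embeddings standing in for text-encoder outputs.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical embedding standing in for an image-encoder output.
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the class set is just a list of text prompts, adding a new class means adding a new prompt, with no retraining.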

Our task was to study existing multimodal approaches to various computer vision problems (multi-class/multi-label classification, semantic segmentation, object detection), improve them, and develop new approaches that use external knowledge (LLMs, semantic networks, etc.) to improve accuracy and generalization.

Solution:
As part of the project, more than 20 zero-shot models were implemented and tested on multi-class/multi-label classification, semantic segmentation and object detection. A comparative analysis of these models identified the advantages and disadvantages of the various approaches. For the weaknesses identified, improvements were proposed and tested. The proposed modifications to the MaskCLIP and Zero-shot MaskFormer models achieve SOTA quality on zero-shot semantic segmentation.

Another focus of the project was to explore whether external knowledge, such as LLMs or semantic networks, can improve the predictions of zero-shot approaches. We investigated using such knowledge both at inference time and during training. The result is an approach that improves quality by increasing the diversity of the training sample. The largest gain was observed on data containing detailed information about objects in the image (color, quantity, properties), which is a difficult case for such methods.

The project also identified shortcomings in existing approaches to evaluating models that predict over an open set of classes. Classic metrics such as accuracy, precision or recall count a prediction as an error unless it exactly matches the object’s class in the dataset. For models working with an open set of classes this assumption is incorrect: the model may predict a synonym, hypernym or hyponym of the true label, which classical metrics count as an error even though the prediction may be semantically valid. To address this, an evaluation approach was developed that uses semantic networks with a given class hierarchy and Markov random fields.
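The gap between exact-match and hierarchy-aware scoring can be shown with a small sketch. This is a deliberately simplified illustration, not the approach developed in the project: it does not use Markov random fields, and the hierarchy, the synonym table and all function names are hypothetical. A real semantic network such as WordNet would take the place of the two dictionaries.

```python
# Hypothetical toy hierarchy: each class maps to its parent (hypernym).
PARENT = {
    "husky": "dog",
    "dog": "animal",
    "cat": "animal",
    "sofa": "furniture",
    "chair": "furniture",
}
# Hypothetical synonym table: alternative name -> canonical class name.
SYNONYMS = {"couch": "sofa"}

def canonical(label):
    """Map a label to its canonical class name."""
    return SYNONYMS.get(label, label)

def ancestors(label):
    """All hypernyms of a canonical label, nearest first."""
    chain = []
    node = PARENT.get(label)
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    return chain

def relation(predicted, true):
    """Classify how a predicted label relates to the true label."""
    p, t = canonical(predicted), canonical(true)
    if p == t:
        return "exact" if predicted == true else "synonym"
    if p in ancestors(t):
        return "hypernym"  # prediction is more general than the true label
    if t in ancestors(p):
        return "hyponym"   # prediction is more specific than the true label
    return "mismatch"

def accuracy(pairs, accepted):
    """Share of (predicted, true) pairs whose relation is in `accepted`."""
    hits = sum(relation(p, t) in accepted for p, t in pairs)
    return hits / len(pairs)

pairs = [
    ("couch", "sofa"),    # synonym
    ("animal", "husky"),  # hypernym
    ("husky", "dog"),     # hyponym
    ("cat", "dog"),       # mismatch
    ("chair", "chair"),   # exact
]
strict = accuracy(pairs, accepted={"exact"})
relaxed = accuracy(pairs, accepted={"exact", "synonym", "hypernym", "hyponym"})
print(strict, relaxed)
```

On this toy data only one of the five predictions is an exact match, while four are semantically related to the true label. Accepting every hypernym is a crude choice, since a very general answer such as "animal" would always count as correct; a practical metric would likely weight matches by their distance in the hierarchy.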

Results:
- SOTA model for semantic segmentation in annotation-free mode (no training on segmentation datasets required) and zero-shot mode
- An approach to evaluating zero-shot models that accounts for the model’s ability to work with an open set of classes
- Fine-tuning methods for CLIP-like models that use external knowledge to improve recognition of objects and their attributes in complex scenes

Technology stack:
Python, PyTorch, Transformers