Multilingual topic model

Customer's motivation for the project:
the customer needed to add new functionality to their product: the ability to search for translations of a scientific article across the most widely used languages.

Initial situation:
Antiplagiarism's product could not search for translations of scientific articles, so this functionality had to be added.

Project goals:
build a topic model that solves two problems with high quality: semantic search for translations of scientific articles, and classification of scientific articles by scientific heading.
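
As a rough illustration of the search task, and not the team's actual implementation: if a multilingual topic model maps every article, whatever its language, to a vector of topic probabilities in one shared topic space, a translation can be looked up as the nearest neighbour by cosine similarity. The topic vectors, the article ids, and the function name `find_translation` below are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_translation(query_vec, index):
    """Return the id of the indexed article whose topic vector is closest to query_vec."""
    return max(index, key=lambda doc_id: cosine(query_vec, index[doc_id]))


# Hypothetical topic distributions (3 topics) of English articles.
english_index = {
    "en_physics": [0.80, 0.15, 0.05],
    "en_biology": [0.10, 0.85, 0.05],
    "en_economics": [0.05, 0.10, 0.85],
}

# Hypothetical topic distribution of a Russian article about biology.
russian_query = [0.12, 0.80, 0.08]

best_match = find_translation(russian_query, english_index)
print(best_match)
```

Because both articles are described in the same topic space, no word-level translation is needed at query time. A production index would typically replace the linear scan with an approximate nearest-neighbour structure.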

MIL Team solution:
the team’s experience in topic modeling and microservice architecture made it possible to build a service that searches for translations of scientific articles and determines their scientific headings, and that can be run in a virtual machine.

To build the model we used:

A parallel corpus of scientific articles from the elibrary website;
A parallel corpus of Wikipedia articles in 100 languages;
Scientific heading labels from several classification schemes (UDC, OECD).
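
One way heading labels like these could be combined with topic vectors, sketched under assumptions rather than taken from the project: average the topic vectors of the labelled articles for each heading, then assign a new article to the heading with the nearest centroid. The label strings, the vectors, and the helper names are placeholders; since sklearn is part of the stack, a trained classifier is just as plausible.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fit_centroids(labelled_vectors):
    """Group topic vectors by heading label and average each group."""
    groups = defaultdict(list)
    for label, vec in labelled_vectors:
        groups[label].append(vec)
    return {label: centroid(vecs) for label, vecs in groups.items()}


def predict_heading(vec, centroids):
    """Return the heading label whose centroid is closest to vec."""
    return min(centroids, key=lambda label: squared_distance(vec, centroids[label]))


# Hypothetical (heading label, topic vector) pairs; values are made up.
training = [
    ("UDC 53", [0.80, 0.15, 0.05]),
    ("UDC 53", [0.70, 0.20, 0.10]),
    ("UDC 57", [0.10, 0.85, 0.05]),
    ("UDC 57", [0.20, 0.70, 0.10]),
]

centroids = fit_centroids(training)
predicted = predict_heading([0.15, 0.75, 0.10], centroids)
print(predicted)
```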

Modeling results:

A topic model of scientific headings;
A virtual machine on which the model can be run.

Client: Antiplagiarism
Technology stack: gRPC, Python, sklearn, BigARTM