Success Story

Finding the optimal number of topics

Customer's motivation for launching the project:
Topic modeling is used to study collections of text documents: it identifies hidden topics, each represented as a probability distribution over a set of words. The number of topics, however, is usually a hyperparameter of the topic model, so it must be chosen before training. In other words, topic models cannot determine the number of topics in a collection by themselves. Topic models are also incomplete and unstable. Incompleteness means that a single model is fundamentally unable to find all the topics present in the collection; exploring a collection fully typically requires training many topic models. Instability means that the result may depend significantly on initial settings or on details of the training algorithm. The final topics can thus be influenced by the model's initialization, the number of topics set before training, the order of documents during training, and the regularizers used and their order (when training an ARTM model).

Description of the initial situation:
  • there are different topic models (PLSA, LDA, ARTM);
  • topic models are unstable and incomplete;
  • the number of topics is a hyperparameter of the topic model;
  • the final topics depend both on the number of topics specified before training and on the model used;
  • the topics that a topic model produces may include uninterpretable and duplicate topics.

Project goals:
  • explore whether the optimal number of topics in a collection of documents can be determined, using a range of metrics and approaches from published work. The analysis should cover several publicly available collections of text documents and a range of topic models. The collections should preferably differ from one another, either in language or significantly in document length or language style;
  • propose a way to study document collections with topic models that takes into account and overcomes the incompleteness and instability of the models.
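
The first goal can be pictured as a sweep over candidate topic counts. The sketch below is only an illustration and not the project's code: select_num_topics and toy_score are hypothetical names, and toy_score is a stand-in for "train a topic model with this many topics and this seed, then compute a quality metric". Averaging over several seeds is one simple way to reduce the effect of model instability on the choice.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple


def select_num_topics(
    candidates: Iterable[int],
    train_and_score: Callable[[int, int], float],
    seeds: Iterable[int] = (0, 1, 2),
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate with the highest mean score across seeds."""
    seeds = list(seeds)
    mean_scores = {
        k: mean(train_and_score(k, seed) for seed in seeds)
        for k in candidates
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def toy_score(num_topics: int, seed: int) -> float:
    # Hypothetical stand-in for training a model and computing a metric.
    # The score peaks at 20 topics; seed-dependent noise mimics instability.
    rng = random.Random(seed * 1000 + num_topics)
    return -abs(num_topics - 20) + rng.uniform(-0.5, 0.5)


best_k, scores = select_num_topics(range(5, 41, 5), toy_score)
print(best_k)
```

In a real experiment, train_and_score would wrap an actual model (PLSA, LDA, or ARTM) and a metric chosen for the study. Different metrics may well disagree on the best value.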

MIL Team solution:
  • design of an experiment to compare approaches for determining the optimal number of topics in a collection of text documents;
  • preparation of datasets for the experiments;
  • implementation of popular topic models based on the TopicNet and BigARTM frameworks;
  • TopicBank, a tool that wraps topic model training and takes into account the incompleteness and instability of topic models.
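
The idea of collecting topics over many training runs can be sketched as follows. This is a simplified illustration under stated assumptions, not TopicBank's actual algorithm: the text above does not describe its selection criteria, so the Hellinger distance, the 0.3 threshold, and the names hellinger and update_bank are all hypothetical. Each topic is treated as a probability distribution over words, and a topic from a new run is kept only if it is far from every topic already collected, which filters out duplicates.

```python
import math
from typing import Dict, List

Topic = Dict[str, float]  # word -> probability; values sum to 1


def hellinger(p: Topic, q: Topic) -> float:
    """Hellinger distance between two word distributions, in [0, 1]."""
    words = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in words
    )
    return math.sqrt(total / 2.0)


def update_bank(
    bank: List[Topic], new_topics: List[Topic], min_distance: float = 0.3
) -> int:
    """Append topics that are far from every banked topic; return count added."""
    added = 0
    for topic in new_topics:
        if all(hellinger(topic, kept) >= min_distance for kept in bank):
            bank.append(topic)
            added += 1
    return added


# Topics from two hypothetical training runs (made-up toy values).
run_1 = [
    {"gene": 0.5, "cell": 0.3, "protein": 0.2},
    {"star": 0.6, "galaxy": 0.4},
]
run_2 = [
    {"gene": 0.45, "cell": 0.35, "protein": 0.2},  # near-duplicate of run_1[0]
    {"market": 0.7, "price": 0.3},                  # topic not seen before
]

bank: List[Topic] = []
for topics in (run_1, run_2):
    update_bank(bank, topics)
print(len(bank))
```

A fuller version would also need an interpretability check before a topic is accepted, since the topics a model produces may be uninterpretable as well as duplicated.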

To build the models, we used:
publicly available collections of text documents for training: popular science articles from the PostNauka resource; the popular NLP datasets Twenty Newsgroups, Reuters, and Brown; good articles from Russian Wikipedia; posts from the StackOverflow resource; and WikiRef220.

Modeling results:
  • several datasets were prepared for topic modeling experiments with the TopicNet and BigARTM libraries; some of them have been made publicly available;
  • a system was built for analyzing collections of text documents by training topic models multiple times. The basic version, which implements an algorithm for selecting topics across multiple training runs, is publicly available. A second, closed version also provides a user interface for conveniently and quickly exploring the topics of a newly trained topic model.

Customer: Joint Stock Company "Information and Analytical Center", Nur-Sultan, Kazakhstan
Technology stack: TopicNet, BigARTM, Python
NLP Research