Thematic clustering of dialogues

Customer's motivation for launching the project:
Contact center (CC) analysts need to understand the topic composition of a dialogue corpus quickly so that the work can be automated without delay. Building such a taxonomy entirely by hand is very labor-intensive and itself calls for automation.

Description of the initial situation:
Automating the responses of contact center operators requires a taxonomy of the issues that clients raise. Such a taxonomy makes it possible to categorize requests and process them afterwards. Working with a large number of contact centers that cover different subject areas calls for a system that can analyze a dialogue corpus quickly. The goal was a tool that automatically builds a ready-made taxonomy for a dialogue corpus.

MIL Team solution:
We asked our partner for a labeled sample of synonymous dialogues, which helped us compare different models and tune their parameters for this specific task.
We tested several approaches: various neural network methods for paraphrase retrieval, and hierarchical multimodal topic models. The topic models performed better.
The final solution was packaged in a Docker container that implemented the business logic required by the partner.
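
As an illustration of how a labeled sample of synonymous dialogues can be used to compare candidate models, here is a minimal sketch. It assumes each model assigns every dialogue to a cluster and scores that assignment with pairwise precision, recall, and F1 against the labels. The metric, the function name pairwise_scores, and the toy data are assumptions made for this example; the source does not state which quality measure the team used.

```python
from itertools import combinations


def pairwise_scores(gold, predicted):
    """Pairwise precision, recall and F1 of a clustering against gold labels.

    gold, predicted: dicts mapping a dialogue id to a group or cluster label.
    A pair of dialogues counts as positive when both share the same label.
    """
    ids = sorted(set(gold) & set(predicted))
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_gold = gold[a] == gold[b]
        same_pred = predicted[a] == predicted[b]
        if same_gold and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: five dialogues labeled by topic (illustration only).
gold = {"d1": "billing", "d2": "billing", "d3": "roaming", "d4": "roaming", "d5": "sim"}

# Hypothetical cluster assignments produced by two candidate models.
model_a = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 2}
model_b = {"d1": 0, "d2": 1, "d3": 1, "d4": 1, "d5": 2}

scores = {
    name: pairwise_scores(gold, pred)
    for name, pred in [("model_a", model_a), ("model_b", model_b)]
}

# Pick the candidate with the highest pairwise F1.
best = max(scores, key=lambda name: scores[name][2])
print(best, scores[best])
```

The same function can be called once per parameter setting of a single model, which turns the labeled sample into a simple tuning criterion.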

Results:

Reduced load on the analyst
Reduced time to identify new categories
Identification of new intents in the request flow

Difficulties resolved:

Robustness of the model to changing topics
Stability of the model as the size of the text corpus changes
Correction of typos (including in corpora with highly specific vocabulary)

Customer: Telecom
Technology stack: TopicNet, BigARTM, Flask, Python, PyTorch, gensim, UMAP