Success Story

Creating a thematic text segmentation model

Project goals:
The goals of the project were to study existing methods for text segmentation; implement baselines based on training-free methods and on topic modeling over a corpus of documents; develop and test generative-summarization methods for segmenting dialogues; select suitable datasets for comparison; study the applicability of these approaches to Russian-language data; and prepare a scientific publication.

MIL Team solution:
We developed SumSeg, a segmentation algorithm based on neural summarization. It generates a document summary, extracts simple sentences from it, obtains sentence embeddings, and applies the TextTiling algorithm to place segment boundaries.

To build the model we used:
Various generative summarization models were used: BART, FLAN-T5, and LED for English-language dialogues, and mBART, ruT5, and ruGPT3 for Russian-language ones, as well as popular conversational datasets such as SuperDialSeg, TIAGE, and QMSum, plus internal customer data. BERTopic and BigARTM were used for topic modeling in the baselines. Additionally, the ChatGPT3.5 and ChatGPT4 models were tested.

Modeling results:
  1. Most of the available datasets were studied (parsers exist for Wiki727k, AMI, SuperDialSeg, DialSeg711, Doc2Dial, TIAGE, QMSum); the most popular dialogue datasets (SuperDialSeg, TIAGE, QMSum) were selected, and Russian-language dialogues (Sber) on banking and educational topics were used.
  2. The most popular high-quality baselines for comparison (BERTSeg, TextTiling+BigARTM) were implemented in modular library form; for completeness, current methods based on neural topic modeling (TextTiling+BERTopic) were added.
  3. In addition, LLM-based methods are represented by the family of ChatGPT models. A scientifically novel approach to segmentation based on generative summarization (SumSeg) was proposed and investigated; it outperforms most existing methods on segmentation metrics.
  4. The approach works on any conversational data, is best suited to transcribed, i.e., highly noisy, data (QMSum), and can be applied to texts of any length thanks to the proposed chunking approach.
  5. The results of the study were written up in the research article "Leveraging summarization for unsupervised topic segmentation of long dialogues" and submitted to the EACL 2024 conference (CORE rank A); during the discussion with reviewers, the results were supplemented with new relevant baselines (CohereSeg, DialSTART, HyperSeg). Several summarization models in English and Russian were studied, among which BART-samsum, ruGPT3, and mBART stood out in quality.
  6. The limits of applicability of the algorithms are summarized in a table of conclusions based on internal data.
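The chunking idea from item 4 (splitting a long dialogue into overlapping windows so that each window fits the summarizer's input limit) can be sketched as follows; the window and overlap sizes are arbitrary stand-ins, not the project's actual values:

```python
def chunk_turns(turns, max_turns=4, overlap=1):
    """Split a list of dialogue turns into overlapping windows so that each
    window fits a summarizer's context limit. The overlap keeps boundary
    context shared between neighbouring chunks."""
    step = max_turns - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + max_turns])
        if start + max_turns >= len(turns):
            break
    return chunks

turns = [f"turn {i}" for i in range(10)]
chunks = chunk_turns(turns)
print(len(chunks))  # → 3 overlapping windows covering all 10 turns
```

Each chunk is summarized and segmented independently, and the overlap lets boundaries near chunk edges be reconciled, which is what makes the method length-agnostic.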

The results of the project can be used for work with the internal data of various organizations, including dialogue datasets in the banking domain.

Technology stack:
Various text-processing and data-analysis tools were used, including NLTK and spaCy, along with machine learning techniques such as cosine similarity and the Savitzky-Golay filter.
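As an illustration of how the last two techniques fit together, the sketch below computes the cosine similarity of adjacent sentence embeddings and smooths the resulting score curve with SciPy's Savitzky-Golay filter before searching for local minima; the random unit vectors stand in for real sentence embeddings, and the filter parameters are illustrative:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 16))                 # 12 stand-in "sentence embeddings"
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

# Cosine similarity of each pair of neighbouring sentences (dot product
# of unit vectors), then Savitzky-Golay smoothing of the score curve.
sims = np.sum(emb[:-1] * emb[1:], axis=1)
smooth = savgol_filter(sims, window_length=5, polyorder=2)

# Candidate segment boundaries: interior local minima of the smoothed curve.
minima = [i for i in range(1, len(smooth) - 1)
          if smooth[i] < smooth[i - 1] and smooth[i] < smooth[i + 1]]
print(minima)
```

Smoothing suppresses spurious single-point dips in the similarity curve, so only sustained drops in topical cohesion are treated as boundary candidates.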