Abstract:
This article presents a comprehensive review of short text clustering using state-of-the-art methods: Bidirectional Encoder Representations from Transformers (BERT), Term
Frequency-Inverse Document Frequency (TF-IDF), and the novel hybrid method Latent
Dirichlet Allocation + BERT + Autoencoder (LDA + BERT + AE). The article begins by
outlining the theoretical foundation of each technique, along with its merits and limitations. BERT
is assessed for its ability to capture word dependencies in text, while TF-IDF is noted
for its usefulness in assessing term importance. The experimental section compares
the efficacy of these methods in clustering short texts, with a specific focus on the hybrid
LDA + BERT + AE approach. A detailed examination of the LDA-BERT model’s training
and validation loss over 200 epochs shows that the loss values start above 1.2 and quickly
decrease to around 0.8 within the first 25 epochs, eventually stabilizing at approximately
0.4. The close alignment of the training and validation curves suggests that the model
learns effectively and generalizes well, with minimal overfitting. The study demonstrates that the
hybrid LDA + BERT + AE method significantly enhances text clustering quality compared
to the individual methods. Based on these findings, the study offers recommendations for
selecting and applying clustering methods to different kinds of short texts and natural
language processing tasks. Applications of these methods in industrial and educational settings,
where effective text handling and categorization are critical, are also addressed. The study ends
by emphasizing the importance of the holistic handling of short texts for deeper semantic
comprehension and effective information retrieval.