Evaluating Topic Models with OpenAI Embeddings: A Comparative Analysis on Variable-Length Texts Using Two Datasets

Wahbeh, Abdullah; Al-Ramahi, Mohammad; El-Gayar, Omar; Elnoshokaty, Ahmed; Nasralah, Tareq

Files

0154.pdf (402.07 KB)

Date

2025-01-07

Starting Page

1571

Abstract

Topic modeling is a crucial unsupervised machine learning technique for identifying themes within unstructured text. This study compares traditional topic modeling methods, such as Latent Dirichlet Allocation (LDA), against advanced embedding-based models, specifically BERTopic-OpenAI. The analysis utilizes two distinct datasets: user reviews from the mental health app Replika and the 20newsgroup dataset. For the Replika dataset, both methods identified common themes, but BERTopic-OpenAI uncovered additional nuanced topics, demonstrating its enhanced semantic capabilities. Quantitative evaluation on the 20newsgroup dataset further highlighted BERTopic-OpenAI's advantage: it achieved higher topic coherence and diversity than the best-performing LDA model. These results suggest that embedding-based models provide more coherent, interpretable, and diverse topics, making them valuable tools for extracting meaningful insights from extensive and variable-length text corpora. Future research should focus on refining these advanced techniques to improve their applicability and effectiveness in dynamic and varied textual environments.
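
The abstract reports topic diversity without defining it. As a rough illustration only, the sketch below uses one common definition: the fraction of unique words among the top-k words of all topics, where 1.0 means no topic shares a top word with another. Whether the paper uses this exact definition is an assumption; the function name `topic_diversity` and the example topics are hypothetical and do not come from the study.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    Assumption: each topic is a list of words already ranked by importance.
    """
    # Collect the top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    # Unique words divided by total words: 1.0 means no overlap between topics.
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented for illustration; not results from the paper.
example_topics = [
    ["space", "nasa", "orbit", "launch"],
    ["game", "team", "season", "launch"],
]
score = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {score:.3f}")
```

A lower score indicates that topics repeat each other's top words, which is the kind of redundancy a diversity comparison between an LDA model and an embedding-based model is meant to surface.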

Keywords

Natural Language Processing and Large Language Models Supporting Data Analytics for System Sciences, coherence, diversity, embeddings, interpretability, openai, topic models

URI

https://hdl.handle.net/10125/109030

Extent

10

Related To

Proceedings of the 58th Hawaii International Conference on System Sciences

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International

Collections

Natural Language Processing and Large Language Models Supporting Data Analytics for System Sciences
