

Topic Modeling on Arabic Language Dataset: Comparative Study
Topic modeling automatically infers the hidden themes in a collection of documents. Many techniques have been developed for topic modeling, and they are broadly categorized as algebraic, probabilistic, and neural. In this paper, we use an Arabic dataset to experiment with and compare six models (LDA, NMF, CTM, ETM, and two BERTopic variants). The comparison uses topic coherence, topic diversity, and computational cost as evaluation metrics. The results show that among all the presented models, the neural BERTopic model with a RoBERTa-based sentence transformer achieved the highest coherence score (0.1147), 36% above BERTopic with AraBERT (the second best in coherence). At the same time, its topic diversity is 6% lower than that of the CTM model (the second best in diversity), at the cost of doubling the computation time. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
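As a minimal sketch of one of the evaluation metrics mentioned above (not necessarily the paper's exact implementation), topic diversity is commonly defined as the fraction of unique words among the top-k words of all discovered topics: a score of 1.0 means no topic shares any top word with another, while low scores indicate redundant topics. The example topic lists below are illustrative, not from the paper's Arabic dataset.

```python
def topic_diversity(topics):
    """Fraction of unique words across the top-k word lists of all topics.

    `topics` is a list of lists, each holding the top-k words of one topic.
    Returns a value in (0, 1]; higher means more diverse (less redundant) topics.
    """
    all_words = [w for topic in topics for w in topic]
    if not all_words:
        raise ValueError("topics must contain at least one word")
    return len(set(all_words)) / len(all_words)


# Illustrative example: two topics sharing one of their top-2 words.
topics = [["economy", "market"], ["economy", "policy"]]
print(topic_diversity(topics))  # 3 unique words out of 4 -> 0.75
```

Coherence metrics (e.g. NPMI-based coherence, as commonly computed with libraries such as gensim or OCTIS) additionally require the reference corpus, so they are not sketched here.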