Modelos para análise de textos: um comparativo do número de tópicos (Models for text analysis: a comparison of the number of topics)

Publisher

Universidade Federal de São Carlos

Abstract

Text modeling has gained visibility and popularity in recent years because of the large and ever-increasing amount of information present in daily life and consumed in various ways. For these models to be efficient and applicable, the prior step of data preprocessing is essential, as it organizes and cleans the texts. One branch of text analysis is topic modeling, whose methods aim to uncover the topic structure underlying a document, grouping multiple documents by their dominant topics (subjects) and thereby simplifying the exploration of large volumes of textual data through the resulting dimensionality reduction. One of the pioneering methods in this context is the Mixture Model (MM), which assumes that each document is composed of words from a single topic. Given this limitation, Latent Dirichlet Allocation (LDA) has gained considerable attention for its greater flexibility, as it allows each document to exhibit multiple topics. In both methodologies, inference is generally carried out with a Bayesian approach. However, both MM and LDA require the user to fix the number of topics in advance. Performance metrics are therefore needed after fitting to help choose the best number of topics. In this work, in addition to contrasting the two text analysis methodologies, we compare the metrics that measure model quality and are used to choose the number of topics. To do so, we apply the models and selection metrics to two real data sets.
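
The contrast between the two models described in the abstract can be illustrated with a small generative sketch. This is not code from the dissertation: the function names, the number of topics K, the vocabulary size V, and the Dirichlet parameters below are hypothetical choices made only for illustration. Under the Mixture Model a single topic is drawn for the whole document, whereas under LDA each document first draws its own topic proportions and then one topic per word.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_mm_document(topic_weights, topic_word_probs, n_words, rng):
    """Mixture Model: one topic is drawn once and shared by every word."""
    topic = sample_categorical(topic_weights, rng)
    words = [sample_categorical(topic_word_probs[topic], rng) for _ in range(n_words)]
    return [topic] * n_words, words


def generate_lda_document(alpha, topic_word_probs, n_words, rng):
    """LDA: document-specific topic proportions, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)
    topics = [sample_categorical(theta, rng) for _ in range(n_words)]
    words = [sample_categorical(topic_word_probs[t], rng) for t in topics]
    return theta, topics, words


rng = random.Random(0)
K, V = 3, 6  # hypothetical number of topics and vocabulary size

# Hypothetical topic-word distributions, one per topic.
topic_word_probs = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]

mm_topics, mm_words = generate_mm_document([1.0 / K] * K, topic_word_probs, 20, rng)
theta, lda_topics, lda_words = generate_lda_document([0.8] * K, topic_word_probs, 20, rng)

print("MM topics used: ", sorted(set(mm_topics)))
print("LDA topics used:", sorted(set(lda_topics)))
print("LDA proportions:", [round(p, 2) for p in theta])
```

In both functions K is fixed before any document is generated, which mirrors the requirement noted in the abstract that the number of topics be chosen in advance. The sketch covers only the generative side; it does not perform inference or compute any selection metric.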

Citation

COELHO FILHO, Edvaldo Capobiango. Modelos para análise de textos: um comparativo do número de tópicos. 2024. Dissertation (Master's in Statistics) – Universidade Federal de São Carlos, São Carlos, 2024. Available at: https://repositorio.ufscar.br/handle/20.500.14289/20846.

Creative Commons License

Except where otherwise noted, this item is licensed under Attribution 3.0 Brazil