Models for text analysis: a comparison of the number of topics
Publisher
Universidade Federal de São Carlos
Abstract
Text modeling has gained considerable visibility in recent years, driven by the large and ever-growing volume of textual information consumed in daily life. For these models to be efficient and applicable, the prior step of data preprocessing is essential, since it organizes and cleans the texts. One branch of text analysis is topic modeling, whose methods seek to uncover the topic structure underlying a document, grouping multiple documents by their dominant topics (subjects) and thereby simplifying the exploration of large volumes of textual data through the resulting dimensionality reduction. One of the pioneering methods in this context is the Mixture Model (MM), which assumes that each document is composed of words from a single topic. Given this limitation, Latent Dirichlet Allocation (LDA) has gained prominence for its greater flexibility, as it allows each document to exhibit multiple topics. In both methodologies, inference is generally carried out with a Bayesian approach. However, both MM and LDA require the user to fix the number of topics in advance. Performance metrics are therefore needed after fitting, to guide the choice of the number of topics. In this work, in addition to contrasting the two text analysis methodologies, we compare the metrics that measure model quality and are used to choose the number of topics. To do so, we apply the models and selection metrics to two real data sets.
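The difference between the two generative assumptions described in the abstract can be illustrated with a minimal sketch. It only simulates documents; it does not perform the Bayesian inference or the topic-number selection studied in the dissertation. The function names, the toy vocabulary, and the probability values are hypothetical and chosen purely for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mm_document(topic_weights, topic_word_probs, vocab, n_words, rng):
    """Mixture Model: one topic is drawn per document and generates every word."""
    k = rng.choices(range(len(topic_weights)), weights=topic_weights)[0]
    words = rng.choices(vocab, weights=topic_word_probs[k], k=n_words)
    return [k] * n_words, words


def sample_lda_document(alpha, topic_word_probs, vocab, n_words, rng):
    """LDA: draw document-level topic proportions, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)
    topics = rng.choices(range(len(alpha)), weights=theta, k=n_words)
    words = [rng.choices(vocab, weights=topic_word_probs[k])[0] for k in topics]
    return topics, words, theta


# Hypothetical toy setup: two topics over a six-word vocabulary.
VOCAB = ["goal", "match", "team", "vote", "law", "senate"]
TOPIC_WORD_PROBS = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 uses only the first three words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # topic 1 uses only the last three words
]

rng = random.Random(0)
mm_topics, mm_words = sample_mm_document(
    [0.5, 0.5], TOPIC_WORD_PROBS, VOCAB, 8, rng
)
lda_topics, lda_words, theta = sample_lda_document(
    [0.5, 0.5], TOPIC_WORD_PROBS, VOCAB, 8, rng
)
```

In the MM sample every word shares the same topic label, whereas in the LDA sample each word receives its own label drawn from the document's proportions `theta`, so a single document may mix topics. In both cases the number of topics (here, two) is fixed before sampling, which is the requirement that motivates the selection metrics compared in the work.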
Citation
COELHO FILHO, Edvaldo Capobiango. Modelos para análise de textos: um comparativo do número de tópicos. 2024. Dissertation (Master's in Statistics) – Universidade Federal de São Carlos, São Carlos, 2024. Available at: https://repositorio.ufscar.br/handle/20.500.14289/20846.
Creative Commons License
Except where otherwise noted, this item's license is described as Attribution 3.0 Brazil
