Agrupamento de embeddings: análise exploratória de datasets textuais

Marchiori, Lucas Henrique

Files

TCC_Lucas_Marchiori.pdf (3.68 MB)

Date

2025-12-08

Authors

Marchiori, Lucas Henrique

Publisher

Universidade Federal de São Carlos

Abstract

Choosing an unsuitable clustering algorithm for word embeddings (vector representations that capture semantic relationships in multidimensional spaces) can significantly compromise the quality of semantic pattern discovery, producing clusters with low internal cohesion and poor separation. The absence of clear guidelines forces researchers into extensive trial and error, which consumes time and computational resources and can degrade performance in downstream Natural Language Processing tasks. In this context, this work investigates the performance of four clustering algorithms, K-Means, Self-Organizing Maps (SOM), HDBSCAN, and Agglomerative Hierarchical Clustering (AHC), applied to embeddings generated by the Word2Vec and SBERT models across ten datasets from distinct domains. Cluster quality was measured with quantitative metrics: the Silhouette Score, the Davies-Bouldin Index (DBI), the Density-Based Clustering Validation Index (DBCV), and the Adjusted Rand Index (ARI). The results show that no combination is universally superior; instead, systematic patterns emerge that allow the most suitable methodology to be predicted. The HDBSCAN + Word2Vec combination was dominant in 66% of cases and proved especially effective in domains with standardized vocabulary or structured informational redundancy, such as Reuters-21578 and Steam Games. Conversely, the HDBSCAN + SBERT combination was superior in creative and semantically heterogeneous domains, such as Spotify and Amazon Reviews, which require greater contextual understanding. The study concludes that the type of textual domain specialization, creative or technical, determines the optimal embedding. Additionally, HDBSCAN proved to be the most versatile algorithm, achieving the best performance in 72% of the analyzed scenarios, notably because of its robustness to noise and variable-density clusters, which are intrinsic characteristics of real-world textual data.
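The abstract names its evaluation metrics without showing how they behave. As a rough illustration only, the sketch below computes two of them, the Silhouette Score (an internal measure of cohesion and separation) and the Adjusted Rand Index (agreement with reference labels), using the Python standard library. It is not taken from the thesis: the function names, the toy 2-D points standing in for embeddings, and the label lists are all assumptions made for this example. In practice, library implementations such as scikit-learn's `silhouette_score` and `adjusted_rand_score` would normally be used.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    # Pairs placed together in both partitions (contingency-table cells)
    together_both = sum(math.comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / math.comb(n, 2)
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (together_both - expected) / (max_index - expected)


# Hypothetical toy data: two well-separated groups of 2-D "embeddings".
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
GOOD = [0, 0, 0, 1, 1, 1]  # labels that follow the geometry
BAD = [0, 1, 0, 1, 0, 1]   # labels that ignore the geometry

if __name__ == "__main__":
    for name, labels in (("good", GOOD), ("bad", BAD)):
        print(name, "silhouette:", round(silhouette_score(POINTS, labels), 3))
        print(name, "ARI vs. reference:", round(adjusted_rand_index(GOOD, labels), 3))
```

A labeling that follows the geometry scores close to 1 on both metrics, while the scrambled labeling scores much lower. The ARI is unaffected by how cluster IDs are numbered, which matters when comparing algorithm output against reference categories.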

Keywords

Clustering algorithms, Word embeddings, Natural language processing

Citation

MARCHIORI, Lucas Henrique. Agrupamento de embeddings: análise exploratória de datasets textuais. 2025. Trabalho de Conclusão de Curso (Graduação em Engenharia de Computação) – Universidade Federal de São Carlos, São Carlos, 2025. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/23572.

URI

https://hdl.handle.net/20.500.14289/23572

Collections

TCC

Creative Commons License

Except where otherwise noted, this item's license is described as Attribution 3.0 Brazil
