Organization of terms and documents using co-clustering and clustering of word embeddings
Abstract
A large number of text documents is available on the web, and it grows as more devices and users connect to the network. Analyzing and organizing these documents by characteristics such as subject and keywords is increasingly expensive, yet it is indispensable for tasks such as text mining and information retrieval, so ways to improve the performance of these tasks are widely investigated. Most current approaches to organizing documents, such as clustering, work along a single dimension: they group documents only, based on the occurrence of terms. An important aspect of document clustering, however, is finding topics that identify groups of documents by their content. Two-dimensional clustering strategies, which group documents and terms simultaneously, can be useful in this regard. The representation they use, however, is generally a high-dimensional, sparse matrix that carries no semantic information. This work presents a new approach to organizing documents that combines co-clustering with a representation of terms as embeddings. The terms of the document collection are clustered in advance, which reduces the sparsity and dimensionality of the matrix. In addition to the new representation, the proposed strategy contributes ways of assessing co-clustering results that explore the association between groups of documents and groups of terms. In co-clustering tasks, the results showed that the proposed representation surpasses the traditional TF-IDF representation in specific cases.
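As a rough illustration of how clustering terms in advance can shrink the matrix, the sketch below collapses the columns of a small document-term matrix into one column per term cluster. It is not the method of this work: the toy counts, the `term_cluster` labels (assumed to come from some clustering of word embeddings), the helper names `collapse_terms` and `sparsity`, and the choice to sum the weights within a cluster are all assumptions made for the example.

```python
def collapse_terms(doc_term, term_cluster, n_clusters):
    """Sum the term columns that share a cluster label (assumed aggregation)."""
    reduced = []
    for row in doc_term:
        new_row = [0.0] * n_clusters
        for term_idx, weight in enumerate(row):
            new_row[term_cluster[term_idx]] += weight
        reduced.append(new_row)
    return reduced


def sparsity(matrix):
    """Fraction of cells in the matrix that are exactly zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells


# Toy collection: 4 documents x 6 terms (hypothetical counts).
vocab = ["goal", "match", "striker", "stock", "market", "shares"]
doc_term = [
    [2, 0, 1, 0, 0, 0],
    [0, 3, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 2],
    [0, 0, 0, 0, 4, 0],
]

# Hypothetical cluster label for each term, e.g. from k-means on embeddings.
term_cluster = [0, 0, 0, 1, 1, 1]

# 4 documents x 2 term clusters instead of 4 documents x 6 terms.
reduced = collapse_terms(doc_term, term_cluster, n_clusters=2)
```

The reduced matrix has one column per term cluster, so a co-clustering algorithm applied to it would group documents against clusters of semantically related terms rather than against individual terms.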