Evaluation of network construction methods in semi-supervised text classification
Abstract
Due to the sheer amount of data produced daily in text format, whether publicly on social media or privately inside enterprises, there is a growing need to analyze it and extract information from it. The objective is to transform this data into useful tools such as translation systems and virtual assistants. The area of Natural Language Processing, together with Machine Learning, provides the necessary technologies for this objective. One of the most explored tasks in this context is text classification. Among the diverse approaches in this area, semi-supervised learning algorithms stand out, specifically transductive algorithms, which receive data in the form of networks as input and return labeled data. This strategy requires first constructing a network from the analyzed data, a task for which many algorithms can be used; these produce networks with different topological characteristics, which directly affect classification accuracy. In this context, the objective of this study is to analyze the influence of network-building algorithms on semi-supervised text classification. An empirical evaluation on real document collections was carried out. The results indicate that algorithms that generate non-regular networks have better overall performance. Furthermore, algorithms that allow the use of the cosine metric, which is more suitable for text-based data, performed better than those that do not. These methods are: k-NN, Epsilon, GBLP and Mk-NN.
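To make the network-construction step concrete, the following is a minimal sketch of how a k-NN network and a mutual k-NN (Mk-NN) network could be built from document vectors using cosine similarity. It is not the implementation used in the study: the function names, the toy term-frequency vectors and the choice of k = 1 are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def knn_neighbors(vectors, k):
    """For each document, the set of its k most cosine-similar documents."""
    n = len(vectors)
    neighbors = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_similarity(vectors[i], vectors[j]),
            reverse=True,
        )
        neighbors.append(set(ranked[:k]))
    return neighbors


def knn_edges(vectors, k):
    """Undirected k-NN network: i and j are linked if either selects the other."""
    nbrs = knn_neighbors(vectors, k)
    return {frozenset((i, j)) for i in range(len(vectors)) for j in nbrs[i]}


def mutual_knn_edges(vectors, k):
    """Undirected mutual k-NN network: i and j are linked only if both select each other."""
    nbrs = knn_neighbors(vectors, k)
    return {
        frozenset((i, j))
        for i in range(len(vectors))
        for j in nbrs[i]
        if i in nbrs[j]
    }


# Hypothetical toy collection: five documents over a four-term vocabulary.
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
]

knn = knn_edges(docs, k=1)
mknn = mutual_knn_edges(docs, k=1)
```

In this toy collection the mutual variant keeps only the pairs that select each other, so the fifth document, which is equally similar to all the others, ends up isolated, whereas the plain k-NN network still attaches it to one neighbor. A transductive classifier would then propagate the known labels over whichever network is chosen.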