Semi-supervised prediction tree for predicting protein subcellular localization
Abstract
Protein subcellular localization is an important classification task, because the location of a protein inside a cell is directly related to that protein's function. Since many proteins reside in two or more locations in a cell at the same time, or move between locations, supervised multi-label classification methods are usually designed to address this problem. This approach is well established in the literature; however, it has some disadvantages: (i) it needs a large number of labeled instances to train the classifier; (ii) it ignores the fact that unlabeled instances can provide valuable information for classification; and (iii) in many areas unlabeled data is abundant, but manually labeling an instance is too expensive and time-consuming. Semi-Supervised Learning (SSL) is a subfield of machine learning in which the learner tries to exploit both labeled and unlabeled data at the same time. Semi-Supervised Classification is a subcategory of SSL that uses the available unlabeled data to improve the classification performance of a process that already uses labeled data. The main goal of this project was to develop a semi-supervised multi-label classifier able to use the abundant unlabeled proteins to improve the prediction of protein subcellular localization. The SSL algorithm developed in this work is based on the predictive clustering tree framework. It was constructed, tested and analysed in many SSL scenarios to determine whether the classifier could use the unlabeled instances to help the classification process on a set of multi-label protein subcellular localization datasets from three different taxonomic groups: Viridiplantae, Virus and Fungi.