ESIREOS: Internal, efficient and scalable evaluation of unsupervised anomaly detection methods
Abstract
Anomaly (outlier) detection is one of the main problems in data mining. Since anomalies can translate into important information in numerous fields, several methods have been developed to identify them, especially unsupervised methods, which are the focus of this work. To address the lack of studies on assessing and quantifying the quality of the results of these unsupervised methods, the IREOS index was proposed as the first internal evaluation technique for unsupervised anomaly detection methods. IREOS allows one to select the best algorithm and parameters for a given problem using only intrinsic information from the data. However, IREOS requires training many highly complex classifiers for each object in the dataset whose outlier detection solutions are being analyzed. This requirement limits the application of IREOS to small datasets, since the classifiers use all points in the dataset during their training. In the present work, we propose ESIREOS, the first version of IREOS that addresses its performance and processing deficiencies, using massively parallel computing techniques that efficiently implement horizontal computational scaling for many machine learning problems. ESIREOS also makes use of approximate nearest neighbor graphs to reduce the volume of data and the processing power demanded by IREOS without any significant loss in the quality of the results. We evaluate ESIREOS both theoretically, by estimating its asymptotic complexity, and experimentally, on real and synthetic datasets, including large ones, to attest to its effectiveness and performance compared to the original version. The results show that ESIREOS significantly improves on the computational complexity of the original IREOS while maintaining the quality of the results. ESIREOS proved capable of evaluating solutions for very large datasets, including those that IREOS could not evaluate in a feasible time.
Therefore, this efficient and scalable new version can be used in many scenarios, particularly, though not exclusively, those involving large or distributed data.