UNIVERSIDADE FEDERAL DE SÃO CARLOS
CENTRO DE CIÊNCIAS EXATAS E DE TECNOLOGIA
PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO

LEARNING CONCEPT DRIFT WITH OPTIMUM-PATH FOREST

ADRIANA SAYURI IWASHITA

ADVISOR: PROF. DR. JOÃO PAULO PAPA

São Carlos – SP
December 2020

Thesis presented to the Graduate Program in Computer Science (Programa de Pós-Graduação em Ciência da Computação) of the Universidade Federal de São Carlos, in partial fulfillment of the requirements for the degree of Doctor in Computer Science. Concentration area: Image and signal processing. Advisor: Prof. Dr. João Paulo Papa.

Approval Sheet

Doctoral thesis defense of the candidate Adriana Sayuri Iwashita, held on 17/12/2020. Examination Committee: Prof. Dr. João Paulo Papa (UFSCar), Prof. Dr. Alexandre Luis Magalhães Levada (UFSCar), Prof. Dr. Diego Furtado Silva (UFSCar), Prof. Dr. Kelton Augusto Pontara da Costa (Fatec), Prof. Dr. Fabricio Aparecido Breve (UNESP).

This work was carried out with the support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The Defense Report signed by the members of the Examination Committee is on file with the Graduate Program in Computer Science.

To my parents.

ACKNOWLEDGMENTS

I thank God for being my support at every moment of life, giving me strength, courage, and health to face difficulties. I thank my parents, whom I love so much.
To my father, Kussuo Iwashita, who taught me that there is always a bright side to things and that even in difficult moments I should not let myself be discouraged; and to my mother, Regina Iwashita, who through her actions teaches me what good character is. I thank my advisor, Prof. João Paulo Papa, for all the support and encouragement he has always given me over all these years, for being a professor who is also a friend and a very kind-hearted person. Thank you for the patience in difficult moments, the availability in moments of doubt, and the guidance in moments of insecurity. I thank the professors and staff of DC/UFSCar - São Carlos, who directly or indirectly helped in carrying out this work. I also thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) for the financial support granted. I thank Bruno Kochimizu, who was always by my side supporting me in everything, at all times, through the ups and downs. I thank all the members of the RECOGNA group, especially my friends Rafael, Luis, Leandro, Clayton, and Douglas, for their companionship and willingness to help. Finally, I thank my friends, my family, and everyone who helped me and rooted for me. Thank you very much!

If I have seen further it is by standing on the shoulders of Giants.
Isaac Newton

RESUMO

Classification algorithms base their decisions on what they learn from a training set. Hence, the data to be classified in the test set must follow the same distribution as the training set in order to be correctly identified. Nowadays, industrial and business applications generate an enormous amount of streaming data, such as sensor network data and call records, among others. Moreover, with new technologies being developed for internet services, data streams arise from the most diverse domains, including online purchase transactions and web searches. These data streams exhibit characteristics that traditional data mining methods must now cope with: they are high-volume databases subject to concept drift, which refers to a learning problem that is non-stationary over time, i.e., the classifier for a given problem may no longer be useful after some time has passed, for being "outdated." This happens because a concept can change over time. For example, a reader may like news articles about "sports"; as time goes by, however, their reading preference may change to "economics" and the previous topic becomes irrelevant to them, i.e., the concept of a relevant article for this reader has changed. This research proposes the study of the Optimum-Path Forest (OPF) classifier in environments with concept drift, both in the supervised approach (using methods to handle concept drift such as data windows and decision committees) and in the unsupervised approach, and we conduct experiments on datasets found in the literature.

Keywords: OPF, Concept drift

ABSTRACT

Classification algorithms take their decisions according to a learning process on the training set. Therefore, the data to be classified in the test set must have the same distribution as the training set to be correctly identified. Nowadays, industrial and enterprise applications generate a huge amount of data streams, such as sensor network data and call records, among others. Also, with the new technologies being developed in internet services, data streams arise from diverse domains, including internet transactions and web searches.
These data streams present characteristics that traditional data mining methods now have to deal with: high-volume databases that are susceptible to concept drift, which refers to a non-stationary learning problem over time, i.e., the classifier of a certain problem may no longer be suitable as time goes by for being "outdated." This occurs because a concept may change over time. For example, a reader might like news articles on "sports"; but over time their reading preference may change to "economics" and the previous topic becomes irrelevant, i.e., the concept of an article relevant to this reader has changed. The present research proposes the study of the Optimum-Path Forest (OPF) classifier in dynamic environments, both in the supervised approach (using some methods to deal with concept drift, such as data windows and decision committees) and in the unsupervised approach, and we conducted experiments on datasets found in the literature.

Keywords: OPF, Concept drift

LIST OF FIGURES

2.1 Types of concept drift, extracted from (WADEWALE; DESAI, 2015)
2.2 Full-memory approach
2.3 No-memory approach
2.4 Window-of-size-3 approach
2.5 Percentage concerning different learning approaches used in concept drift
2.6 Percentage of drift detector mechanisms
2.7 Percentage of specific or any-classifier approach that can be used
2.8 Percentage of the classifiers used in the works considered in this survey
2.9 Concept drift handling
2.10 Online or batch approaches
2.11 Other issues addressed in the articles
2.12 Number of classes recognized by the methods
2.13 Datasets used in the articles
2.14 Methods compared in the articles
2.15 Articles separated by years
3.1 OPF training step
4.1 Experimental results over Hyperplane dataset stream
4.2 Experimental results over Usenet1 dataset stream
4.3 Experimental results over Usenet2 dataset stream
5.1 Experiment 1
5.2 Experiment 2 with the full-memory approach
5.3 Experiment 2 with the no-memory approach
5.4 Experiment 2 with the window-of-fixed-size three
5.5 Experiment 3
5.6 Covertype dataset with respect to Experiment 1
5.7 Electricity dataset with respect to Experiment 1
5.8 Hyperplane dataset with respect to Experiment 1
5.9 Poker dataset with respect to Experiment 1
5.10 SEA dataset with respect to Experiment 1
5.11 Usenet1 dataset with respect to Experiment 1
5.12 Usenet2 dataset with respect to Experiment 1
5.13 Covertype dataset with respect to Experiment 2 in full-memory management
5.14 Electricity dataset with respect to Experiment 2 in no-memory management
5.15 Hyperplane dataset with respect to Experiment 2 in window-3 management
5.16 Poker dataset with respect to Experiment 2 in full-memory management
5.17 SEA dataset with respect to Experiment 2 in no-memory management
5.18 Usenet1 dataset with respect to Experiment 2 in window-3 management
5.19 Usenet2 dataset with respect to Experiment 2 in full-memory management
5.20 Covertype dataset with respect to Experiment 3
5.21 Electricity dataset with respect to Experiment 3
5.22 Hyperplane dataset with respect to Experiment 3
5.23 Poker dataset with respect to Experiment 3
5.24 SEA dataset with respect to Experiment 3
5.25 Usenet1 dataset with respect to Experiment 3
5.26 Usenet2 dataset with respect to Experiment 3
5.27 Experimental results concerning Electricity dataset
5.28 Experimental results concerning Hyperplane dataset
5.29 Experimental results concerning SEA dataset
5.30 Experimental results concerning Usenet1 dataset
5.31 Experimental results concerning Usenet2 dataset
6.1 2-nearest neighbors graph
6.2 Density computation
6.3 Optimum-path forest generated after the competition process
6.4 New sample x′
6.5 Illustration of the situation described by Proposition 6.3.3
6.6 Experiment with Commercial dataset: recognition rates
6.7 Experiment with Commercial dataset: training time (s)
6.8 Experiment with Industrial dataset: recognition rates
6.9 Experiment with Industrial dataset: training time (s)
7.1 Experimental results in the first experiment: recognition rates
7.2 Experimental results in the second experiment: recognition rates
7.3 Experimental results in the third experiment: recognition rates
7.4 Experimental results in the fourth experiment: recognition rates
7.5 Experimental results in the first experiment: accuracy in drift detection
7.6 Experimental results in the second experiment: accuracy in drift detection
7.7 Experimental results in the third experiment: accuracy in drift detection
7.8 Experimental results in the fourth experiment: accuracy in drift detection
A.1 Covertype dataset with respect to Experiment 1
A.2 Hyperplane dataset with respect to Experiment 1
A.3 SEA dataset with respect to Experiment 1
A.4 Covertype dataset with respect to Experiment 2 in full-memory management
A.5 Hyperplane dataset with respect to Experiment 2 in no-memory management
A.6 SEA dataset with respect to Experiment 2 in window-3 management
A.7 Covertype dataset with respect to Experiment 3
A.8 Hyperplane dataset with respect to Experiment 3
A.9 SEA dataset with respect to Experiment 3

LIST OF TABLES

2.1 Drift detector mechanism
2.2 Classifiers
2.3 Concept drift handling
2.4 Learning approaches
5.1 Synthetic Datasets
5.2 Real-world Datasets
5.3 Covertype dataset with respect to Experiment 1
5.4 Electricity dataset with respect to Experiment 1
5.5 Hyperplane dataset with respect to Experiment 1
5.6 Poker Hand dataset with respect to Experiment 1
5.7 SEA dataset with respect to Experiment 1
5.8 Usenet1 dataset with respect to Experiment 1
5.9 Usenet2 dataset with respect to Experiment 1
5.10 Covertype dataset with respect to Experiment 2 in full-memory management
5.11 Electricity dataset with respect to Experiment 2 in no-memory management
5.12 Hyperplane dataset with respect to Experiment 2 in window-3 management
5.13 Poker dataset with respect to Experiment 2 in full-memory management
5.14 SEA dataset with respect to Experiment 2 in no-memory management
5.15 Usenet1 dataset with respect to Experiment 2 in window-3 management
5.16 Usenet2 dataset with respect to Experiment 2 in full-memory management
5.17 Covertype dataset with respect to Experiment 3
5.18 Electricity dataset with respect to Experiment 3
5.19 Hyperplane dataset with respect to Experiment 3
5.20 Poker dataset with respect to Experiment 3
5.21 SEA dataset with respect to Experiment 3
5.22 Usenet1 dataset with respect to Experiment 3
5.23 Usenet2 dataset with respect to Experiment 3
6.1 Dataset description and kmax percentages
6.2 Recognition rates using the configuration 50-50
6.3 Training time (s) using the configuration 90-10
A.1 Covertype dataset with respect to Experiment 1
A.2 Hyperplane dataset with respect to Experiment 1
A.3 SEA dataset with respect to Experiment 1
A.4 Covertype dataset with respect to Experiment 2 in full-memory management
A.5 Hyperplane dataset with respect to Experiment 2 in no-memory management
A.6 SEA dataset with respect to Experiment 2 in window-3 management
A.7 Covertype dataset with respect to Experiment 3
A.8 Hyperplane dataset with respect to Experiment 3
A.9 SEA dataset with respect to Experiment 3

CONTENTS

CHAPTER 1 – INTRODUCTION
1.1 Hypotheses
1.2 Objectives

CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction
2.2 Concept Drift
2.2.1 Types of Algorithms
2.3 Literature Review
2.3.1 Datasets used in the literature
2.3.1.1 Synthetic Datasets
2.3.1.2 Real-world Datasets
2.4 Discussion
2.5 Conclusions

CHAPTER 3 – OPTIMUM PATH FOREST
3.1 Supervised Classification
3.1.1 Complete Graph
3.1.2 K-nn Graph
3.2 Unsupervised Classification
3.3 Accuracy Measure

CHAPTER 4 – LEARNING CONCEPT DRIFT WITH OPTIMUM-PATH FOREST
4.1 Introduction
4.2 Experiments
4.3 Partial Considerations

CHAPTER 5 – LEARNING CONCEPT DRIFT WITH ENSEMBLES OF OPTIMUM-PATH FOREST-BASED CLASSIFIERS
5.1 Introduction
5.2 Proposed Approach
5.3 Experimental Results
5.3.1 Experiment 1
5.3.2 Experiment 2
5.3.3 Experiment 3
5.3.4 Discussion
5.4 Partial Considerations
5.5 OPF with dynamic window

CHAPTER 6 – INCREMENTAL OPFKNN
6.1 Introduction
6.2 Theoretical Background
6.2.1 Training Step
6.2.2 Classification Step
6.3 Proposed Incremental Learning Approach
6.4 Methodology
6.5 Experiments
6.5.1 Non-Technical Losses Datasets
6.5.2 General-Purpose Datasets
6.5.3 Discussion
6.6 Partial Considerations

CHAPTER 7 – UNSUPERVISED OPF

CHAPTER 8 – CONCLUSIONS

REFERENCES

GLOSSARY

APPENDIX A – LEARNING CONCEPT DRIFT WITH ENSEMBLES OF OPTIMUM-PATH FOREST-BASED CLASSIFIERS

Chapter 1

INTRODUCTION

"Concept drift" refers to a non-stationary learning problem over time (ZLIOBAITE, 2010), unlike traditional learning techniques, where the static concept learned can be used to classify future occurrences indefinitely. In the real world, training and application data can become incompatible, since their behavior may change over time. The distribution of sequential data can also change, rendering the model built on previous data inconsistent with the new one. Thus, a periodic update of the model becomes necessary (TSYMBAL, 2004). For example, personal preferences for products may change depending on the economy, the day of the week, and inflation, among others.
When the conditions of an industry change, the validation process of a product also varies. Commonly, the cause of a change is not straightforward, leaving it to be inferred by the classifiers. Algorithms to deal with concept drift must be able to identify a change in the target concept without explicit knowledge of the change in the underlying distribution. If the concept changes over time, the classification problem becomes harder. Learning must continue as new instances arrive so that the concept drift can be traced (ZLIOBAITE, 2010).

Several works about concept drift have been proposed in the literature. Widmer and Kubat (WIDMER; KUBAT, 1996), for example, proposed a framework (FLOating Rough Approximation - FLORA) that consists of a family of algorithms with the following characteristics: FLORA1 is the simplest one, with the hypothesis represented as three sets of descriptors that summarize both positive and negative information; FLORA2 uses a flexible "forgetting" operator controlled by a time window over the input data; FLORA3 can deal with environments where a concept can reappear; while FLORA4 is concerned with noise. Klinkenberg and Joachims (KLINKENBERG; JOACHIMS, 2000) proposed a supervised method to recognize and treat concept drift using Support Vector Machines (SVMs), whose main idea is to automatically adjust the size of the training time window to minimize the estimated generalization error. A large window provides lots of data, usually yielding good generalization in environments where the concept does not change. However, it may contain old data that are no longer relevant in changing environments. Stanley (STANLEY, 2003) employs weighted voting instead of the window method, in an approach named Concept Drift Committee (CDC). In this method, there is a committee composed of members whose votes are weighted according to their recent history. When a committee member's performance is considerably affected, a new member replaces it.
Each committee member uses a decision tree as its concept learning algorithm, and the author notes that any other supervised learning algorithm could be used instead. Harel et al. (HAREL et al., 2014) proposed a model to detect or predict concept drift by resampling, which investigates and identifies changes that occur along the data stream in a way that is more robust to noise and invariant to the learning problem, assuming temporal independence of the samples.

Masud et al. (MASUD et al., 2011b) proposed a work to deal with concept evolution in concept drift environments, i.e., the appearance of a new class in the data. The authors used k-nearest neighbors and decision trees as classification algorithms and declared a new-class appearance when none of the classification models could classify an instance, named an F-outlier. They also presented a coefficient to measure the cohesion and separation of F-outliers: when this coefficient has a positive value for a number q of F-outliers, it indicates that these form a new class. Later on, Masud et al. (MASUD et al., 2011a) proposed a semi-supervised technique to deal with recurrent classes, i.e., instances that appear in the dataset and later reappear. The algorithm groups the training data using k-means, builds a decision boundary, and stores the summary of each cluster, comprising information such as centroid and radius. When an instance in the input data is considered an outlier (i.e., the instance lies outside the decision boundary), it is called a P-outlier (primary outlier). It is then verified whether this instance is an outlier with respect to all previous models. In the negative case, it is classified as an instance of a recurring class. In the positive case (S-outlier - secondary outlier), the instance is stored in a repository that is analyzed after a predefined number of instances have been stored. Using a neighborhood coefficient, the method can declare that these instances compose a new class.
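The committee mechanism described for CDC above — votes weighted by each member's recent history, with replacement of poorly performing members — can be sketched as follows. This is an illustrative sketch, not Stanley's implementation: the history length, vote weighting, and replacement threshold are assumptions, and the member models are arbitrary callables rather than decision trees.

```python
class CommitteeMember:
    """A learner wrapper that tracks its recent prediction history."""
    def __init__(self, model, history_size=50):
        self.model = model
        self.history = []          # 1 = correct, 0 = wrong, most recent last
        self.history_size = history_size

    def record(self, correct):
        self.history.append(1 if correct else 0)
        self.history = self.history[-self.history_size:]

    def weight(self):
        # Vote weight proportional to recent accuracy (0.5 with no history yet).
        return sum(self.history) / len(self.history) if self.history else 0.5

def committee_predict(members, x):
    # Weighted vote over the members' predictions.
    scores = {}
    for m in members:
        label = m.model(x)
        scores[label] = scores.get(label, 0.0) + m.weight()
    return max(scores, key=scores.get)

def committee_update(members, x, y, new_member_factory, drop_threshold=0.3):
    # After the true label y arrives, update histories and replace weak members.
    for m in members:
        m.record(m.model(x) == y)
    for i, m in enumerate(members):
        if len(m.history) == m.history_size and m.weight() < drop_threshold:
            members[i] = new_member_factory()   # retire and replace the member
```

Any base learner that exposes a prediction callable fits this scheme, which is what allows the decision trees used in CDC to be swapped for another supervised algorithm.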
Klinkenberg (KLINKENBERG, 2001) further proposed a method for recognizing and dealing with concept drift using SVMs and unlabeled data to reduce the need for labeled data, which is an extension of Klinkenberg and Joachims (KLINKENBERG; JOACHIMS, 2000).

Other authors proposed different alternatives to deal with concept drift. Liu et al. (LIU; LU; ZHANG, 2020) presented a cluster-based histogram named equal intensity k-means space partitioning (EI-kMeans) with a heuristic method to improve the sensitivity of drift detection. The authors used Pearson's chi-square test as the statistical hypothesis test and implemented a greedy centroid initialization algorithm, a cluster amplify-shrink algorithm, and a drift detection algorithm. Heusinger et al. (HEUSINGER; RAAB; SCHLEIF, 2020) proposed adaptive versions of Robust Soft Learning Vector Quantization (RSLVQ) and Generalized Learning Vector Quantization (GLVQ) using Adadelta and Adamax and applying momentum-based stochastic gradient descent techniques to tackle concept drift passively. Sethi and Kantardzic (SETHI; KANTARDZIC, 2017) proposed an unsupervised and incremental algorithm named Margin Density Drift Detection (MD3), which tracks the number of instances in the uncertainty region of a classifier (its margin) as a drift detection metric. Gao et al. (GAO et al., 2020) proposed a graph-based clustering method to identify instances from novel classes in a semi-supervised manner, named Semi-supervised Adaptive ClassifiCation Over data Stream (SACCOS). They perform an online normalization on samples along the stream to unify their scale and use an ensemble of clusters to detect novel classes. The supervised training phase obtains a classifier C from the normalized data; then a set of clusters is generated and added to the initial cluster ensemble M (the ensemble size is finite, with a maximum of TM sets of clusters).
For every new data instance, the method saves it in a temporary data buffer S, whose maximum size is given by C. Whenever S is full, SACCOS updates the normalization parameters, normalizes the instances in S, sends them to other modules for further processing, and clears S. The clustering method aims to discover significant clusters of correlated samples in the feature space; the authors propose a Mutual Graph Clustering (MGC) algorithm based on the mutual kNN graph for identifying such clusters.

These techniques aim at solving the classification problem when the trained model no longer handles current data well because a concept drift has occurred over time. Examples of environments where such changes can occur include: the personal interest of a user in newsgroup subjects (e.g., medicine, space, baseball); defining forest cover types in a region; recognizing whether the price of electricity will increase or decrease according to the demand and supply of products; and cases of non-technical losses (also known as commercial losses), which stand primarily for energy theft in power distribution systems, but are not limited to it; among others. Dealing with this type of situation is an important task.

Papa et al. (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012) proposed a methodology for pattern classification based on the Optimum-Path Forest (OPF) classifiers. OPF-based classifiers reduce the pattern recognition problem to a graph partitioning into optimum-path trees, which are rooted at prototype samples, and an element belonging to a given tree is more strongly connected to its root than to any other. Such connectivity strength is established by a path-cost function. Currently, two variants have been proposed: (i) the unsupervised version (ROCHA; CAPPABIANCO; FALCÃO, 2009) and (ii) the supervised variant, which is divided into the OPF with complete graph (PAPA; FALCÃO; SUZUKI, 2009) and the OPF with k-nearest neighbors graph (PAPA; FALCÃO, 2008).
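The competition process of the supervised OPF with complete graph can be sketched as a Dijkstra-like propagation of minimax (f_max) path costs from the prototypes, followed by classification of a new sample through the cheapest path extension. This is an illustrative sketch only: it assumes Euclidean arc weights and takes the prototype set as given, whereas the full method selects prototypes from the minimum spanning tree.

```python
import heapq
import math

def opf_train(samples, labels, prototypes):
    """Propagate optimum-path costs under f_max from the prototype set.

    samples: list of feature tuples; prototypes: indices of prototype samples.
    Returns (cost, pred) where cost[i] is the minimax path cost from the best
    prototype and pred[i] is the propagated (predicted) label of sample i.
    """
    n = len(samples)
    cost = [math.inf] * n
    pred = [None] * n
    heap = []
    for p in prototypes:
        cost[p] = 0.0
        pred[p] = labels[p]
        heapq.heappush(heap, (0.0, p))
    while heap:
        c, u = heapq.heappop(heap)
        if c > cost[u]:
            continue                      # stale heap entry
        for v in range(n):                # complete graph over the training set
            if v == u:
                continue
            # f_max: the cost of a path is its largest arc weight.
            new_cost = max(c, math.dist(samples[u], samples[v]))
            if new_cost < cost[v]:
                cost[v] = new_cost
                pred[v] = pred[u]         # label propagated from the conquering tree
                heapq.heappush(heap, (new_cost, v))
    return cost, pred

def opf_classify(x, samples, cost, pred):
    # The test sample receives the label of the training node offering the
    # cheapest path extension: min over s of max(cost[s], d(s, x)).
    best = min(range(len(samples)),
               key=lambda s: max(cost[s], math.dist(samples[s], x)))
    return pred[best]
```

The sketch reflects the idea stated above: a sample belongs to the tree whose root it is most strongly connected to, with connectivity measured by the path-cost function.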
Also, an optimization of the OPF training algorithm was proposed by Iwashita et al. (IWASHITA et al., 2014) based on the theoretical relationship between minimum spanning forests and optimum-path forests. As optimum paths follow the shape of the Minimum Spanning Tree (MST), an optimum-path forest can be built by removing arcs between prototypes, redirecting edges, and propagating labels and path costs using the fmax function, thus building the optimum-path forest at the same time as the MST is computed.

1.1 Hypotheses

To the best of our knowledge, OPF-based classifiers have never been evaluated in concept drift problems or in situations with non-stationary environments, and the behavior of the OPF classifier under such conditions is unknown. Thus, the hypotheses of this Ph.D. thesis are: i) OPF can be efficient in handling non-stationary environments; ii) the OPF classifier can be robust enough to address the concept drift problem; and iii) we can provide an effective OPF implementation to handle dynamic environments.

1.2 Objectives

In this context, we propose to apply supervised and unsupervised approaches based on the Optimum-Path Forest algorithm in dynamic environments to deal with concept drift. Also, as one of the main contributions, a study on incremental learning concerning the supervised OPF is conducted. The next chapters present a literature review, the theoretical background on concept drift and the OPF classifier, as well as the conducted experiments:

• Chapter 2 presents a bibliographical review of concept drift proposed by Iwashita and Papa.

• Chapter 3 presents the Optimum-Path Forest classifier in its supervised versions (with complete graph and k-nn graph), its unsupervised version, and the accuracy measure employed in the experiments.

• Chapter 4 presents an initial evaluation of the OPF classifier in a concept drift environment.
• Chapter 5 evaluates an ensemble of OPF classifiers in the concept drift problem and also assesses the OPF classifier with an accuracy-based drift detection method.

• Chapter 6 presents an OPFknn approach with incremental learning to add new instances to the optimum-path forest without re-training on the entire database (Incremental OPFknn). This chapter also presents the performed experiments.

• Chapter 7 presents some preliminary results concerning drift detection with unsupervised OPF.

• Chapter 8 presents conclusions and future works.

Appendix A extends some experiments presented in Chapter 5.

Chapter 2

LITERATURE REVIEW

This chapter presents a literature review of techniques to deal with concept drift, presented by Iwashita and Papa (IWASHITA; PAPA, 2019).

2.1 Introduction

Since concept drift is an important area that has gained attention in recent years, this work presents a compilation of several works to foster research in this area of knowledge. The main contribution of this survey is to study different techniques to detect and deal with concept drift, as well as to study public synthetic and real datasets in this area. The remainder of this work is organized as follows. Section 2.2 presents the main theoretical background regarding concept drift and the types of algorithms to deal with it (Subsection 2.2.1), and Section 2.3 covers techniques in the literature that handle the concept drift issue. Section 2.4 presents a summary of the articles studied, and Section 2.5 states conclusions and future directions of the area.

2.2 Concept Drift

Concept drift happens when the target concept changes in a non-stationary environment. Let C1 and C2 be two target concepts, and let I = {i1, i2, . . . , in} be an ordered instance sequence. Instances prior to i_d follow a stable concept C1 that does not change. After ∆x instances, the concept stabilizes once more, but at another target concept C2.
The concept between instances i_{d+1} and i_{d+∆x} is drifting between C1 and C2 (STANLEY, 2003). According to the length of ∆x, the drift can be called gradual or abrupt: in gradual drift, the two concepts swap slowly, whereas in abrupt drift the change occurs suddenly.

Wadewale and Desai (WADEWALE; DESAI, 2015) classify the variations of the target concept into sudden, incremental, gradual, recurring, blip, and noise drifts. Fig. 2.1 illustrates these different types of drift. Fig. 2.1a shows a sudden drift, in which the data change instantly and without alternation. In incremental and gradual drifts, the changes occur slowly. Incremental drift (Fig. 2.1b) happens when the data values gradually change over time, whereas gradual drift (Fig. 2.1c) also includes changes in the class distribution. Recurring drift (Fig. 2.1d) occurs when instances of a concept temporarily disappear and return after a while. Fig. 2.1e shows a rare event, which in a static distribution can be considered an outlier, and Fig. 2.1f shows random changes in instances (noise) that have to be filtered out (WADEWALE; DESAI, 2015). Kuncheva (KUNCHEVA, 2008) notes that blip events represent a "rare event" and that finding one in streaming data can indicate the beginning of a concept drift; hence, methods for online detection of rare events can be a component of the novelty detection paradigm. Some authors point out that there is no universal definition of "outlier", and alternative terminologies permeate the literature, e.g., novelty detection, anomaly detection, noise detection, deviation detection, or exception mining (KUNCHEVA, 2008).

Figure 2.1: Types of concept drift, extracted from (WADEWALE; DESAI, 2015): (a) sudden drift, (b) incremental drift, (c) gradual drift, (d) recurring drift, (e) outlier, and (f) noise.

Some authors (GAMA et al., 2014) define the drifts in two types: real concept drift and virtual drift.
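Before turning to the real/virtual distinction, the gradual/abrupt distinction above can be made concrete with a tiny synthetic stream whose labeling rule switches from one concept to another: a transition width of one instance yields an abrupt drift, while a larger width yields a gradual drift. This is only an illustrative sketch; the function, parameter names, and the two threshold concepts are ours, not taken from the surveyed works.

```python
import random

def drift_stream(n=1000, d_start=400, width=1, seed=0):
    """Generate a 1-D binary stream whose labeling concept changes.

    Before instance d_start the label follows concept C1 (x > 0.3).
    `width` controls the transition length: width=1 gives an abrupt
    drift; a larger width gives a gradual drift, where the probability
    of drawing from concept C2 (x > 0.7) grows linearly over the
    transition region.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        # Probability that instance i is governed by concept C2.
        p_c2 = min(max((i - d_start) / width, 0.0), 1.0)
        x = rng.random()
        if rng.random() < p_c2:
            y.append(int(x > 0.7))   # concept C2
        else:
            y.append(int(x > 0.3))   # concept C1
        X.append(x)
    return X, y
```

With `width=1` every instance after `d_start` is already labeled by C2; setting, e.g., `width=200` mixes the two concepts during the transition, mimicking the slow swap of a gradual drift.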
Considering concept drift as changes in data distribution, real concept drift occurs when the conditional distribution of the output changes while the input distribution remains unchanged. Virtual drift has different interpretations in the literature, such as changes in the distribution of the incoming data, among others (GAMA et al., 2014).

2.2.1 Types of Algorithms

The three most usual ways to handle concept drift are the following (FARID et al., 2013): (a) instance selection or window-based approaches, (b) weight-based approaches, and (c) ensembles of classifiers.

Instance selection or window-based approaches select an appropriate set of previous data to train a classifier (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010), choosing instances inside a fixed or dynamic sliding window (FARID et al., 2013). These approaches assume older examples are incompatible with the classification of new data, so old instances considered useless must be forgotten to handle concept drift (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010). Therefore, training employs the last batch of information, i.e., the most recent training instances. The fixed-size window is the simplest rule, with the window size usually determined by the user. When information on the time-scale of change is available, a fixed-size window is a good choice. Yet, the user is often caught in a trade-off: choosing a small window size (which reflects the current distribution with fast adaptivity) or a large window size (having more instances in periods of stability with no concept drift may increase accuracy and yield better generalization) (BIFET; GAVALDÀ, 2007). The adaptive window approach adjusts the window size to the length of the drift (KLINKENBERG; JOACHIMS, 2000). It commonly keeps the examples up until the concept drift; in this case, it is not necessary to choose an a priori unknown parameter (BIFET; GAVALDÀ, 2007).
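The fixed-versus-adaptive trade-off above can be illustrated with a toy adaptive window that grows while the concept is stable and shrinks when a drift is signalled, so no window size has to be chosen a priori. This only sketches the idea; real methods such as ADWIN decide where to cut statistically, and all names here are ours.

```python
class AdaptiveWindowSketch:
    """Toy adaptive window: grows in stable periods, shrinks on drift."""

    def __init__(self, min_size=1):
        self.items = []
        self.min_size = min_size

    def add(self, item, drift_signalled=False):
        """Store a new instance; on a drift signal, drop the older half."""
        self.items.append(item)
        if drift_signalled:
            # Keep only the most recent half (at least min_size items),
            # so the window reflects the new concept quickly.
            keep = max(self.min_size, len(self.items) // 2)
            self.items = self.items[-keep:]
        return len(self.items)
```

In stable periods the window accumulates instances (better generalization); a drift signal halves it, trading history for fast adaptivity.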
Other approaches keep some of the examples in memory but select a training set from them for classification (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010).

A weight-based approach weights instances and deletes outdated training instances based on their weights (FARID et al., 2013). As time passes, old information is considered increasingly irrelevant: although all instances take part in learning, newer instances have more relevance. Within this framework, an updateable classifier capable of weighted learning must be chosen (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010).

An ensemble of classifiers combines several outputs from different learners to define a final classification (FARID et al., 2013). With a dynamic set of classifiers, performance (or another metric) is observed; if performance decreases, new classifiers replace the aged, poor-performing ones in the ensemble. The outputs from the learners are combined to classify instances in the classification phase, commonly with a weighted-vote mechanism (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010). Benefits of weighted ensembles have been studied empirically and theoretically (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010).

In window-based approaches, data streams can be divided into batches, so the learner continuously receives batches of data. Some examples are described below (KLINKENBERG; JOACHIMS, 2000):

• Full-memory: the learner employs all instances (batches) seen so far, i.e., it cannot "forget" old data (Figure 2.2).

• No-memory: the learner employs a single batch for training, i.e., the most recent one of the stream (Figure 2.3).

• Window of fixed size n: the learner employs the n most recent batches for training, e.g., a sliding window of size n = 3 (Figure 2.4).
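The three data-management schemes above can be sketched as a single batch-selection routine; the function, policy strings, and parameter names are illustrative, not from Klinkenberg and Joachims.

```python
def select_training_batches(batches, policy="fixed", n=3):
    """Return the batches a learner would train on at the current step.

    `batches` is the ordered list of batches received so far, the last
    element being the most recent one.
    """
    if policy == "full-memory":
        return batches          # all batches seen so far
    if policy == "no-memory":
        return batches[-1:]     # only the most recent batch
    if policy == "fixed":
        return batches[-n:]     # sliding window of the last n batches
    raise ValueError(f"unknown policy: {policy}")
```

At each new batch the learner is simply retrained on the returned subset, which is what makes these schemes easy to combine with any base classifier.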
Figure 2.2: Full-memory approach.

Figure 2.3: No-memory approach.

Figure 2.4: Window-of-size-3 approach.

Ditzler and Polikar (DITZLER; POLIKAR, 2013) characterized various forms of concept drift handling algorithms:

• Online or Batch approaches: determined by the number of instances considered in the training phase;

• Single classifier or Ensemble-based approaches: determined by the number of learners employed to decide a classification;

• Incremental or Non-incremental approaches: determined by whether data are reused or not; and

• Active or Passive approaches: determined by the use (or not) of a drift detection mechanism.

After receiving an instance, an online approach updates the classifier, whereas a batch approach waits to receive plenty of instances before it starts learning (TSYMBAL, 2004). Single classifiers use one learner to decide in the classification phase, whereas ensemble learning combines the results of a set of concept learners with a single or weighted vote, or by selecting the most significant result (TSYMBAL, 2004). Incremental learning behaves like online learning, with the model updated as instances arrive, whereas non-incremental learning reuses data in the learning phase.
Active drift detection observes the stream to search for changes and determine whether and when a drift occurs: after a drift, it warns the learner to take the correct action. Passive drift detection considers that drift may occur constantly or occasionally, and therefore continually updates the learner as data arrive (DITZLER; POLIKAR, 2013).

Kuncheva (KUNCHEVA, 2008) considered that change detection from unlabeled data in classification problems could be useful in three situations: i) sensitivity to error, ii) missed opportunities, and iii) label delay. Regarding sensitivity to error, detecting a change in the distribution allows anticipating a change in the error; regarding missed opportunities, if a change is not detected, we miss the opportunity to improve the classifier by training a new one with a smaller error. Label delay, or label latency, occurs when the labels arrive with a considerable delay and are not available at the time of classification, e.g., applications where the true class label becomes known after some time (an "intermediate latency"), after very little time ("zero latency"), or not at all, as in unlabeled data where it is expensive or impossible to obtain the class labels ("infinite latency").

2.3 Literature Review

In 1986, Schlimmer and Granger (SCHLIMMER; GRANGER, 1986) proposed the first technique to handle concept drift, a supervised learning method named STAGGER (KOLTER; MALOOF, 2007), which trains Boolean characterizations weighted by Bayesian statistics. It keeps a set of concept descriptions (produced using feature construction and features from the method itself) and selects those most relevant to the present data (TSYMBAL, 2004). It starts with its own features and trains new characterizations by middle-out beam search through the space of possible conjunctive, disjunctive, and negated characterizations.
STAGGER is a binary classification method that detects concept drift using backtracking and Bayesian weighting measures to discriminate concept drift from noise (SCHLIMMER; GRANGER, 1986).

Salganicoff (SALGANICOFF, 1993), in 1993, proposed the DARLING method (Density-Adaptive Reinforcement Learning), a supervised density-adaptive forgetting technique that uses exponential weight decay based on a nearest-neighbor criterion to exclude instances considered obsolete. The method excludes an example if its weight drops below a threshold, and new examples take its place. In the space of attributes, the number of samples per unit volume is the local "density". DARLING builds a kd-tree structure for the nearest-neighbor forgetting method in a binary classification setting (SALGANICOFF, 1993).

Kubat and Widmer (KUBAT; WIDMER, 1995) use Radial Basis Functions (RBF) to learn in non-stationary numeric domains. The FRANN algorithm (Floating Rough Approximation in Neural Networks) uses hill-climbing to search for the best "sliding window" and builds an RBF network from it. This dynamic sliding window uses heuristics to delete older instances and to determine its window size (KUBAT; WIDMER, 1995).

Widmer and Kubat (WIDMER; KUBAT, 1996) proposed a supervised framework named FLORA (FLOating Rough Approximation), which consists of a family of algorithms where each version extends the previous one. In the FLORA framework, three description sets represent a concept: the Accepted DEScriptors (ADES), used to classify new samples, represent the present (positive) hypothesis; the Negative DEScriptors (NDES) represent negative samples and are employed to avoid over-generalization of ADES; and the Potential DEScriptors (PDES), a set that might become pertinent in the future, are used as a reservoir of general hypotheses. FLORA keeps the relevant description items of the most recent instances in the window.
FLORA2 has a "forgetting" operator, which dynamically adjusts the window size in the training phase. Two heuristics are used to detect drift: the predictive performance of old classifications, and some syntactic properties of the hypotheses. If performance drops significantly or a substantial difference appears in the number of items in ADES, the method signals a warning of concept drift. If drift occurs, the window size decreases and the algorithm "forgets" old instances; otherwise, the window size enlarges to produce a stable concept. FLORA3 can deal with recurring contexts, storing old concepts for later reuse. When a stable concept is reached, it stores the current hypothesis; when the concept drifts, the algorithm verifies whether any old descriptor can describe the present instances. FLORA4 was proposed to deal with noise: each description item has a classification record, and statistical confidence intervals are built around these measures (WIDMER; KUBAT, 1996).

Klinkenberg and Joachims (KLINKENBERG; JOACHIMS, 2000) use Support Vector Machines to deal with concept drift. The method keeps a window of examples, trying to minimize its generalization error by discarding irrelevant data. The authors used the full-memory, no-memory, and fixed- and adaptive-size window data management approaches (KLINKENBERG; JOACHIMS, 2000).

Street and Kim (STREET; KIM, 2001) proposed the SEA algorithm (Streaming Ensemble Algorithm), an ensemble of decision trees, each one created from one batch. It uses an unweighted majority vote, similar to bagging, to classify an instance. The ensemble has a maximum number of classifiers, and once this number is reached, new classifiers that meet certain criteria replace old ones. Performance estimates are computed on the next batch using the new tree, and the trees are built with Quinlan's C4.5 (STREET; KIM, 2001).
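The SEA-style scheme above — one classifier per batch, a capped ensemble, and an unweighted majority vote — can be sketched as follows. The toy constant-predictor base learner and the use of plain accuracy on the newest batch as the replacement criterion are our simplifications, not SEA's exact evaluation heuristic (which uses C4.5 trees and a quality measure computed on the next batch).

```python
class ConstantSketch:
    """Toy base learner: predicts the majority label of its training batch."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
        return self
    def predict(self, x):
        return self.label

class SEASketch:
    """Capped batch ensemble with an unweighted majority vote."""
    def __init__(self, max_size=25):
        self.max_size = max_size
        self.members = []

    def _accuracy(self, clf, X, y):
        return sum(1 for xi, yi in zip(X, y) if clf.predict(xi) == yi) / len(y)

    def partial_fit(self, X, y, new_clf):
        """Train one new classifier on the batch; replace the weakest
        member if the cap is reached and the new one outperforms it."""
        new_clf.fit(X, y)
        if len(self.members) < self.max_size:
            self.members.append(new_clf)
        else:
            scores = [self._accuracy(m, X, y) for m in self.members]
            worst = scores.index(min(scores))
            if self._accuracy(new_clf, X, y) > scores[worst]:
                self.members[worst] = new_clf

    def predict(self, x):
        votes = [m.predict(x) for m in self.members]
        return max(set(votes), key=votes.count)   # unweighted majority
```

Because members trained on outdated batches score poorly on the newest batch, they are the first to be replaced, which is how the ensemble tracks a drifting concept.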
Stanley (STANLEY, 2003) proposed the CDC (Concept Drift Committee) algorithm, a supervised method that employs a weighted committee of hypotheses. All committee members can access all features, and when an older committee member's voting value falls below a threshold, a new member replaces it. Each member gives a (weighted) vote and keeps a hypothesis based on the instances seen in its lifetime, so each member has considered a different number of instances. An implicit window considers only the latest examples, and each committee member uses a decision tree (although, according to the author, any supervised learning algorithm can be used (STANLEY, 2003)).

Scholz and Klinkenberg (SCHOLZ; KLINKENBERG, 2005) proposed the KBS (Knowledge-Based Sampling) algorithm, a boosting-like ensemble method. This supervised algorithm employs KBS and an SVM with a linear kernel, or a Decision Tree, in a binary classification setting ("relevant" or "irrelevant" class). It considers the latest batch of instances, inducing and reweighting base models continuously (SCHOLZ; KLINKENBERG, 2005).

Bifet and Gavaldà (BIFET; GAVALDÀ, 2007) presented the ADWIN2 (ADaptive WINdowing) algorithm, an improved version of the ADWIN algorithm. ADWIN2 has a variable-sized window: it grows or shrinks when no change or a concept drift is detected, respectively. This supervised method detects drifts using the average of the elements in the window (BIFET; GAVALDÀ, 2007).

Abdulsalam et al. (ABDULSALAM; SKILLICORN; MARTIN, 2011) combine the ideas of streaming decision trees and Random Forests in an incremental multiclass algorithm. It can adjust its parameters to handle drifts and uses the difference in entropy between two windows, the current and the reference window, to detect drift. Each attribute has counters from which the probabilities of occurrence are calculated; the differences are averaged and used to calculate entropy changes.
When the method builds a new tree, it adds or replaces an old one depending on the number of existing trees in the forest (ABDULSALAM; SKILLICORN; MARTIN, 2011).

With a minimal-distance classifier and a sliding window, Kurlej and Wozniak (KURLEJ; WOZNIAK, 2012) proposed an active learning approach that ponders whether an outside expert has to label a new example as a training instance or not. A heuristic algorithm decides whether a new example is a good reference using two values obtained from this set: the distance to the closest point and the difference between the distances to the two closest points belonging to distinct classes. If a new example meets certain conditions, it replaces the oldest example of the reference set (KURLEJ; WOZNIAK, 2012).

Vivekanandan and Nedunchezhian (VIVEKANANDAN; NEDUNCHEZHIAN, 2011) proposed an online Genetic Algorithm (GA) that takes small snapshots of the training sample and creates rules for each class separately. For each class, it uses multiple windows of fixed size to keep the examples. The window size of each class varies depending on its overall distribution. The oldest examples are replaced by new ones if the window of the class becomes full (VIVEKANANDAN; NEDUNCHEZHIAN, 2011).

Sun and Li (SUN; LI, 2011) proposed the first study on Financial Distress Concept Drift (FDCD): whether there is FDCD and in what way to discard it. To discard FDCD, they build a dynamic FDP model, which embraces instance selection, FDP modeling, and future prediction. To deal with FDCD, the algorithm uses methods like the window-based ones (full-memory, no-memory, fixed and adaptive size) and batch selection. The authors also integrate the Mahalanobis distance for feature selection and employ the Fisher discriminant for classification (SUN; LI, 2011). Hegedus et al.
(HEGEDUS; ORMÁNDI; JELASITY, 2013) presented adaptive versions of GoLF (Gossip Learning Framework): ADAGoLF (with age-based drift handling) and CDDGoLF (with concept drift detection), to handle concept drift in large networks. Adaptivity in AdaGoLF uses the models in the network, employing age-distribution manipulation; it offers both young and old models at any moment, providing diversity of varied ages. CDDGoLF is employed to detect concept drift: it can estimate and monitor performance drifts to forget data exhibiting a performance decay (HEGEDUS; ORMÁNDI; JELASITY, 2013).

Bertini et al. (BERTINI; NICOLETTI; ZHAO, 2013) proposed the Ensemble of CPp-AbDG (Complete P-partite Attribute-based Decision Graph), whose data representation is based on graph structures. It can handle missing attribute values. CPp-AbDG extends another graph-based data constructor, the AbDG (Attribute-based Decision Graph). The AbDG can, in theory, be designed as a graph in which each vertex represents a subrange of the value range of one attribute. Graph-based models are employed as classifiers in the ensemble, a choice the authors justify by the advantages of representing data topologically, with arbitrary shapes, and hierarchically (BERTINI; NICOLETTI; ZHAO, 2013).

Escovedo et al. (ESCOVEDO et al., 2013) presented the NEVE (Neuro-EVolutionary Ensemble) algorithm, a supervised ensemble of weighted classifiers (neural networks), which employs a QIEA (Quantum-Inspired Evolutionary Algorithm) for training. QIEAs can estimate class distributions, providing good performance. The algorithm also determines the weights of each ensemble classifier when a new batch arrives: a new classifier is added to the ensemble once a new batch arrives, and all weights are updated to improve performance (ESCOVEDO et al., 2013). Li et al.
(LI et al., 2015) proposed the EDTC (Ensemble Decision Trees for Concept drifting data streams), an incremental method that defines cut-points in the growing tree with three different random feature selection strategies. When an instance arrives, each growing node splits features randomly to avoid producing unnecessary branches. EDTC employs two thresholds and uses local data distributions to detect drift (LI et al., 2015).

Loeffel et al. (LOEFFEL; MARSALA; DETYNIECKI, 2015) proposed an online algorithm (the Droplets algorithm) that can opt to abstain from predicting in order to handle drifts. Each instance is considered a droplet falling on a plane (the feature space), i.e., the "Droplets' map". Classes can be seen as a "chemical" composition of the droplets, which are mutually repellent; thus, two droplets from different classes will not cover overlapping parts of the map. The method does not employ a fixed threshold. According to the authors, it is the first algorithm to handle concept drift that can abstain from prediction (LOEFFEL; MARSALA; DETYNIECKI, 2015).

Chen et al. (CHEN et al., 2015) proposed a Genetic Algorithm to obtain fuzzy concept drift patterns. CDGFM (Concept Drift Genetic-Fuzzy Mining) handles drift with instance selection in a multiclass classification setting. The membership functions of items are encoded as chromosomes and obtained through the GA process. Each example has a fitness value assessed by the number of concept drift patterns (new concept, drift, and added/expired rules) and the membership function. The membership functions search for satisfying sets of fuzzy association rules to define drift (CHEN et al., 2015).

ZareMoodi et al. (ZAREMOODI; BEIGY; SIAHROUDI, 2015) proposed LOCE (LOcal Classifier Ensemble), an algorithm capable of detecting the appearance of new classes in streams. Each class has an ensemble of classifiers, which is updated by its own metrics and by a pruning phase (removing classifiers to update the system).
If a class no longer exists, all classifiers belonging to its ensemble are eliminated. The method classifies examples of existing classes and discerns between novel and existing classes with local patterns. Using a neighborhood graph to detect new classes, it stores new class candidates in its nodes, using cohesion and separation to determine its components (ZAREMOODI; BEIGY; SIAHROUDI, 2015).

Diaz et al. (DÍAZ et al., 2015) proposed the FAE (Fast Adapting Ensemble) algorithm, a multiclass ensemble algorithm that can deal with recurring concepts. It employs a batch scheme, but it does not need to wait for the entire batch to start classification. A drift detector (often the Drift Detection Method - DDM) decides when to increase the number of classifiers in the ensemble. To deal with recurring concepts, it keeps several aged classifiers (former concepts), which are awakened if these concepts reoccur, avoiding the unnecessary inclusion of new classifiers. To make a global decision, it employs a weighted-majority ensemble vote. It dynamically adjusts the base classifiers' weights, which permits classifiers to stay longer in the ensemble (DÍAZ et al., 2015).

Mirza et al. (MIRZA; LIN; LIU, 2015) proposed the ESOS-ELM (Ensemble of Subset Online Sequential Extreme Learning Machine), a drift detector that can deal with class imbalance on data streams. The main ensemble represents the short-term memory, in which each classifier trains on a balanced selection of the original imbalanced data. It also has a concept drift detector, and a long-term memory keeps information. A minority-class instance is processed by m classifiers, where m is the imbalance ratio, while a majority-class instance is processed by a single classifier in a round-robin fashion. According to the authors, the method can do both online and batch learning (MIRZA; LIN; LIU, 2015). ZareMoodi et al.
(ZAREMOODI; SIAHROUDI; BEIGY, 2016) proposed the SVSCLASS (Support Vector-based Stream CLASSifier), a support vector-based method that can deal with the appearance of new classes. Using the support vector domain description, it maintains class boundaries with spheres in the kernel space, which are used to label the instances of incoming batches. Instances located outside the spheres may indicate the emergence of a new class. To detect a new class, it builds a neighborhood graph to analyze cohesion with, and separation from, the existing classes. Using support vector clustering, the instances are partitioned into clusters and analyzed to be labeled (the method acquires the true labels and the sphere is updated). To handle concept drift, the sphere boundaries shrink, enlarge, and merge during learning (ZAREMOODI; SIAHROUDI; BEIGY, 2016).

Klinkenberg and Renz (KLINKENBERG; RENZ, 1998) presented a method to detect concept drift using a window approach with fixed or adaptive size. The subject addressed in this paper was information filtering, with two concepts to deal with: relevant or irrelevant. Using some indicators (recall, precision, and accuracy, the latter considered less important for detecting drift according to the authors), it detects concept drift and calculates heuristics to adapt the window size. If an abrupt change occurs, the window reaches its minimal size (one batch), whereas if a gradual change occurs, its size is decreased according to a reduction rate defined by the user. If no drift is detected, instances are stored up to a maximum training set size to build a stable learner (KLINKENBERG; RENZ, 1998).

Gama et al. (GAMA et al., 2004) proposed the DDM (Drift Detection Method), a drift detector that works with a probability distribution over the online error rate. It defines two levels: the warning and the drift level.
If the error reaches the warning level at instance i_w and the drift level at instance i_d, a drift is considered to have occurred, and the method starts the training process with data from i_w. According to the authors, it can operate with online and incremental methods, and also as a wrapper for batch classifiers (GAMA et al., 2004).

Yang et al. (YANG; WU; ZHU, 2005) proposed the RePro system (REactive plus PROactive). RePro can conduct reactive prediction: it detects concept drift and learns with new data; it can also be proactive: predicting the next concept given the present one. The authors generate a concept history from the stream (which is more compact than the raw data) to learn patterns of concept transitions. RePro employs a sliding-window structure that starts with a misclassified example. Once the window is filled and its error rate stands above a threshold, the first example is considered a trigger; otherwise, the window passes to the next misclassified example, excluding previous examples. The proactive mode works after trigger detection and does not employ a warning window. When a new trigger is identified, the predicted concept takes the lead. The reactive mode builds a model only after drift detection, using trigger examples. According to the authors, RePro can detect the emergence of a new concept and the reappearance of an old one (YANG; WU; ZHU, 2005).

García et al. (GARCÍA et al., 2006) presented EDDM (Early Drift Detection Method), a drift detector that works with the distance-error-rate (the estimated distribution of the distance between classification errors) rather than the classification errors themselves. EDDM has two thresholds: the Warning level, beyond which instances are kept; and the Drift level, at which the method considers that the concept has drifted and builds a new model with the data kept since the warning level. EDDM starts to search for a concept drift after 30 errors have occurred.
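The two-level test that DDM applies to the online error rate can be sketched as follows, using the thresholds published by Gama et al. (2004): warning when p + s ≥ p_min + 2·s_min and drift when p + s ≥ p_min + 3·s_min, where p is the running error rate and s its standard deviation. The class and attribute names are ours, and strict comparisons are used here so a perfectly clean error stream never triggers; the published method also waits for a minimum number of instances (30) before signalling.

```python
import math

class DDMSketch:
    """Minimal sketch of DDM's two-level test on the online error rate."""

    def __init__(self):
        self.n = 0
        self.p = 1.0                  # running error rate (overwritten on first update)
        self.p_min = float("inf")     # best (lowest) p + s seen so far
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the last instance was misclassified, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n           # incremental mean
        s = math.sqrt(self.p * (1 - self.p) / self.n) # binomial std. dev.
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s        # track the best point
        if self.p + s > self.p_min + 3 * self.s_min:
            return "drift"     # rebuild the model with data from warning on
        if self.p + s > self.p_min + 2 * self.s_min:
            return "warning"   # start storing instances
        return "in-control"
```

The detector wraps any classifier: feed it the 0/1 error of each prediction and act on the returned state, retraining from the instances stored since the warning once "drift" is reported.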
Otherwise, if the similarity rises again after the warning threshold, the stored instances are deleted and the system reverts to an "in-control" level (GARCÍA et al., 2006).

Da Silva et al. (SILVA et al., 2006) proposed RL-CD (Reinforcement Learning with Context Detection), which evaluates the prediction quality of partial models. It manages several partial models, creating, updating, and selecting them, and selects only the partial model with the highest quality, each one specialized in a particular environment dynamics. A model replacement means a drift was detected in the environment. A new model is created if the best model's quality is below a threshold; the new model comprises a dynamics predictor and the corresponding optimal policy (SILVA et al., 2006).

Kolter and Maloof (KOLTER; MALOOF, 2007) proposed the DWM (Dynamic Weighted Majority), an ensemble algorithm to deal with concept drift that creates and removes weighted experts according to global performance. If the main procedure gives a wrong weighted-majority answer, the method adds an expert to the ensemble. If an expert gives a wrong answer, its weight is decreased; if an expert's weight decreases too much, it is deleted from the ensemble (KOLTER; MALOOF, 2007).

Katakis et al. (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010) proposed the CCP (Conceptual Clustering & Prediction) framework, an incremental ensemble to deal with recurring contexts, which uses clusters and a transformation function to map batches into conceptual representation models. Each classifier in the ensemble represents a concept, and when a new batch comes from the stream, the clustering algorithm identifies its concept and CCP employs the corresponding classifier to label the instances (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010).

In the area of face detection, Susnjak et al. (SUSNJAK; BARCZAK; HAWICK, 2012) proposed an adaptive learning method for cascades of boosted ensembles.
The algorithm employs a hybrid of active and passive updating approaches. During training, classifiers belonging to the same layer are combined into ensemble-clusters whose results are weighted based on their performance. During classification, the algorithm uses the deliberations of the ensemble-clusters per layer to calculate a collective value, which is compared with the learned layer thresholds to give a decisive answer. To detect drift, it uses a trigger mechanism based on the classification error rate (SUSNJAK; BARCZAK; HAWICK, 2012).

Minku and Yao (MINKU; YAO, 2012) proposed the DDD (Diversity for Dealing with Drifts) online ensemble algorithm. The authors also analyzed the combination of low- and high-diversity ensembles, concluding that each diversity level reaches better prequential accuracy depending on the drift type. The paper further reveals the possibility of old-concept information benefiting the training process of a new concept by using different levels of diversity in ensemble learning, diversity being a measure of the discordance among classifiers. Thus, DDD keeps ensembles with different diversity levels to deal with drifts, taking old-concept information to assist the learning (MINKU; YAO, 2012).

Farid et al. (FARID et al., 2013) proposed an adaptive Ensemble Model (EM) able to detect new classes. EM uses automatic decision trees with clustering, which classify an instance by majority weighted voting. Some instances are labeled to train a new classifier, and when it becomes competitive, it can replace the older one with the smallest weight (i.e., the minimum classification accuracy rate). To detect a new class, the assumption is that instances should be close to each other if they belong to the same class, and distant from instances of other classes; if an instance is distant from the present clusters, it is considered an example of a new class (FARID et al., 2013). Harel et al.
(HAREL et al., 2014) proposed a method that analyzes the empirical loss distribution, whose statistics are acquired by reusing the data multiple times via resampling. The method employs random permutations to detect drift, creating multiple train-test splits from the stream. If no drift has occurred, predictions on the ordered data should not differ much from those on the shuffled data once the algorithm achieves stability. When it receives the time indices of a concept drift, the method uses them to modify the window size and initiate a training phase, also providing information to ensemble learners. Instances are supposed to be temporally independent but, according to the authors, tactics can be applied to maintain exchangeability (HAREL et al., 2014).

Raza et al. (RAZA; PRASAD; YUHUA, 2014) proposed ALCSD (Adaptive Learning with Covariate Shift-Detection), an adaptive algorithm that employs dataset shift-detection with an exponentially weighted moving average (EWMA) model. An EWMA model-based shift-detection test supervises the covariate shift and initiates adaptation at the shift-detection point: it retrains classifiers with the updated knowledge base and different adaptation methods (RAZA; PRASAD; YUHUA, 2014).

Lu et al. (LU; ZHANG; LU, 2014) proposed a method to detect drift in a case-based reasoning (CBR) environment. They presented a competence model to detect drifts: using two sliding windows, it compares the data distributions through competence instead of the feature space. Competence measures the success of the CBR system in accomplishing its goals, and prior knowledge of the case distribution is not necessary. The method provides reliable statistics on the detected drift and also maintains records and drift quantifications (LU; ZHANG; LU, 2014).

Wang and Abraham (WANG; ABRAHAM, 2015) proposed the LFR (Linear Four Rates) framework, a drift detector that can deal with batch and stream approaches, binary-class problems, and imbalanced data.
LFR requires user-specified parameters but does not depend on the underlying statistical model, and the drift detector is not tied to a particular classifier (WANG; ABRAHAM, 2015).

Khamassi et al. (KHAMASSI et al., 2015) proposed EDIST2, a drift detector with self-adaptive windowing to deal with different types of drift. EDIST2 supervises performance and detects drift with two sliding windows: a global and a current one. The global window (GW) grows in stable environments and shrinks if a drift is detected. The current window (CW) comprises only the current batch instances. EDIST2 computes the error distance distribution of GW and CW and compares the difference between their error distance averages. It has three thresholds: In-Control level – assumes that there are no changes, so CW's instances are added to GW; Warning level – starts storing instances in an auxiliary window for a potential change; and Drift level – drift is detected and only instances kept from the Warning level remain in GW. The drift threshold is automatically calculated with statistical guarantees (KHAMASSI et al., 2015).

Gao et al. (GAO et al., 2007) argued that a descriptive model, such as the posterior probability, is preferable for real stream classification. They proposed UCB (Uncorrelated Bagging), a framework to deal with imbalanced data that applies sampling and ensemble methods to the skewed stream mining problem. It builds balanced training data by keeping all minority instances and undersampling the majority instances (the minority class is assumed to be stationary) (DITZLER; POLIKAR, 2013). The final output is the averaged probability computed on the application data by several models (GAO et al., 2007).

Yalcin et al. (YALCIN; ERDEM; GURGEN, 2007) employed Support Vector Machines in an ensemble-based incremental learning algorithm. In previous works, they integrated SVMs into an ensemble framework with Learn++ to create an incremental learning algorithm (SVMLearn++).
A forgetting approach is employed on SVMLearn++ to eliminate the effects of redundant data, and in this work SVMLearn++ was assessed under the following learning approaches: Learn++ without pruning, with the top N highest-performing classifiers, and with replacement of the worst-performing classifier. The authors use SVM with an RBF kernel in two-class problems (YALCIN; ERDEM; GURGEN, 2007).

Chen and He (CHEN; HE, 2009) proposed the SERA (SElectively Recursive Approach) algorithm to handle imbalanced data. It selectively retains minority instances in the current batch and assigns sampling probabilities proportionally to the majority and minority instances to increase the performance on minority instances. The amount of minority instances is limited by the authors to be proportional to the size of the majority data, and a Mahalanobis distance decides the priority order of acceptance. They also elaborated a biased bagging approach (BBagging) to boost the performance of a single classifier focusing on minority instances of imbalanced datasets (CHEN; HE, 2009).

Elwell and Polikar (ELWELL; POLIKAR, 2011) presented the Learn++.NSE algorithm, an incremental ensemble algorithm designed for NonStationary Environments (NSEs). Learn++.NSE receives the batches and incrementally learns from them without requesting access to previous data, handling concept drift with a passive approach. One classifier is trained on each batch, and the results are obtained by combining the ensemble through a dynamically-weighted majority vote based on the classifiers' time-adjusted errors. Learn++.NSE can handle the appearance of a new class and the deletion of an old one (ELWELL; POLIKAR, 2011).

Ditzler et al. (DITZLER; POLIKAR; CHAWLA, 2010) proposed Learn++.SMOTE, a hybrid algorithm combining the Learn++.NSE and SMOTE approaches to deal with class imbalance.
This hybrid algorithm can boost the recall of the minority class, and any supervised learning algorithm can be used as the base classifier (DITZLER; POLIKAR; CHAWLA, 2010).

Ditzler and Polikar (DITZLER; POLIKAR, 2010) proposed a method based on the Learn++.NSE algorithm: Learn++.NIE (Nonstationary and Imbalanced Environments). The differences between NSE and NIE are that i) for each batch, NIE generates a sub-ensemble (rather than an individual classifier), and ii) another metric (not the classification error) is employed as the evaluation measure. Compared to Learn++.NSE, Learn++.NIE has slower recovery but can boost minority-class performance. It can separately weight the average error of the majority- and minority-class recall, and can also reward classifiers with good performance not only on the majority class but on both minority and majority classes (DITZLER; POLIKAR, 2010).

In (DITZLER; POLIKAR, 2013), Ditzler and Polikar proposed two incremental ensemble algorithms to deal with imbalanced data. The first method (Learn++.CDS) is a logical union of the Learn++.NSE algorithm and the Synthetic Minority class Oversampling TEchnique (SMOTE) to learn in imbalanced environments. The second method is Learn++.NIE. Learn++.CDS employs SMOTE to readjust the class balance with synthetic minority-class examples, then runs Learn++.NSE on the balanced data. Both Learn++.CDS and Learn++.NIE were shown to perform better with a fixed ensemble size (DITZLER; POLIKAR, 2013).

Klinkenberg (KLINKENBERG, 2001) employed Support Vector Machines to handle concept drift. Building on the work in (KLINKENBERG; JOACHIMS, 2000), they reduce the need for labeled data by using unlabeled instances in a transductive way. The method discards irrelevant data, minimizing the generalization error through windows of automatically adjusted size.
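The window-adjustment idea just described can be sketched as follows: for each candidate window size, a model is trained on the most recent examples of that size, and the size with the lowest estimated error is kept. This is a simplified illustration – Klinkenberg's method uses SVM εα-estimates rather than the naive hold-out error below, and all names are hypothetical:

```python
# Simplified sketch of adaptive window sizing: evaluate several candidate
# training windows over the most recent data and keep the one whose model
# shows the lowest estimated error. A trivial majority-class model stands
# in for the SVM, and hold-out error stands in for the xi-alpha estimate.

def estimated_error(train, test):
    """Error of a majority-class predictor trained on `train`."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != majority) / len(test)

def best_window(stream, candidate_sizes, test_size=10):
    """Pick the window size minimizing error on the latest batch."""
    test = stream[-test_size:]                  # most recent batch
    history = stream[:-test_size]
    best, best_err = None, float("inf")
    for size in candidate_sizes:
        window = history[-size:]                # newest `size` examples
        err = estimated_error(window, test)
        if err < best_err:
            best, best_err = size, err
    return best

# After an abrupt drift, a small window (recent data only) wins:
old = [(i, 0) for i in range(100)]             # old concept: label 0
new = [(i, 1) for i in range(100, 130)]        # new concept: label 1
print(best_window(old + new, [10, 50, 120]))   # -> 10
```

After a drift, windows that still cover the old concept yield higher estimated error, so the procedure naturally shrinks the training window.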
The method employs some properties of SVMs, adapting εα-estimates (a particular process to assess SVM performance based on the idea of leave-one-out estimation) to select the window size (KLINKENBERG, 2001).

Widyantoro (WIDYANTORO, 2007) proposed a framework that employs unlabeled data in information-filtering domains. A concept hierarchy is incrementally built in an unsupervised way to be used in classification. To infer concepts, a persistence assumption from temporal reasoning is employed. The method permits the classifier to be tailored to target applications. The performance depends on the method for identifying classes, the method for building the hierarchy, and the chosen classifier (WIDYANTORO, 2007).

Spinosa et al. (SPINOSA et al., 2008) proposed OLINDDA (OnLIne Novelty and Drift Detection Algorithm), a cluster-based algorithm that can detect new concepts, assessed in the intrusion detection domain. Initially, it creates a normal profile of a single class and, to identify a new concept, it uses cohesive sets of clusters, merging similar concepts during learning. OLINDDA stores information in three hypersphere-based models, which describe the normal profile, extensions of the normal profile, and new concepts. The normal model is static and used as a reference; the extension and new-concept models are continuously created and updated. It combines supervised and unsupervised learning: the former builds the normal model, while the latter treats unlabeled instances and detects novel concepts (SPINOSA et al., 2008).

Masud et al. (MASUD et al., 2011b) proposed ECSMiner (Enhanced Classifier for data Streams with novel class Miner), a method to detect the appearance of a new class. To handle drift, it uses M classifiers in an ensemble with majority voting, which is continuously updated: when a new model is trained, it can replace an older one with the highest error.
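The replace-the-worst ensemble update used by ECSMiner (and by several other ensemble methods above) can be sketched generically as follows (an illustration with toy constant-predictor models; the real method combines this with majority voting and per-classifier novel-class detectors, and all names are ours):

```python
# Generic sketch of the "replace the worst" ensemble update: keep at most
# M models; when a new one is trained, evict the member with the highest
# error on the latest labeled batch.

def update_ensemble(ensemble, new_model, batch_error, max_size=3):
    """ensemble: list of (model, error) pairs; returns the updated list."""
    ensemble = ensemble + [(new_model, batch_error(new_model))]
    if len(ensemble) > max_size:
        # drop the member with the highest error
        worst = max(ensemble, key=lambda pair: pair[1])
        ensemble.remove(worst)
    return ensemble

# Toy models are constant predictors; error is measured on a labeled batch.
batch = [1, 1, 0, 1]
err = lambda model: sum(1 for y in batch if y != model) / len(batch)

ens = []
for model in [0, 1, 0, 1]:          # stream of newly trained "models"
    ens = update_ensemble(ens, model, err)
print([m for m, _ in ens])          # -> [1, 0, 1]
```

The same skeleton underlies Learn++-style and SCANR-style model management; the methods differ mainly in how the error (or weight) of each member is computed.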
Each classifier has a class detector and, if all of them declare a novel class, a new class is determined. To do so, ECSMiner verifies cohesion and separation: if an instance is isolated from the training examples, it is recognized as an Foutlier. If an instance is not an Foutlier, ECSMiner uses ensemble voting to classify it. If a sufficient number of cohesive Foutliers is reached, a new class is declared (MASUD et al., 2011b).

Masud et al. (MASUD et al., 2011a) proposed SCANR (Stream Classifier And Novel and Recurring class detector), a method to deal with recurring classes and to detect new ones in multi-class problems. It maintains primary and auxiliary ensembles to store old classification models. After building a new primary model, it substitutes the model in the primary ensemble with the worst prediction error. The primary ensemble analyzes instances to detect outliers and classifies an instance if it is not an outlier. If an instance is classified as an outlier (primary outlier), the auxiliary ensemble verifies it again: if the auxiliary ensemble does not flag it as an outlier, it is assumed to belong to a recurring class and is classified by the auxiliary ensemble; otherwise, it is labeled a secondary outlier and provisionally added to a buffer. The new-class detection module starts when the buffered instances reach a certain number (MASUD et al., 2011a).

Li et al. (LI; WU; HU, 2012) proposed the SUN method, a Semi-supervised classification algorithm for data streams with concept drifts and UNlabeled data. It creates a decision tree, generating concept clusters at the leaves with k-Modes. It distinguishes drifts from noise via a bottom-up search and the deviation between historical concepts and new ones. The concept clusters classify instances, and the labeled examples are reused in an attempt to reduce the drift rate (LI; WU; HU, 2012).

Katakis et al.
(KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b) proposed the CCP (Conceptual Clustering and Prediction) framework, a probabilistic representation model for stream learning employing incremental clustering algorithms. It maps batches into “Conceptual Vectors” (CV) containing conceptual information, and vectors that are geometrically close always belong to the same conceptual theme. Clustering works on the stream of CVs, summarizing batches into concepts. For each batch, CCP assigns the concept (cluster) and employs the corresponding classifier. One advantage of CCP is that it only needs to store the clusters' centers and one classifier per cluster, with no need to store old batches or CVs (KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b).

Sethi and Kantardzic (SETHI; KANTARDZIC, 2017) proposed MD3 (Margin Density Drift Detection), a drift detector for unlabeled data streams. The number of instances in the area of uncertainty of the classifier (the margin) is used as a metric to detect drift. If a variation in margin density occurs, the classifier needs labeled instances to be retrained.

Tennant et al. (TENNANT et al., 2017) proposed the MC-NN (Micro-Cluster Nearest Neighbour) method, which computes statistics from the stream and employs a nearest neighbour algorithm, allowing MC-NN to run in parallel. In the serial version, the data does not need to be kept in memory and is processed incrementally. A statistical summary is built as a set of variance-based Micro-Clusters (MCs). MCs handle concept drift by updating their statistics with new instances. The parallel MC-NN distributes MCs to computational nodes in a computer cluster. Each node of the parallel version adapts to drift as in the serial version, with a voting mechanism to classify instances (TENNANT et al., 2017).

Liu et al. (LIU et al., 2017) proposed a drift detector in the sensor network domain based on angle optimized global embedding (AOGE) and principal component analysis (PCA).
The drift detector reduces processing time due to the compatibility between the detector and the data processing, and also improves performance through dimensionality reduction (with PCA). PCA and AOGE examine the projection variance and the projection angle, respectively. Combined, they are used to identify changes in the objective function. The authors used Extreme Learning Machine (ELM) and SVM as classifiers (LIU et al., 2017).

Silva et al. (SILVA; HRUSCHKA; GAMA, 2017) proposed the FEAC-Stream (Fast Evolutionary Algorithm for Clustering data Streams) algorithm, which uses k-means clustering with k automatically estimated from the stream. FEAC-Stream uses the Page-Hinkley Test to identify decreases in cluster quality and then re-estimates k with an evolutionary algorithm. It considers that partially unknown data can provide valid stream knowledge (SILVA; HRUSCHKA; GAMA, 2017).

Xu and Wang (XU; WANG, 2017) proposed DELM (Dynamic Extreme Learning Machine) to classify online data streams employing ELM as the classifier. Using thresholds to detect drift, it employs a double-hidden-layer structure to train and improve performance: when a drift alert is issued, additional hidden-layer nodes are added to the neural network; once the drift is confirmed, a new classifier replaces an older, low-performing one (XU; WANG, 2017).

Arabmakki and Kantardzic (ARABMAKKI; KANTARDZIC, 2017) proposed the RLS-SOM (Reduced Labeled Samples-Self Organizing Map) framework for imbalanced streams. An ensemble classifies instances with DWM and retrains a new model using partially labeled samples when a drift is detected. The method uses both the global answer and each individual answer: if an individual model outperforms the ensemble, it is chosen instead. After drift detection, the method selects majority- and minority-class instances to train a new model.
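Selecting the instances closest to the decision boundary, as RLS-SOM does with a trained SVM, amounts to ranking instances by their distance to the separating hyperplane. A simplified sketch (a fixed hyperplane w·x + b = 0 stands in for the trained SVM; all names are assumptions):

```python
# Sketch of margin-based instance selection: rank instances by their
# distance to a linear decision boundary w.x + b = 0 and keep the k
# closest (those inside or near the margin). In RLS-SOM the boundary
# comes from a trained SVM; here a fixed hyperplane stands in.
import math

def margin_select(instances, w, b, k):
    """Return the k instances closest to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dist = lambda x: abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    return sorted(instances, key=dist)[:k]

points = [(0.1, 0.0), (2.0, 2.0), (-0.2, 0.1), (3.0, -1.0)]
# boundary x1 = 0 (w = (1, 0), b = 0): distances 0.1, 2.0, 0.2, 3.0
print(margin_select(points, (1.0, 0.0), 0.0, 2))  # -> [(0.1, 0.0), (-0.2, 0.1)]
```

Instances near the boundary are the most informative to label after a drift, which is why the framework falls back to SOM only when no minority instance appears in this region.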
If it is a conditional drift, an SVM is employed to choose the instances closest to the decision boundary (margin). However, if no minority instance is found near the decision boundary, the SOM algorithm maps the batch to search for minority instances in the whole feature space (ARABMAKKI; KANTARDZIC, 2017).

Zhang et al. (ZHANG et al., 2017) proposed a three-layered drift detection technique in the text data stream domain, where the layers denote, respectively, the label space, the feature space, and the mapping relationships between labels and features. According to the authors, it can employ any classifier and can measure changes in each layer to detect different types of drift (ZHANG et al., 2017).

The methods are generally evaluated with metrics such as accuracy, precision, recall, and error. Less-used metrics include processing time, detection delay, false alarm rate, prequential error, the total number of changes detected, prequential accuracy, Monte Carlo error, predictive accuracy, prediction errors, and G-mean, among others.

Table 2.1 categorizes the techniques according to the drift detector mechanism, and Table 2.2 presents the classifiers employed in the articles considered in this work. Table 2.3 depicts the type of concept drift handling used: instance selection (i.e., a window of fixed size, a window with automatically adjusted size, etc.), instance weighting, ensemble learning, clustering, or sampling. Table 2.4 categorizes the techniques according to the learning approach, i.e., supervised, unsupervised, or semi-supervised.

Table 2.1: Drift detector mechanism.
Drift detector (SCHLIMMER; GRANGER, 1986; WIDMER; KUBAT, 1996; BIFET; GAVALDÀ, 2007; ABDULSALAM; SKILLICORN; MARTIN, 2011; HEGEDUS; ORMÁNDI; JELASITY, 2013; LI et al., 2015; DÍAZ et al., 2015; MIRZA; LIN; LIU, 2015; KLINKENBERG; RENZ, 1998; GAMA et al., 2004; YANG; WU; ZHU, 2005; GARCÍA et al., 2006; SILVA et al., 2006; SUSNJAK; BARCZAK; HAWICK, 2012; MINKU; YAO, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2010; HAREL et al., 2014; RAZA; PRASAD; YUHUA, 2014; LU; ZHANG; LU, 2014; WANG; ABRAHAM, 2015; KHAMASSI et al., 2015; LI; WU; HU, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b; SETHI; KANTARDZIC, 2017; LIU et al., 2017; SILVA; HRUSCHKA; GAMA, 2017; XU; WANG, 2017; ARABMAKKI; KANTARDZIC, 2017; ZHANG et al., 2017)

No drift detector (SALGANICOFF, 1993; KUBAT; WIDMER, 1995; KLINKENBERG; JOACHIMS, 2000; STREET; KIM, 2001; STANLEY, 2003; SCHOLZ; KLINKENBERG, 2005; KURLEJ; WOZNIAK, 2012; VIVEKANANDAN; NEDUNCHEZHIAN, 2011; SUN; LI, 2011; HEGEDUS; ORMÁNDI; JELASITY, 2013; BERTINI; NICOLETTI; ZHAO, 2013; ESCOVEDO et al., 2013; LOEFFEL; MARSALA; DETYNIECKI, 2015; CHEN et al., 2015; ZAREMOODI; BEIGY; SIAHROUDI, 2015; ZAREMOODI; SIAHROUDI; BEIGY, 2016; KOLTER; MALOOF, 2007; FARID et al., 2013; GAO et al., 2007; YALCIN; ERDEM; GURGEN, 2007; CHEN; HE, 2009; DITZLER; POLIKAR, 2010; DITZLER; POLIKAR; CHAWLA, 2010; DITZLER; POLIKAR, 2013; ELWELL; POLIKAR, 2011; KLINKENBERG, 2001; WIDYANTORO, 2007; SPINOSA et al., 2008; MASUD et al., 2011b, 2011a; TENNANT et al., 2017)

2.3.1 Datasets used in the literature

In this subsection, we summarize some real-world and synthetic datasets used in the literature to test and simulate concept drift environments. Synthetic datasets are very useful since we can guarantee that concept drift really exists and specify which type of change occurs (i.e., gradual or abrupt).

Table 2.2: Classifiers.
SVM (KLINKENBERG; JOACHIMS, 2000; SCHOLZ; KLINKENBERG, 2005; ZAREMOODI; SIAHROUDI; BEIGY, 2016; KLINKENBERG; RENZ, 1998; RAZA; PRASAD; YUHUA, 2014; WANG; ABRAHAM, 2015; YALCIN; ERDEM; GURGEN, 2007; ELWELL; POLIKAR, 2011; SETHI; KANTARDZIC, 2017; LIU et al., 2017; ARABMAKKI; KANTARDZIC, 2017; ZHANG et al., 2017; KLINKENBERG, 2001)

Decision trees (STREET; KIM, 2001; STANLEY, 2003; SCHOLZ; KLINKENBERG, 2005; ABDULSALAM; SKILLICORN; MARTIN, 2011; DÍAZ et al., 2015; LI et al., 2015; MINKU; YAO, 2012; FARID et al., 2013; GAO et al., 2007; DITZLER; POLIKAR, 2010; MASUD et al., 2011b; LI; WU; HU, 2012; SETHI; KANTARDZIC, 2017; ARABMAKKI; KANTARDZIC, 2017; ZHANG et al., 2017; KLINKENBERG; RENZ, 1998; GAMA et al., 2004; GARCÍA et al., 2006; YANG; WU; ZHU, 2005; KHAMASSI et al., 2015)

Naive Bayes (BIFET; GAVALDÀ, 2007; LI et al., 2015; KLINKENBERG; RENZ, 1998; KOLTER; MALOOF, 2007; MINKU; YAO, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2010; FARID et al., 2013; RAZA; PRASAD; YUHUA, 2014; GAO et al., 2007; ELWELL; POLIKAR, 2011; KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b; ZHANG et al., 2017)

k-means clusterer (BIFET; GAVALDÀ, 2007; SPINOSA et al., 2008; MASUD et al., 2011a; SILVA; HRUSCHKA; GAMA, 2017; ZAREMOODI; BEIGY; SIAHROUDI, 2015)

knn (KLINKENBERG; RENZ, 1998; HAREL et al., 2014; RAZA; PRASAD; YUHUA, 2014; LU; ZHANG; LU, 2014; DITZLER; POLIKAR; CHAWLA, 2010; MASUD et al., 2011b)

Logistic Regression (HEGEDUS; ORMÁNDI; JELASITY, 2013; GAO et al., 2007)

Neural Networks (ESCOVEDO et al., 2013; CHEN; HE, 2009; DITZLER; POLIKAR, 2010; MINKU; YAO, 2012; GAMA et al., 2004; TENNANT et al., 2017; DITZLER; POLIKAR; CHAWLA, 2010)

Regression tree (DITZLER; POLIKAR, 2013; ELWELL; POLIKAR, 2011)

ELM (LIU et al., 2017; XU; WANG, 2017; MIRZA; LIN; LIU, 2015)

Others (SCHLIMMER; GRANGER, 1986; SALGANICOFF, 1993; KUBAT; WIDMER, 1995; WIDMER; KUBAT, 1996; KURLEJ; WOZNIAK, 2012; VIVEKANANDAN; NEDUNCHEZHIAN, 2011; SUN; LI, 2011; BERTINI; NICOLETTI; ZHAO, 2013; LOEFFEL; MARSALA; DETYNIECKI, 2015;
CHEN et al., 2015; KLINKENBERG; RENZ, 1998; GARCÍA et al., 2006; SILVA et al., 2006; KOLTER; MALOOF, 2007; SUSNJAK; BARCZAK; HAWICK, 2012; WIDYANTORO, 2007; ARABMAKKI; KANTARDZIC, 2017)

Table 2.3: Concept drift handling.

Instance selection (KUBAT; WIDMER, 1995; WIDMER; KUBAT, 1996; KLINKENBERG; JOACHIMS, 2000; STANLEY, 2003; BIFET; GAVALDÀ, 2007; KURLEJ; WOZNIAK, 2012; VIVEKANANDAN; NEDUNCHEZHIAN, 2011; SUN; LI, 2011; HEGEDUS; ORMÁNDI; JELASITY, 2013; LOEFFEL; MARSALA; DETYNIECKI, 2015; CHEN et al., 2015; ZAREMOODI; SIAHROUDI; BEIGY, 2016; KLINKENBERG; RENZ, 1998; GAMA et al., 2004; YANG; WU; ZHU, 2005; GARCÍA et al., 2006; SILVA et al., 2006; HAREL et al., 2014; RAZA; PRASAD; YUHUA, 2014; LU; ZHANG; LU, 2014; WANG; ABRAHAM, 2015; KHAMASSI et al., 2015; KLINKENBERG, 2001; SETHI; KANTARDZIC, 2017; TENNANT et al., 2017; LIU et al., 2017; XU; WANG, 2017)

Instance weighting (SCHLIMMER; GRANGER, 1986; SALGANICOFF, 1993)

Ensemble learning (STREET; KIM, 2001; STANLEY, 2003; SCHOLZ; KLINKENBERG, 2005; ABDULSALAM; SKILLICORN; MARTIN, 2011; BERTINI; NICOLETTI; ZHAO, 2013; ESCOVEDO et al., 2013; LI et al., 2015; ZAREMOODI; BEIGY; SIAHROUDI, 2015; DÍAZ et al., 2015; MIRZA; LIN; LIU, 2015; KOLTER; MALOOF, 2007; SUSNJAK; BARCZAK; HAWICK, 2012; MINKU; YAO, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2010; FARID et al., 2013; GAO et al., 2007; YALCIN; ERDEM; GURGEN, 2007; DITZLER; POLIKAR, 2010; DITZLER; POLIKAR; CHAWLA, 2010; DITZLER; POLIKAR, 2013; ELWELL; POLIKAR, 2011; MASUD et al., 2011b, 2011a; ARABMAKKI; KANTARDZIC, 2017; ZHANG et al., 2017)

Clustering (WIDYANTORO, 2007; SPINOSA et al., 2008; LI; WU; HU, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b; SILVA; HRUSCHKA; GAMA, 2017)

Sampling (GAO et al., 2007; CHEN; HE, 2009)

2.3.1.1 Synthetic Datasets

Below, some synthetic datasets used in the literature:

• SINE1: abrupt concept drift, noise-free instances. Two relevant attributes, each with a uniformly distributed value in [0,1]. In the first concept, an instance is classified as positive if its value stands below the curve y = sin(x), and as negative otherwise. After the drift, the classification is reversed (GAMA et al., 2004).

• SINE2: has two relevant attributes like SINE1. The classification function is y < 0.5 + 0.3 × sin(3πx). After the drift, the concept is inverted (GAMA et al., 2004).

• SINIRREL1: has the same classification function as SINE1; the instances, however, present two irrelevant attributes (GAMA et al., 2004).

• SINIRREL2: has the same classification function as SINE2 but, like SINIRREL1, the instances present two irrelevant attributes (GAMA et al., 2004).

• CIRCLES: presents gradual drift and instances without noise. It has four classification functions determined by four circles (four concepts). Instances are classified according to their location: if an instance is inside the circle defined by the circular function, it is classified as positive; otherwise, as negative. The gradual change occurs by modifying the center and radius of the circle (GAMA et al., 2004).

Table 2.4: Learning approaches.

Supervised (SCHLIMMER; GRANGER, 1986; SALGANICOFF, 1993; KUBAT; WIDMER, 1995; WIDMER; KUBAT, 1996; KLINKENBERG; JOACHIMS, 2000; STREET; KIM, 2001; STANLEY, 2003; SCHOLZ; KLINKENBERG, 2005; BIFET; GAVALDÀ, 2007; ABDULSALAM; SKILLICORN; MARTIN, 2011; KURLEJ; WOZNIAK, 2012; VIVEKANANDAN; NEDUNCHEZHIAN, 2011; SUN; LI, 2011; HEGEDUS; ORMÁNDI; JELASITY, 2013; BERTINI; NICOLETTI; ZHAO, 2013; ESCOVEDO et al., 2013; LI et al., 2015; LOEFFEL; MARSALA; DETYNIECKI, 2015; CHEN et al., 2015; ZAREMOODI; BEIGY; SIAHROUDI, 2015; DÍAZ et al., 2015; MIRZA; LIN; LIU, 2015; ZAREMOODI; SIAHROUDI; BEIGY, 2016; KLINKENBERG; RENZ, 1998; GAMA et al., 2004; YANG; WU; ZHU, 2005; GARCÍA et al., 2006; SILVA et al., 2006; KOLTER; MALOOF, 2007; SUSNJAK; BARCZAK; HAWICK, 2012; MINKU; YAO, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2010; FARID et al., 2013; HAREL et al., 2014; RAZA; PRASAD; YUHUA, 2014; LU; ZHANG; LU, 2014; WANG; ABRAHAM, 2015; KHAMASSI et al., 2015; GAO et al., 2007; YALCIN; ERDEM; GURGEN, 2007; CHEN; HE, 2009; DITZLER; POLIKAR, 2010; DITZLER; POLIKAR; CHAWLA, 2010; DITZLER; POLIKAR, 2013; ELWELL; POLIKAR, 2011; TENNANT et al., 2017; LIU et al., 2017; XU; WANG, 2017; ARABMAKKI; KANTARDZIC, 2017; ZHANG et al., 2017)

Unsupervised (SETHI; KANTARDZIC, 2017; SILVA; HRUSCHKA; GAMA, 2017)

Semisupervised (KLINKENBERG, 2001; WIDYANTORO, 2007; SPINOSA et al., 2008; MASUD et al., 2011b, 2011a; LI; WU; HU, 2012; KATAKIS; TSOUMAKAS; VLAHAVAS, 2008b)

• GAUSS: has abrupt drift and instances with noise. Domain R × R with two relevant attributes. Positive instances follow a normal distribution with center [0,0] and standard deviation 1; negative instances have center [2,0] and standard deviation 4. After the drift, the classification is reversed (GAMA et al., 2004).

• SINE1G: presents a very slow gradual drift and instances without noise.
Similar to SINE1, but the gradual drift is achieved by selecting instances from both the past and the current concept (a transition period between concepts). As time passes, selecting an instance from the past concept has gradually lower probability, while selecting one from the new concept has gradually higher probability (GARCÍA et al., 2006).

• STAGGER: has abrupt drift and instances without noise. Instances have three symbolic attributes – size (small, medium, large), color (red, green), and shape (circular, non-circular). In the first concept, instances are positive if size = small ∧ color = red. In the second concept, positive instances are defined by color = green ∨ shape = circular. In the third concept, positive instances are defined by size = medium ∨ size = large (GAMA et al., 2004).

• MIXED: has abrupt drift and instances without noise. Instances have four attributes: two boolean, v and w, and two numeric in [0,1]. If an instance satisfies at least two of the three conditions v, w, y < 0.5 + 0.3 × sin(3πx), it is classified as positive. After the drift, the classification is reversed (GAMA et al., 2004).

• Rotating Hyperplane Dataset: designed by (FAN, 2004). The (k, t) pairs detail each concept, and all files are available on the Internet 1 (WANG; ABRAHAM, 2015).

• Usenet1 and Usenet2: used in (KATAKIS; TSOUMAKAS; VLAHAVAS, 2008a) and available on the Internet 2. They collect posts from several newsgroups (e.g., medicine, space, baseball) of a user. The distinction between the datasets is the drift dimension: the Usenet1 dataset has a sharper topic drift (WANG; ABRAHAM, 2015).

• Usenet: text dataset inspired by Katakis et al. (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010), available on the Internet 3. Usenet simulates news filtering from the 20 Newsgroups collection with a change of interest of a user (concept drift). There are six topics; the user is interested in two but is subscribed to four.
It also simulates recurring concepts by repeating topics of interest (three concepts in the training data and three recurring concepts in the testing data). The dataset has 5,931 instances and 659 attributes, which are binary values (presence or absence of the respective word).

• SEA Concepts Dataset: proposed by Street and Kim (STREET; KIM, 2001), this dataset has 60,000 instances, 3 attributes, and 3 classes, with 10% of noise. The numeric attributes range between 0 and 10, with two of them relevant. Instances are divided into groups of 15,000 over four concepts. Each concept has a different threshold value (8, 9, 7, and 9.5), and the concept function assigning an instance to class 0 is relevant_feature1 + relevant_feature2 > Threshold. The dataset is available on the Internet 3 and is widely used by concept drift handling algorithms (WANG; ABRAHAM, 2015).

1 http://www.win.tue.nl/~mpechen/data/DriftSets/
2 http://mlkd.csd.auth.gr/concept_drift.html
3 http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift

2.3.1.2 Real-world Datasets

Regarding real-world datasets, the UCI machine learning repository (LICHMAN, 2013) is also frequently cited in the concept drift literature:

• KDD Cup 1999: the Knowledge Discovery and Data mining 1999 (KDD99) competition data contains simulated invasions in a military network domain. The complete dataset has 5,000,000 instances, and the dataset available on the Internet 3 contains only 10% of that size. The original dataset has 24 attack types; to simplify it into a binary-class problem, the attack-type labels are changed to abnormal and normal. To detect intrusions, one has to differentiate between attacks and normal connections. It includes a wide range of intrusions, so the attack class is not a minority class. The dataset has 494,020 instances, where each one stands for a connection with 41 attributes (i.e., connection length, protocol type, network service on the destination, etc.).
This dataset is tested in many concept drift handling algorithms (MINKU; YAO, 2012).

• NSL-KDD Database: a KDD99 version that solves issues of the former: it does not include redundant and duplicate instances. Each instance has 41 attributes, and the 25,192 instances are distributed into 23 classes, consisting of normal classes (13,449 instances) and intrusion classes, which can be of the following types: Denial of Service - DoS (9,234), Remote to User - R2U (209), User to Root - U2R (11), or Probing (2,289) (FARID et al., 2013).

• Large Soybean Database: there are 683 instances and 19 classes, each instance consisting of 35 attributes, some nominal and some ordered (LICHMAN, 2013) (some authors use all attributes as nominal, e.g., employing string rather than numerical values (FARID et al., 2013)).

• Image Segmentation Database: this dataset emphasizes the image segmentation and boundary detection domain. There are 2,310 instances with 19 attributes and 7 classes (outdoor images), including brickface, sky, foliage, etc. (FARID et al., 2013). A classification is created for every pixel from hand-segmented images, and each instance is a 3x3 region (LICHMAN, 2013).

• Adult: extracted from the U.S. Census Bureau with the aim of predicting whether a person earns more than $50,000 per year using 14 demographic features (i.e., age, level of education, marital status, occupation, gender). This dataset has 44,848 people, and 29.3% of them belong to the “over 50k” class (STREET; KIM, 2001).

• SEER Breast Cancer: used in (STREET; KIM, 2001), it contains the follow-up of 44,000 breast cancer patients from the Surveillance, Epidemiology, and End Results (SEER) program of the National Institutes of Health. Patients of class 1 died of breast cancer within five years of surgery, and patients of class 2 lived at least five years. The resulting dataset has 37,715 instances, 25.7% of them classified as class 1 (STREET; KIM, 2001).
• Covertype: comprises forest cover types from the US Forest Service. It has 581,000 instances, 54 attributes, and 7 classes. The authors in (MASUD et al., 2011b) and (ZAREMOODI; BEIGY; SIAHROUDI, 2015) normalized the data to have two or three classes in each batch, with the appearance of new random classes in some of them.

• Poker: each instance represents a hand, i.e., five cards drawn from a deck of 52. An instance has 10 attributes, each card having two attributes (suit and rank). “Poker Hand” is the class attribute. The order of the cards matters, which yields 480 Royal Flush hands in contrast to 4 (one for each suit) (LICHMAN, 2013).

Another real-world dataset frequently tested in the literature is the Electricity Market Dataset (ELEC2), first described by Harries (HARRIES, 1999). The goal of this dataset is to recognize whether the electricity price will increase or decrease (KOLTER; MALOOF, 2007). Data was collected from TransGrid, in the Australian New South Wales Electricity Market, in which the demand and supply of products affect prices. Harries (HARRIES, 1999) reports the seasonality of the price and its sensitivity to short-term events (like weather variations). The electricity market was extended to nearby areas: the excess production of one area can be sold in an adjacent one, which can dampen extreme prices. The ELEC2 dataset comprises 45,312 instances from 7 May 1996 to 5 December 1998. Each instance corresponds to a 30-minute period and has the following attributes: the weekday (an integer in [1,7]); the time stamp (a period of the day, a number in [1,48]); the New South Wales electricity demand (numeric); the Victoria electricity demand (numeric); the scheduled electricity transfer between states (numeric); and the class label (a binary value, up or down, indicating the price change over the last 24 hours).
What makes the dataset attractive are its real-world characteristics: not knowing whether there is a drift and when it occurs (GARCÍA et al., 2006).

Less-used real-world datasets can be mentioned as well:

• Calendar Apprentice (CAP): dataset used to predict user preferences for scheduling appointments in an academic institution (MITCHELL et al., 1994). The user preferences to be predicted are the location of the appointment, its duration, starting time, and weekday. An instance has 34 features – such as the type and scope of the meeting, the kind of participants, and whether it happens during lunchtime – with combinations of these features. There are 12 features for places, 11 for duration, 15 for start time, and 16 for the day of the week (KOLTER; MALOOF, 2007).

• PAKDD 2009: consists of data from the private-label credit card operation of a major Brazilian retail chain under stable inflation conditions. It has 50,000 instances over a one-year period, in which each instance represents a client through 27 attributes, such as sex, age, marital status, profession, income, etc. The class identifies whether the client is a “good” or a “bad” one, the latter being a minority class composed of 19.5% of the data (MINKU; YAO, 2012).

The next two high-dimensional datasets come from the e-mail filtering domain. The former depicts sudden drift and recurring contexts, and the latter depicts gradual drift. Both datasets are accessible in Weka (ARFF) format in a Boolean bag-of-words vector representation at http://mlkd.csd.auth.gr/conceptdrift.html (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010):

• Emailing List (elist) Dataset: consists of e-mail messages covering some topics, labeled as interesting or junk depending on the user's interest: the goal is to train and classify messages with user feedback. It collects messages from usenet posts of the 20 Newsgroups collection (KLINKENBERG; JOACHIMS, 2000). The selected topics are: science/medicine, science/space, and recreation/sports/baseball.
The dataset contains 1,500 instances and 913 attributes, corresponding to words found at least 10 times in the messages (Boolean bag-of-words). The instances are split into five periods of 300 instances each: in the first period, medicine is the interesting topic, and the topic of interest changes at the end of each period to simulate concept drift (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010).

• Spam Filtering Dataset: consists of e-mail messages from the Spam Assassin Collection. It has four segments: spam, spam2, ham (legitimate), and easy ham (legitimate messages that are easily identified). The spam ratio of the original set is nearly 20%. To turn it into longitudinal data, each e-mail's sent date and time is extracted and converted into the format yyyyMMddhhmmss (yyyy: year, MM: month, dd: day, hh: hours, mm: minutes, ss: seconds). All copies are kept, even if the user has more than one copy of the same e-mail, but attachments are removed. E-mails are represented with the Boolean bag-of-words approach. The dataset contains 9,324 instances and 500 attributes (words selected through feature selection with the χ² measure). This dataset exhibits gradual concept drift (KATAKIS; TSOUMAKAS; VLAHAVAS, 2010). A spam dataset consisting of 9,324 instances and 40,000 attributes, with gradual drift and a binary class (legitimate and spam) at a spam ratio of nearly 20%, is also available on the Internet.

2.4 Discussion

In this section, a summary concerning all reviewed works is presented and further discussed. Figure 2.5 shows the percentage of each learning method used in the articles, separated into supervised, unsupervised, and semi-supervised learning. Clearly, one can observe that supervised learning is by far the most widely employed methodology in the context of concept drift. Such numbers can be explained by the preference for supervised classifiers in most articles, and by the advantage of detecting changes in the data distribution when one has the class information of a sample.
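The Boolean bag-of-words representation with χ²-based feature selection used to build the e-mail datasets above can be sketched as follows. This is a minimal pure-Python illustration on toy messages, not the pipeline of the original datasets; the vocabulary threshold, function names, and example data are ours:

```python
from collections import Counter

def boolean_bow(messages, min_count=2):
    """Vocabulary = words appearing in at least `min_count` messages;
    each message becomes a Boolean presence vector over that vocabulary."""
    df = Counter(w for m in messages for w in set(m.lower().split()))
    vocab = sorted(w for w, c in df.items() if c >= min_count)
    vectors = [[int(w in set(m.lower().split())) for w in vocab] for m in messages]
    return vocab, vectors

def chi2_scores(vectors, labels, vocab):
    """Chi-squared association between each word and the binary label,
    computed from the 2x2 contingency table (present/absent vs. class)."""
    n = len(labels)
    n_pos = sum(labels)
    scores = {}
    for j, w in enumerate(vocab):
        a = sum(1 for v, y in zip(vectors, labels) if v[j] and y)      # present, positive
        b = sum(1 for v, y in zip(vectors, labels) if v[j] and not y)  # present, negative
        c = n_pos - a                                                  # absent, positive
        d = (n - n_pos) - b                                            # absent, negative
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[w] = 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
    return scores

msgs = ["cheap pills buy now", "meeting at noon",
        "buy cheap watches now", "lunch meeting today"]
y = [1, 0, 1, 0]  # 1 = spam, 0 = ham
vocab, X = boolean_bow(msgs)
scores = chi2_scores(X, y, vocab)
selected = sorted(scores, key=scores.get, reverse=True)[:3]  # top-scored words
```

Feature selection then keeps only the `selected` columns, which is how the 40,000 candidate words of the spam collection are reduced to the 500 attributes mentioned above.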
Figure 2.5: Percentage of the different learning approaches used in concept drift (supervised: 85%, unsupervised: 3%, semi-supervised: 12%).

Figure 2.6 shows the percentage of drift detection mechanism usage, i.e., whether the method is an active or a passive approach. An active approach has a drift detection method that informs whether and when a drift occurs. A passive approach has no drift detection mechanism: the algorithm assumes a drift may occur at any time and updates the model regardless. Using a drift detection mechanism requires a metric to determine that there is a drift, which can influence performance. On the other hand, not adopting a drift detection mechanism can influence the training time, i.e., the method has to decide when to update its model, which can be costly to do regularly. It is not clear which approach is the best, since both were implemented in nearly equal proportions, with the passive approach slightly more common.

Figure 2.6: Percentage of drift detector mechanisms (drift detector: 48%, no drift detector: 52%).

Figure 2.7 shows the percentage of classifiers the methods can handle, i.e., whether the method is specific to one classifier or can be used with any classifier. Notice that the methods mostly use a specific classifier to handle concept drift, which means the techniques are usually designed with a specific technique in mind, in some cases taking advantage of the classifier's characteristics within the method.

Figure 2.7: Percentage of specific-classifier versus any-classifier approaches (specific classifier: 72%, any classifier: 28%).

Figure 2.8 shows the percentage of classifiers used in the articles, the most used one being decision trees (DTs), followed by SVM and naïve Bayes. DTs are commonly used in ensemble learning due to their efficiency, thus turning out to be quite suitable for handling data streams efficiently.
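An active approach, as described above, couples the learner with a detector that monitors its error stream. The sketch below follows the spirit of the DDM detector surveyed in this chapter (warning and drift thresholds on the running error rate and its standard deviation); the class name, thresholds, and test stream are illustrative, not the exact published algorithm:

```python
import math

class DDMLikeDetector:
    """DDM-style detector: tracks the online error rate p_i and its std
    s_i = sqrt(p_i * (1 - p_i) / i); signals a warning when p_i + s_i
    exceeds p_min + 2 * s_min, and a drift when it exceeds p_min + 3 * s_min."""

    def __init__(self):
        self.i = 0                      # number of predictions seen
        self.errors = 0                 # number of wrong predictions
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the last prediction was wrong, 0 otherwise.
        Returns 'drift', 'warning', or 'stable'."""
        self.i += 1
        self.errors += error
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        if p + s < self.p_min + self.s_min:     # record the historical minimum
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"

# A stream whose error rate jumps from ~10% to persistent errors
# should trigger a drift only after the jump.
detector = DDMLikeDetector()
stream = [1 if k % 10 == 0 else 0 for k in range(100)] + [1] * 30
states = [detector.update(e) for e in stream]
```

In a full system, a "warning" would start buffering recent samples, and a "drift" would trigger retraining the base classifier on that buffer.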
Since SVM and naïve Bayes are very popular classifiers in the community, it is expected that they have been employed more regularly.

Figure 2.8: Percentage of the classifiers used in the works considered in this survey (SVM: 15%, decision trees: 23%, naïve Bayes: 14%, k-means clusterer: 6%, k-nearest neighbor (k-NN): 7%, logistic regression: 3%, neural networks: 8%, regression tree: 2%, ELM: 3%, others: 19%).

Figure 2.9 shows the percentage of the types of concept drift handling used in the articles. The most used approaches are instance selection and ensemble learning. Instance selection can be easier to implement, i.e., it selects instances within a fixed or dynamic sliding window, considering recent samples as the more significant ones. Ensemble updating is easy to perform as well, since it maintains a dynamic set of classifiers that are updated according to some criteria.

Figure 2.10 shows the percentage of methods using online learning (which evolves and updates a model as instances are processed) and batch learning (which learns by examining a collection of instances at once), with batch learning slightly more common among the evaluated methods. The choice between them depends on the classifier and on the method's implementation, among other details.

Figure 2.11 shows the percentage of other issues addressed in the articles in addition to concept drift handling, the most addressed drawbacks being imbalanced data (when the class distribution is imbalanced, i.e., there are minority and majority classes), followed by new concept handling (or concept evolution: the emergence of a new concept in the environment, mainly in unsupervised learning), and recurring concepts (when old concepts may re-appear in the future).
Figure 2.12 shows the percentage of binary classifiers (when the problem has only two classes, e.g., relevant or irrelevant, positive or negative) and multiclass classifiers used in the articles, both being nearly equally used (multiclass classifiers are slightly more frequent). Figure 2.13 shows the percentage of each dataset used in the articles, the most used dataset being "KDD Cup 1999," followed by the "STAGGER," "Electricity," "Hyperplane," and "SEA" datasets, and then the "Gauss" and "Forest Cover type" datasets.

Figure 2.14 shows the percentage of methods compared in the articles, the most used being the window-of-fixed-size, DDM, the Learn++ family, and methods that do not handle concept drift, followed by full-memory methods, EDDM, CVFDT, and DWM. The window-of-fixed-size is widely used due to its implementation simplicity for new classifiers in nonstationary environments; DDM is a popular drift detector for active approaches; and the Learn++ family offers different techniques to deal with concept drift.

Finally, Figure 2.15 shows the percentage of articles published per year. Notice that 2015 had the most articles published, followed by 2017. Moreover, one can observe that this area of research has received growing attention year after year, showing the increasing interest of the scientific community. It is worth noting that the analyzed articles were collected up to 2017; the review was written in 2018 and published between the end of that year and the beginning of 2019.

Figure 2.9: Concept drift handling approaches (instance selection: 44%, instance weighting: 4%, ensemble learning: 41%, clustering: 8%, sampling: 3%).

Figure 2.10: Online versus batch approaches (online learning: 47%, batch learning: 53%).

2.5 Conclusions

In this work, we presented the concept drift problem, classifying the variants of the target concept in different forms.
We also described some types of algorithms to deal with concept drift, including instance selection (window-based) approaches, weight-based approaches, and ensembles of classifiers. The window-based approaches can be full-memory, no-memory, window-of-fixed-size, or window-of-adapting-size, depending on the treatment of the batches. Ditzler and Polikar (DITZLER; POLIKAR, 2013) characterized concept drift algorithms in other ways, such as online vs. batch approaches, single-classifier vs. ensemble-based approaches, incremental vs. non-incremental approaches, and active vs. passive approaches.

Figure 2.11: Other issues addressed in the articles (recurring concept: 10%, new concept: 11%, imbalanced data: 13%, noise tolerance: 5%, missing attribute values: 1%, diversity: 2%, none: 58%).

Figure 2.12: Number of classes recognized by the methods (multiclass classification: 51%, binary classification: 49%).

We also summarized some techniques available in the literature that detect or deal with concept drift, such as the Learn++ family, DDM, and DWM, in addition to some real-world and synthetic datasets used in the literature to test and simulate concept drift environments, such as the Electricity, KDD Cup 1999, STAGGER, Hyperplane, and SEA datasets. Finally, we summarized the articles considered in this overview in percentage charts.

Figure 2.13: Datasets used in the articles.

Future directions in the area may address other issues in addition to concept drift handling, such as imbalanced data, new concept handling, and recurring concepts, as well as more studies in unsupervised environments.
Figure 2.14: Methods compared in the articles.

Figure 2.15: Articles separated by year.

Chapter 4 is preceded by the theoretical background below.

Chapter 3

OPTIMUM PATH FOREST

In this chapter, we present the theoretical background regarding the OPF classifier, which comprises two different learning procedures: a supervised and an unsupervised one.

3.1 Supervised Classification

The supervised OPF has two approaches: one that uses a complete graph (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012) and another that uses a k-nn graph (PAPA; FALCÃO, 2008, 2009; PAPA; FERNANDES; FALCÃO, 2017), the former being the most used and widespread. We first present the supervised OPF with the complete graph, and then the supervised OPF with the k-nn graph.

3.1.1 Complete Graph

Regarding the OPF with complete graph, let Z = Z1 ∪ Z2 be a labeled dataset, where Z1 and Z2 stand for the training and test sets, respectively. Let λ(s) be the function that associates the correct label to any sample s ∈ Z1 ∪ Z2, such that λ(s) ∈ {1, 2, ..., c} and s ∈ ℜ^n. Let S ⊂ Z1 be the set of prototype samples (i.e., the most important samples, which better represent the classes), and d(s, t) the distance between two samples s and t. The problem consists in using S, d, and Z1 to build an optimal classifier that can predict the correct label λ(s) of any sample s ∈ Z2.

Roughly speaking, the OPF classifier models pattern recognition as a graph partition task, where each node encodes a dataset sample and is connected to the others by means of a predefined adjacency relation. The partition is ruled by a competition process among prototype samples, which try to conquer the remaining ones by offering them optimum-path costs.
When a sample is conquered, it receives the label of its conqueror together with an updated cost. At the end of the process, we have a collection of optimum-path trees (OPTs), each rooted at a prototype.

Let (Z1, A) be a complete graph in which the nodes are the samples in Z1 and any pair of samples defines an edge in A. The edges are weighted by the distance d between their corresponding nodes (Figure 3.1a). A path in Z1 is a sequence of samples π = ⟨s1, s2, ..., sk⟩, where (si, si+1) ∈ A for 1 ≤ i ≤ k − 1. A path is said to be trivial when π = ⟨s1⟩. A path-cost function f(·) associates a cost f(π) to each path π, and a path π′ is said to be optimum if f(π′) ≤ f(τ) for any path τ, where π′ and τ end at the same sample s, regardless of their origin. We also denote by π · (s, t) the concatenation of the path π ending at s and the edge (s, t) ∈ A.

The OPF algorithm can be used with any smooth path-cost function, which can combine samples with similar properties (FALCÃO; STOLFI; LOTUFO, 2004). Papa et al. (PAPA; FALCÃO; SUZUKI, 2009) opted to use the f_max cost function due to its theoretical properties for estimating optimum prototypes, defined as follows:

\[
f_{\max}(\langle s \rangle) =
\begin{cases}
0 & \text{if } s \in \mathcal{S}, \\
+\infty & \text{otherwise,}
\end{cases}
\qquad
f_{\max}(\pi \cdot (s,t)) = \max\{f_{\max}(\pi), d(s,t)\}.
\tag{3.1}
\]

Notice that f_max(π) computes the maximum distance among adjacent samples in π when π is not a trivial path.

The OPF comprises a training and a classification step: the former builds the optimum-path forest over the training set using f_max and S, while the classification phase simply finds the training sample that conquers a given test node (details below). Papa et al. (PAPA; FALCÃO; SUZUKI, 2009) proposed to compose S with the nearest nodes from different classes in Z1, since such samples fall on the boundaries between classes, thus being informative enough for the learning process.
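The competition with f_max (Eq. 3.1) and the boundary-based prototype selection can be sketched as follows. This is a didactic pure-Python rendering on a toy 2-D set, not the LibOPF implementation; prototypes come from a minimum spanning tree (as detailed next in the text), and the competition runs as a Dijkstra-like propagation over a priority queue:

```python
import heapq, math

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

def mst_prototypes(X, y):
    """Prim's MST over the complete graph; samples joined by an MST edge
    whose endpoints have different labels become prototypes."""
    n = len(X)
    in_tree, best, parent = [False] * n, [math.inf] * n, [None] * n
    best[0] = 0.0
    prototypes = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None and y[u] != y[parent[u]]:
            prototypes.update({u, parent[u]})
        for v in range(n):
            if not in_tree[v] and dist(X[u], X[v]) < best[v]:
                best[v], parent[v] = dist(X[u], X[v]), u
    return prototypes

def opf_train(X, y):
    """Competition among prototypes with f_max (Eq. 3.1)."""
    n = len(X)
    S = mst_prototypes(X, y)
    C = [0.0 if i in S else math.inf for i in range(n)]   # optimum costs
    L = [y[i] if i in S else None for i in range(n)]      # propagated labels
    heap = [(0.0, i) for i in S]
    heapq.heapify(heap)
    done = [False] * n
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue
        done[s] = True
        for t in range(n):
            if not done[t]:
                cost = max(C[s], dist(X[s], X[t]))        # f_max path extension
                if cost < C[t]:
                    C[t], L[t] = cost, L[s]
                    heapq.heappush(heap, (cost, t))
    return C, L

def opf_classify(x, X, C, L):
    """Eq. 3.2: the training node offering the minimum cost conquers x."""
    costs = [max(C[p], dist(x, X[p])) for p in range(len(X))]
    return L[min(range(len(X)), key=costs.__getitem__)]

X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]      # two toy clusters
y = [0, 0, 0, 1, 1, 1]
C, L = opf_train(X, y)
print(opf_classify((0.5, 0.5), X, C, L),
      opf_classify((5.5, 5.5), X, C, L))                  # → 0 1
```

On this toy set, the MST joins the two clusters through one inter-class edge, so exactly one prototype per class is elected and each test point is conquered by its own cluster.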
To find them, one just needs to compute a minimum spanning tree in Z1 (Figure 3.1b) and then select the connected samples with different labels as the prototype nodes (Figure 3.1c).

Roughly speaking, the OPF training algorithm associates with every sample s ∈ Z1 an optimum path P*(s) from S, thus building an optimum-path forest P (a predecessor map without cycles that assigns to each s ∈ Z1 its predecessor P(s) in P*(s), or nil when s ∈ S). Let R(s) ∈ S be the root of P*(s), which can be reached by following P(s). The OPF algorithm computes, for each s ∈ Z1, the cost C(s) of P*(s), the label L(s) = λ(R(s)), and its predecessor P(s).

Figure 3.1: OPF training step: (a) complete weighted graph representing the training set and its (b) MST, (c) prototypes highlighted, and (d) resulting optimum-path forest. OPF classification step: (e) a testing sample (triangle) is added to the graph and connected to all training samples, and (f) the testing sample is conquered by the sample that offered the lowest cost.

The classification step is straightforward: given a test sample t ∈ Z2, we connect it to all training nodes (Figure 3.1e) of the optimum-path forest generated in the training phase (Figure 3.1d) and evaluate which node p ∈ Z1 minimizes the equation:

\[
C(t) = \min_{\forall p \in \mathcal{Z}_1}\{\max\{C(p), d(p,t)\}\}.
\tag{3.2}
\]

Thus, the node p ∈ Z1 that minimizes C(t) is the one that conquers t (Figure 3.1f).

3.1.2 K-nn Graph

The OPF with k-neighborhood (OPFknn) differs in some points from the complete-graph variant (PAPA; FALCÃO; SUZUKI, 2009; PAPA; FALCÃO, 2009; PAPA; FERNANDES; FALCÃO, 2017).
The adjacency relation (A_k), in this case, connects each sample to its k-nearest neighbors; the prototypes are now estimated as the nodes located at the highest-density regions; and the path-cost function aims at maximizing the cost of every sample. Roughly speaking, OPFknn comprises two phases: training and classification. The former is responsible for computing the density of each training node using A_{k*}, where k* ∈ [1, kmax] is the value of k (the neighborhood size) that maximizes some criterion, and then for starting the competition process among prototypes. To compute the probability density function (pdf) ρ(s) of each node s, one can use the following formulation:

\[
\rho(s) = \frac{1}{\sqrt{2\pi\sigma^2}\,|\mathcal{A}^*_k(s)|} \sum_{\forall t \in \mathcal{A}^*_k(s)} \exp\left(\frac{-d^2(s,t)}{2\sigma^2}\right),
\tag{3.3}
\]

where |A*_k(s)| = k*, σ = d_f/3, and d_f is the maximum arc weight in the training set. The competition process among prototypes is conducted to conquer the training samples by offering them optimum paths according to a path-cost function f_min, given by:

\[
f_{\min}(\langle t \rangle) =
\begin{cases}
\rho(t) & \text{if } t \in \mathcal{S}, \\
\rho(t) - 1 & \text{otherwise,}
\end{cases}
\qquad
f_{\min}(\pi_s \cdot \langle s,t \rangle) = \min\{f_{\min}(\pi_s), \rho(t)\},
\tag{3.4}
\]

where ρ(t) − 1 is employed to avoid plateaus near the maxima of the pdf. The classification process picks a sample from the test set, connects it to its k*-nearest neighbors in the optimum-path forest generated by the training phase, and then uses the same OPFknn rule employed in the competition process to conquer that sample.

3.2 Unsupervised Classification

The unsupervised OPF was proposed by Rocha et al. (ROCHA; CAPPABIANCO; FALCÃO, 2009) as a data clustering method based on optimum-path forests. Let D be an unlabeled dataset consisting of samples from a given application. Each sample s ∈ D is usually represented by a feature vector v(s), and the distance between samples s and t is given by a function d(s, t).
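The Gaussian k-nn density estimate of Eq. (3.3), which reappears in the unsupervised variant below, can be sketched as follows. This is a pure-Python illustration on toy points; the selection of the best k by maximizing a criterion is omitted, and k is fixed by hand:

```python
import math

def knn_density(X, k):
    """Gaussian kernel density over each sample's k-nearest neighbors
    (Eq. 3.3): rho(s) = sum_t exp(-d(s,t)^2 / (2*sigma^2))
    / (sqrt(2*pi*sigma^2) * k), with sigma = d_f / 3 and d_f the
    maximum arc weight among the k-nn arcs."""
    n = len(X)
    neighbors = []
    for s in range(n):
        arcs = sorted((math.dist(X[s], X[t]), t) for t in range(n) if t != s)[:k]
        neighbors.append(arcs)
    d_f = max(d for nb in neighbors for d, _ in nb)   # maximum arc weight
    sigma = d_f / 3.0
    rho = []
    for nb in neighbors:
        total = sum(math.exp(-d * d / (2 * sigma ** 2)) for d, _ in nb)
        rho.append(total / (math.sqrt(2 * math.pi * sigma ** 2) * k))
    return rho

# Points inside a tight cluster get a higher density than a far outlier,
# so they are the ones elected as prototypes (pdf maxima).
X = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
rho = knn_density(X, k=2)
```

Because prototypes are taken at the maxima of this pdf, the outlier at (5, 5) would never be elected on this toy set.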
The unsupervised OPF models the samples s as nodes of a graph (D, A_k), the arcs (s, t) ∈ A_k connecting each sample to its k-nearest neighbors in the feature space. The nodes s ∈ D are weighted by a probability density value ρ(s) given by:

\[
\rho(s) = \frac{1}{\sqrt{2\pi\sigma^2}\,|\mathcal{A}_k(s)|} \sum_{\forall t \in \mathcal{A}_k(s)} \exp\left(\frac{-d^2(s,t)}{2\sigma^2}\right),
\tag{3.5}
\]

where |A_k(s)| = k, σ = d_f/3, and d_f is the maximum arc weight in (D, A_k). The choice of the best value of k (k*) within [kmin, kmax], 1 ≤ kmin < kmax ≤ |D|, considers the minimum graph cut among all clustering results for k ∈ [1, kmax] (kmin = 1). A path π_t is a sequence of adjacent samples starting from a root R(t) and ending at a sample t, with π_t = ⟨t⟩ being a trivial path and π_s · ⟨s, t⟩ the concatenation of π_s and the arc (s, t). A connectivity function f assigns to each path π_t a value f(π_t). A path π_t is considered optimum if f(π_t) ≥ f(τ_t) for any other path τ_t, and f(π_t) is defined by:

\[
f(\langle t \rangle) =
\begin{cases}
\rho(t) & \text{if } t \in \mathcal{R}, \\
\rho(t) - \delta & \text{otherwise,}
\end{cases}
\qquad
f(\pi_s \cdot \langle s,t \rangle) = \min\{f(\pi_s), \rho(t)\},
\tag{3.6}
\]

where δ = min_{(s,t) ∈ A_k | ρ(t) ≠ ρ(s)} |ρ(t) − ρ(s)| and R is the root set (one element per maximum of the pdf, discovered on-the-fly).

3.3 Accuracy Measure

In the works developed in this thesis, we considered the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009), which takes unbalanced datasets into account. The accuracy Acc is measured as:

\[
Acc = \frac{2c - \sum_{i=1}^{c} E(i)}{2c} = 1 - \frac{\sum_{i=1}^{c} E(i)}{2c},
\tag{3.7}
\]

where N(i), i = 1, 2, ..., c, is the number of samples in the dataset Z from class i, and E(i) is the partial sum error of class i.

Chapter 4

LEARNING CONCEPT DRIFT WITH OPTIMUM-PATH FOREST

This chapter presents the work by Iwashita and Papa (IWASHITA; PAPA, 2018), which evaluates the supervised OPF under different approaches to deal with concept drift.

4.1 Introduction

Our proposal is to assess the OPF robustness in concept drift environments and verify how it behaves in such situations.
The main contribution of this work is to evaluate two different OPF classifiers in this context under three different perspectives (i.e., no-memory, full-memory, and window-based approaches) on public datasets.

4.2 Experiments

We evaluated three different approaches to address the problem of concept drift with the traditional OPF (i.e., with complete graph) and OPFknn, as described in Subsection 2.2.1: "full-memory," "no-memory," and "window-of-fixed-size three." The synthetic datasets used in this work are: the Hyperplane dataset, consisting of 10 features and 90,000 samples with a drift at every 10,000 samples; and the Usenet 1 and Usenet 2 datasets, each consisting of 99 features and 1,500 samples with a drift at every 300 samples. As the datasets are labeled, all experiments performed in this work and in the following chapters are considered to have "zero latency." We chose synthetic public datasets widely used by the articles covered in our survey, along with other datasets with interesting subjects.

In order to evaluate the techniques, we divided each dataset into 30 equal batches, aiming to simulate a stream of data. For the sake of explanation, consider the Hyperplane dataset, which contains 90,000 samples, meaning each batch comprises 3,000 samples. Since the concept drifts after every 10,000 samples, the stream behavior changes every 3.3 batches (highlighted regions in Figure 4.1). Additionally, the naïve OPF (i.e., without support for handling concept drift) is compared against OPF-nomemory, OPF-fullmemory, and OPF-window3. The latter classifier works with a window of 3 batches.

Since OPFknn requires one additional step to set up the kmax parameter, which limits the neighborhood size when computing the k-nearest neighbors, we employed the following methodology: for almost all datasets, we set kmax to 10% of their sizes.
The only exception concerns the full-memory experiment over the Hyperplane dataset, in which we used 0.1% of the dataset size. This difference in the kmax value is due to the size of the dataset: since the full-memory experiment keeps adding new samples to the training set, it becomes too costly in terms of computational burden, and with the larger kmax the experiment's running time exceeded our expectations to the point that we had to abort it.

Although the idea is to assess the OPF robustness under concept drift, we also included a naïve SVM with a Radial Basis Function kernel optimized through cross-validation for comparison purposes. The "traditional SVM" parameters were optimized as follows: we divided the first batch into 50% to compose the training set and the remaining 50% to compose the validation set (used to optimize kernel parameters). We also considered the following proportions for the training and validation sets: 60%−40%, 70%−30%, 80%−20%, and 90%−10%. Such a procedure was performed for all datasets, and we used the parameters that maximized the SVM accuracy over the validation set among the aforementioned configurations (i.e., training and validation set percentages). The "no-memory," "full-memory," and "window-3" SVM versions use a grid-search parameter estimation methodology with 90%−10% training and validation sets, respectively. Regarding the SVM implementation, we used scikit-learn (PEDREGOSA et al., 2011), and with respect to OPF, we employed LibOPF (https://github.com/jppbsi/LibOPF). Note that the time comparison may not be entirely fair, because the libraries are written in different languages (the former in Python and the latter in C). Finally, we employed the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009).

Figures 4.1 to 4.3 depict the accuracy rates of all aforementioned techniques considering the datasets employed in this work. Let us have a look at Figure 4.1, which presents the results for the Hyperplane dataset. SVM has a higher accuracy than the OPF-based classifiers.
Also, we observe that OPFknn obtained better results than the OPF with complete graph. The OPFknn version is the most time-consuming one regarding training, while the SVM variants demand slightly more training time than the OPF-based classifiers.

Figure 4.1: Accuracy over the Hyperplane dataset stream. Some concept drifts are highlighted.

Figure 4.2 depicts the results over the Usenet1 dataset stream. Traditional SVM did better in batches #1, #7 (tied with traditional OPF), and #25. The concept drift-oriented SVM versions performed better in many batches, including batches #4 to #6, #10, #12, #15, #17, #20, #22 to #24, and #27 to #30. SVM-window3 had the best performance among them, and traditional OPF obtained the best results in batches #7 and #13. The concept drift-oriented OPF versions achieved better results in batches #9, #16, #18 (tied with OPFknn-window3), #19, and #26. After the last concept drift in batch #25, OPF-fullmemory obtained the worst accuracy until batch #29: OPF-fullmemory ends up penalized by a large training set containing fruitless information. The concept drift-oriented OPFknn versions performed better in batches #3, #8, #11, #14, #18 (tied with OPF-window3), and #21. As a matter of fact, OPFknn showed to be quite competitive when compared to the standard OPF, its only drawback being the kmax parameter, which needs to be fine-tuned.

Figure 4.3 depicts the results over the Usenet2 dataset stream. SVM performed well in batches #1, #8, #10, #23, and #27. The concept drift-oriented SVM versions performed better in batches #3, #4, #11, #12, #14, #16, #20, #26, #28, and #30, and matched the performance of the OPF and OPFknn versions in batches #5, #6, and #17. OPF did not obtain good results, whereas its concept drift-oriented versions performed much better, standing out in batches such as #6, #13, #15, and #22.
The OPFknn versions achieved better accuracies in batches #7, #9, #18, #19, #21, #24, #25, and #29, and the same performance in batches #5, #6, and #17. In batches #7, #13, #19, and #25, in which a concept drift occurs, the accuracy of all versions drops, but in batches #19 and #25 the OPFknn-fullmemory maintains a similar performance.

Figure 4.2: Accuracy over the Usenet1 dataset stream. Some concept drifts are highlighted.

Figure 4.3: Accuracy over the Usenet2 dataset stream. Some concept drifts are highlighted.

4.3 Partial Considerations

In this work, we dealt with the problem of concept drift concerning the OPF classifier. The experiments over synthetic datasets compared the standard OPF, OPFknn, SVM, and three distinct concept drift-oriented versions of each technique. We have shown that OPF is suitable for such dynamic scenarios, since its recognition rates were considerably better when adapted to address concept drift. Compared to the SVM versions, the OPF versions have lower training times and, on some datasets, lower testing times as well, which is a useful benefit in this type of environment. The OPF versions have been consistently more efficient than the SVMs for training purposes. The main problem related to SVM still lies in the fine-tuning step, which is critical in situations that require online training: the classifier needs to be retrained every time a new batch arrives, and the response needs to be efficient as well, since we have a stream of data that must be processed on-the-fly.

Chapter 5

LEARNING CONCEPT DRIFT WITH ENSEMBLES OF OPTIMUM-PATH FOREST-BASED CLASSIFIERS

This chapter discusses the work presented by Iwashita et al. (IWASHITA; ALBUQUERQUE; PAPA, 2019), which proposes an ensemble-based OPF to deal with concept drift. It also presents experiments concerning an OPF dynamic-window approach to detect drift based on accuracy values.
5.1 Introduction

The main contribution of this work is to propose an ensemble-based approach composed of a committee of OPF classifiers to cope with the problem of concept drift handling. We considered three different perspectives, with variations of stream management and classifiers, on public datasets, the results being compared to those obtained by the standard OPF and by OPF with simple concept drift handling (no-memory, full-memory, and window-based approaches).

5.2 Proposed Approach

In this section, we evaluate three different ensemble-based approaches in three different scenarios to address the problem of concept drift, with the data management described in Subsection 2.2.1: "full-memory" (OPF-fullmemory), "no-memory" (OPF-nomemory), and "window-of-fixed-size three" (OPF-window3). The experiments used real-world and synthetic datasets, the latter being of great importance, since they shape an environment in which we know that concept drift really occurs, as well as which kind of change is occurring (i.e., gradual or abrupt). Although ensembles of OPF-based classifiers have been evaluated before (RIBEIRO; PAPA; ROMERO, 2017; FERNANDES; PAPA, 2017; FERNANDES et al., 2017), they were not considered in the context of concept drift.

Table 5.1 presents the main information concerning the synthetic datasets used in this work.

Table 5.1: Synthetic Datasets

Dataset      # of samples   Drift time                 # of features
Hyperplane   90,000         at every 10,000 samples    10
Usenet 1     1,500          at every 300 samples       99
Usenet 2     1,500          at every 300 samples       99
SEA          60,000         at every 15,000 samples    3

In regard to the real-world datasets, we used the following (Table 5.2):

Table 5.2: Real-world Datasets

Dataset            # of samples   # of features
Forest Covertype   581,012        54
Electricity        45,312         8
Poker Hand         829,201        10

We chose synthetic and real-world datasets widely used by the articles covered in our survey, along with other interesting datasets available on the Internet.
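The stream simulation used throughout the experimental chapters, i.e., splitting a static dataset into equal batches and testing each model on the following batch, can be sketched as follows. The toy data, the majority-per-feature model, and the evaluation loop (a no-memory management) are all illustrative choices of ours:

```python
from collections import Counter, defaultdict

def split_into_batches(samples, n_batches=30):
    """Split a dataset into equally sized batches to simulate a stream
    (any remainder samples are dropped, an illustrative choice)."""
    size = len(samples) // n_batches
    return [samples[i * size:(i + 1) * size] for i in range(n_batches)]

def stream_eval(batches, fit, predict):
    """Test-then-train over the simulated stream: each batch is classified
    by a model trained on the previous batch (no-memory management)."""
    accuracies = []
    for prev, cur in zip(batches, batches[1:]):
        model = fit(prev)
        hits = sum(predict(model, x) == y for x, y in cur)
        accuracies.append(hits / len(cur))
    return accuracies

def fit(batch):
    # Toy learner: majority label observed for each feature value.
    votes = defaultdict(Counter)
    for x, y in batch:
        votes[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in votes.items()}

def predict(model, x):
    return model.get(x, 0)

# Toy stream with an abrupt drift halfway: the label rule flips.
data = [(k % 2, k % 2) for k in range(300)] + \
       [(k % 2, 1 - k % 2) for k in range(300)]
batches = split_into_batches(data, n_batches=10)
accs = stream_eval(batches, fit, predict)   # accuracy collapses at the drift batch
```

A model trained just before the drift scores 0 on the first post-drift batch and recovers on the next one, which is exactly the accuracy-drop pattern highlighted in the stream figures of Chapter 4.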
In order to evaluate the techniques, we divided each dataset into 30 equal batches, aiming to simulate a stream of data. For the sake of explanation, consider the SEA dataset, which contains 60,000 samples, meaning each batch comprises 2,000 samples. Since the concept drifts after every 15,000 samples (Table 5.1), the stream behavior changes every 7.5 batches. Therefore, the main idea of this work is to evaluate whether an ensemble version of OPF is robust enough to handle such situations. Additionally, the naïve OPF (i.e., without support for handling concept drift) is compared against OPF-nomemory, OPF-fullmemory, and OPF-window3 (with a window of 3 batches). With respect to the OPF implementation, we employed LibOPF (https://github.com/jppbsi/LibOPF). Additionally, we considered the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009).

The OPF ensemble employs three base classifiers and combines their results in three different ways (three voting mechanisms):

• Combined: the classification result is the most voted output among the base classifiers, each learner having the same weight.

• Weighted: each base classifier receives a different weight for its vote, based on its classification accuracy on the previous batch.

• Major: the base classifier with the highest accuracy on the previous batch receives an additional weight.

With the voting mechanisms described above, we designed three rounds of experiments with variations of stream management, as described below:

• Experiment 1: it employs OPF-fullmemory, OPF-nomemory, and OPF-window3 as base classifiers (Figure 5.1), each approach handling the stream of batches as described in Subsection 2.2.1 with the OPF classifier. The outputs of the three approaches are then combined using the "combined," "weighted," and "major" schemes.
Figure 5.1: Pictorial example of the approach named "Experiment 1."

• Experiment 2: it partitions the stream of batches into three different portions of training data for each base classifier, hereinafter called "fold1," "fold2," and "fold3," considering the data management described in Subsection 2.2.1. In the full-memory (Figure 5.2) and no-memory (Figure 5.3) experiments, since there is a single training batch, it is partitioned into three distinct subsets, and one OPF classifier is trained on each subset. In the window-3 approach (Figure 5.4), the three batches are merged into one and then partitioned into three subsets. The results of fold1, fold2, and fold3 are further combined using the "combined," "weighted," and "major" schemes.

• Experiment 3: one classifier is trained for each batch. The ensemble considers the last three models for each new batch classification procedure, similarly to the window-3 management (Figure 5.5). The results are combined using the "combined" and "weighted" schemes.

Figure 5.2: Experiment 2 with the full-memory approach.

Figure 5.3: Experiment 2 with the no-memory approach.

Figure 5.4: Experiment 2 with the window-of-fixed-size three.
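The three voting mechanisms can be sketched as follows. This is a minimal illustration with hypothetical base-classifier outputs; the function names, the extra weight in the "major" scheme, and the example accuracies are ours:

```python
from collections import Counter

def combined_vote(predictions):
    """'Combined': plain majority vote, every base classifier weighs the same."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def weighted_vote(predictions, weights):
    """'Weighted': each classifier's vote counts proportionally to its weight
    (e.g., 0.8 / 0.5 / 0.3, ranked by accuracy on the previous batch)."""
    results = []
    for votes in zip(*predictions):
        score = Counter()
        for vote, w in zip(votes, weights):
            score[vote] += w
        results.append(score.most_common(1)[0][0])
    return results

def major_vote(predictions, accuracies, bonus=1.0):
    """'Major': like 'combined', but the classifier with the highest accuracy
    on the previous batch gets an additional weight (the bonus is illustrative)."""
    best = max(range(len(accuracies)), key=accuracies.__getitem__)
    weights = [1.0 + (bonus if i == best else 0.0) for i in range(len(accuracies))]
    return weighted_vote(predictions, weights)

# Three base classifiers' outputs over a 4-sample batch:
preds = [[0, 1, 1, 0],   # e.g., OPF-fullmemory
         [0, 1, 1, 1],   # e.g., OPF-nomemory
         [1, 0, 1, 0]]   # e.g., OPF-window3
print(combined_vote(preds))                       # → [0, 1, 1, 0]
print(weighted_vote(preds, [0.8, 0.5, 0.3]))
print(major_vote(preds, [0.9, 0.6, 0.7]))
```

Each scheme only post-processes the base classifiers' predictions, so the three experiments above can reuse the same functions regardless of how the training folds are built.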
Figure 5.5: Experiment 3.

With respect to the weight values, we adopted the following methodology²: concerning Experiments 1 and 2, the classifiers are weighted according to their accuracies, i.e., the approach with the highest accuracy receives a weight of 0.8, followed by the second-best one with a weight of 0.5, and the last one with a weight of 0.3. In regard to Experiment 3, the most recent batch receives the highest weight, i.e., supposing we are classifying batch i, the training batch i−1 has a weight of 0.4, batch i−2 a weight of 0.35, and batch i−3 a weight of 0.25. The experiments were performed on a computer with an Intel® Core(TM) i5-4690 processor, a Z97M-PLUS/BR motherboard, an NVIDIA GM107 [GeForce GTX 750] graphics processing unit, and 8 GB of Corsair DDR3 1600 MHz RAM.

²Notice that the weights were chosen empirically.

5.3 Experimental Results

In this section, we evaluated the three experimental approaches described in Section 5.2.

5.3.1 Experiment 1

As aforementioned, the idea of this experiment is to consider all outputs given by the full-memory, no-memory, and window-of-size-three approaches. Figure 5.6 depicts the results concerning the Covertype dataset. Clearly, one can observe that standard OPF, OPF-fullmemory, and the proposed OPF with majority voting ("Major") obtained the best results. OPF-fullmemory and OPF with majority voting were consistently similar to each other, which reflects the fact that OPF-fullmemory was also the best approach among the ones used to handle concept drift. Table 5.3 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.6: Experimental results concerning the Covertype dataset with respect to Experiment 1.

Figure 5.7 depicts the results concerning the Electricity dataset.
The OPF versions with concept drift handling (OPF-fullmemory, OPF-nomemory, and OPF-window3) and the ensemble-based versions obtained higher accuracy when compared to traditional OPF. Only in batch #5 was the traditional OPF more accurate than OPF-nomemory. Table 5.4 presents the mean accuracy and standard deviation concerning all batches.

Table 5.3: Covertype dataset with respect to Experiment 1.
Type             Accuracy
OPF              82.32752648±5.527916684
OPF-nomemory     66.340352±6.205296248
OPF-fullmemory   81.51895655±6.360053157
OPF-window3      66.63647852±6.483513342
OPF-combined     68.55474907±6.447446854
OPF-weighted     70.03032828±6.163191174
OPF-major        81.3674181±6.822249211

Figure 5.7: Experimental results concerning the Electricity dataset with respect to Experiment 1.

The combined and weighted versions obtained similar behavior, whereas the major version performed better in some batches. Compared to the previous dataset, shown in Figure 5.6, one can now realize the importance of handling concept drift, since in those results standard OPF (i.e., with no concept drift handling) performed very well. The main idea in using a dataset where standard OPF can outperform some concept-drift-aware approaches is precisely to show that the proposed approach can obtain the best results in any situation, i.e., with or without a strong concept drift.

Table 5.4: Electricity dataset with respect to Experiment 1.
Type             Accuracy
OPF              56.94443552±6.002938206
OPF-nomemory     67.06707179±4.373578762
OPF-fullmemory   66.73389428±4.523784782
OPF-window3      66.95818376±4.505452085
OPF-combined     67.2960539±4.434176779
OPF-weighted     67.2960539±4.434176779
OPF-major        67.23050776±4.250484987

Figure 5.8 depicts the results concerning the Hyperplane dataset. The combined and weighted versions obtained similar performances.
The combined version has a drop in performance in batches #11 and #21, whereas OPF-nomemory performs better in batches #6, #10, #16, #20, #22, #23, among others. The major version performs similarly to OPF-nomemory; these emerged as the two best techniques for this dataset.

Figure 5.8: Experimental results concerning the Hyperplane dataset with respect to Experiment 1.

Table 5.5: Hyperplane dataset with respect to Experiment 1.
Type             Accuracy
OPF              62.48453148±7.664021067
OPF-nomemory     69.01541517±7.265204095
OPF-fullmemory   60.77733197±6.950052568
OPF-window3      63.79282938±7.071207897
OPF-combined     66.90718755±6.359478356
OPF-weighted     66.90718755±6.359478356
OPF-major        67.042868±7.594141989

Figure 5.9 depicts the results concerning the Poker dataset. The weighted version performs better than the combined version, whereas the major version performs similarly to OPF-fullmemory. Once again, these last two approaches obtained the best results so far. The accuracies vary considerably, thus degrading the performance of standard OPF and OPF-nomemory. Some batch transitions clearly expose the problem of concept drift in this dataset. Consider the transition between batches #12 and #13: one can observe that the accuracy of standard OPF decreased from 84.8% to nearly 60%. Although all the other accuracies decreased as well, the proposed OPF with majority voting reached nearly 72%. Table 5.6 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.9: Experimental results concerning the Poker dataset with respect to Experiment 1.

Table 5.6: Poker Hand dataset with respect to Experiment 1.
Type             Accuracy
OPF              70.29847297±8.205813114
OPF-nomemory     70.17197476±6.946167608
OPF-fullmemory   79.24812266±6.830883212
OPF-window3      72.42684324±6.755000663
OPF-combined     73.72377555±6.649724747
OPF-weighted     74.74302866±6.604097967
OPF-major        78.81981252±7.820022743

Figure 5.10 depicts the results concerning the SEA dataset. The combined and weighted versions have similar behavior and also higher accuracy when compared to traditional OPF and the OPF versions with concept drift handling. This dataset showed the robustness of the proposed approaches, which obtained the best results so far, considerably outperforming standard OPF and OPF with concept drift handling but no ensemble-based learning. Table 5.7 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.10: Experimental results concerning the SEA dataset with respect to Experiment 1.

Table 5.7: SEA dataset with respect to Experiment 1.
Type             Accuracy
OPF              74.39154759±1.346465467
OPF-nomemory     76.24827041±2.128896094
OPF-fullmemory   74.69366883±1.47469361
OPF-window3      75.83703666±2.551433832
OPF-combined     78.6340661±2.305279021
OPF-weighted     78.6340661±2.305279021
OPF-major        76.34419907±2.180538025

Figure 5.11 depicts the results concerning the Usenet1 dataset. Once again, the combined and weighted versions obtained similar performance, whereas the major version exhibited an oscillatory behavior when compared to the other ensemble versions. The same behavior can be observed in the Usenet2 dataset as well (Figure 5.12). Tables 5.8 and 5.9 present the mean accuracy and standard deviation concerning all batches.

Figure 5.11: Experimental results concerning the Usenet1 dataset with respect to Experiment 1.

Table 5.8: Usenet1 dataset with respect to Experiment 1.
Type             Accuracy
OPF              54.92385783±11.19221173
OPF-nomemory     59.35022066±14.63104797
OPF-fullmemory   51.54818866±10.24161553
OPF-window3      59.95950859±14.13373717
OPF-combined     57.60552966±13.83043313
OPF-weighted     57.60552966±13.83043313
OPF-major        57.39366248±16.16345656

Figure 5.12: Experimental results concerning the Usenet2 dataset with respect to Experiment 1.

Table 5.9: Usenet2 dataset with respect to Experiment 1.
Type             Accuracy
OPF              50.30906162±1.37347363
OPF-nomemory     56.69870524±10.57647197
OPF-fullmemory   52.94175238±7.191195194
OPF-window3      57.96049969±13.54929432
OPF-combined     58.45837903±12.93718405
OPF-weighted     58.45837903±12.93718405
OPF-major        56.57500459±11.91377109

5.3.2 Experiment 2

In this experiment, we decided to verify whether one can improve concept drift handling using ensemble-based learning without combining different approaches, i.e., OPF-fullmemory, OPF-nomemory, and OPF-window3. Therefore, the idea was, for each aforementioned approach, to partition the training data into three subsets and then apply ensemble learning over them. In this experiment, we compare standard OPF, OPF-nomemory, OPF-fullmemory, OPF-window3, the proposed ensemble-based learning with the combined, weighted, and major strategies, as well as the performance of each individual fold (i.e., fold1, fold2, and fold3).

Figure 5.13 depicts the results concerning the Covertype dataset. The combined and weighted versions obtained similar behavior, with the weighted version better in some batches. The major version performed the worst among all. Under the no-memory and window-3 management, traditional OPF obtained the highest accuracy among all. A behavior similar to that of Experiment 1 can be observed, i.e., OPF and OPF-fullmemory obtained the most accurate results. Table 5.10 presents the mean accuracy and standard deviation concerning all batches under the full-memory management.
Figure 5.13: Experimental results concerning the Covertype dataset with respect to Experiment 2 under the full-memory management.

Table 5.10: Covertype dataset with respect to Experiment 2 in full-memory management.
Type             Accuracy
OPF              82.32752648±5.527916684
OPF-fullmemory   81.51895655±6.360053157
OPF-fold1        77.05269845±6.149721752
OPF-fold2        77.53515266±6.216085779
OPF-fold3        77.32545021±6.333944787
OPF-combined     78.4888871±6.553147785
OPF-weighted     78.58643359±6.601479619
OPF-major        77.65701845±6.154886295

In the Electricity dataset (Figure 5.14), the combined and weighted versions of full-memory, no-memory, and window-3 obtained similar behavior, with majority voting achieving the lowest accuracies. One can also observe that standard OPF did not perform well, with accuracies well below the ones obtained by the others. We can observe under the no-memory management (Figure 5.14) that OPF-nomemory and the combined and weighted versions exhibited an oscillatory performance. Table 5.11 presents the mean accuracy and standard deviation concerning all batches under the no-memory management.

Figure 5.14: Experimental results concerning the Electricity dataset with respect to Experiment 2 under the no-memory management.

Table 5.11: Electricity dataset with respect to Experiment 2 in no-memory management.
Type             Accuracy
OPF              56.94443552±6.002938206
OPF-nomemory     67.06707179±4.373578762
OPF-fold1        65.44168362±3.854310398
OPF-fold2        65.20587686±3.711158446
OPF-fold3        64.99470534±3.879081788
OPF-combined     67.13015469±4.162979812
OPF-weighted     67.13015469±4.162979812
OPF-major        65.21247934±4.102278054

In the Hyperplane dataset (Figure 5.15), the combined and weighted versions of full-memory, no-memory, and window-3 obtained similar performances, with majority voting achieving the lowest accuracies.
OPF with ensemble learning and traditional OPF exhibited an oscillatory performance under the full-memory approach, whereas traditional OPF under the no-memory management achieved higher accuracy only in batches #4, #11, #21, and #28. Traditional OPF under the window-3 management (Figure 5.15) performed slightly better when compared to the no-memory management. Interestingly, standard OPF performed better in the transition between batches #10 and #11, which seems to be a concept drift, but with testing-data statistics similar to those standard OPF was trained on (the beginning of the stream). Table 5.12 presents the mean accuracy and standard deviation concerning all batches under the window-3 management.

Figure 5.15: Experimental results concerning the Hyperplane dataset with respect to Experiment 2 under the window-3 management.

Table 5.12: Hyperplane dataset with respect to Experiment 2 in window-3 management.
Type             Accuracy
OPF              62.48453148±7.664021067
OPF-window3      63.79282938±7.071207897
OPF-fold1        63.05814352±6.930449599
OPF-fold2        62.96584131±7.003386526
OPF-fold3        63.10990428±6.694532681
OPF-combined     65.928872±8.586587354
OPF-weighted     65.928872±8.586587354
OPF-major        63.24691903±7.267281317

In the Poker dataset (Figure 5.16), the combined, weighted, and major versions of full-memory, no-memory, and window-3 once more obtained similar performances, with the weighted version being the most accurate, followed by the combined one. In this dataset, one can clearly observe some situations where standard OPF's accuracy drops considerably (e.g., the transitions between batches #11 and #12 and between batches #20 and #21), since it is not prepared to handle concept drift. Table 5.13 presents the mean accuracy and standard deviation concerning all batches under the full-memory management.

Figure 5.16: Experimental results concerning the Poker dataset with respect to Experiment 2 under the full-memory management.
Table 5.13: Poker dataset with respect to Experiment 2 in full-memory management.
Type             Accuracy
OPF              70.29847297±8.205813114
OPF-fullmemory   79.24812266±6.830883212
OPF-fold1        74.32355024±6.097886009
OPF-fold2        74.84017507±6.282530565
OPF-fold3        74.49460559±6.099172864
OPF-combined     76.36024579±7.0838907
OPF-weighted     77.68715672±7.209030697
OPF-major        74.60479938±6.335993863

In the SEA dataset (Figure 5.17), the combined and weighted versions of full-memory, no-memory, and window-3 obtained similar performances, performing better than traditional OPF and OPF with concept drift handling. The major version achieved the worst accuracy compared to the other ensemble versions. Table 5.14 presents the mean accuracy and standard deviation concerning all batches under the no-memory management.

Figure 5.17: Experimental results concerning the SEA dataset with respect to Experiment 2 under the no-memory management.

Table 5.14: SEA dataset with respect to Experiment 2 in no-memory management.
Type             Accuracy
OPF              74.39154759±1.346465467
OPF-nomemory     76.24827041±2.128896094
OPF-fold1        75.625649±2.204542238
OPF-fold2        75.38316921±2.181240932
OPF-fold3        75.84664707±2.213159411
OPF-combined     81.54980866±2.490986588
OPF-weighted     81.54980866±2.490986588
OPF-major        75.68724848±2.655161575

In the Usenet1 (Figure 5.18) and Usenet2 (Figure 5.19) datasets, the combined and weighted versions of full-memory, no-memory, and window-3 management also obtained similar performances, with the majority-voting ensemble exhibiting an oscillatory behavior. Tables 5.15 and 5.16 present the mean accuracy and standard deviation concerning all batches under the window-3 and full-memory management, respectively.

Figure 5.18: Experimental results concerning the Usenet1 dataset with respect to Experiment 2 under the window-3 management.

Table 5.15: Usenet1 dataset with respect to Experiment 2 in window-3 management.
Type             Accuracy
OPF              54.92385783±11.19221173
OPF-window3      59.95950859±14.13373717
OPF-fold1        55.46803024±12.79941954
OPF-fold2        59.04749762±13.80458451
OPF-fold3        55.87159655±12.47894492
OPF-combined     59.7810629±13.84571992
OPF-weighted     59.7810629±13.84571992
OPF-major        54.87391672±13.76932634

Figure 5.19: Experimental results concerning the Usenet2 dataset with respect to Experiment 2 under the full-memory management.

Table 5.16: Usenet2 dataset with respect to Experiment 2 in full-memory management.
Type             Accuracy
OPF              50.30906162±1.37347363
OPF-fullmemory   52.94175238±7.191195194
OPF-fold1        55.95265503±12.2914427
OPF-fold2        51.793095±8.079654346
OPF-fold3        55.08751245±10.3272684
OPF-combined     55.83010672±12.45587074
OPF-weighted     55.83010672±12.45587074
OPF-major        57.04483903±11.92551197

5.3.3 Experiment 3

In this experiment, we consider the last three batches to perform the ensemble learning step. Therefore, one classifier is trained on each batch, and their results are further combined. This experiment compared traditional OPF, OPF-nomemory, and OPF-window3, as well as the ensemble-based combined and weighted strategies.

Figure 5.20 presents the results over the Covertype dataset. Once again, traditional OPF obtained the best recognition rates so far, with the weighted version similar to the combined approach, though slightly more accurate. In this case, the proposed approaches did not perform well, probably because the last three batches may carry conflicting information, i.e., statistics different from those of the testing batch. Table 5.17 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.20: Experimental results concerning the Covertype dataset with respect to Experiment 3.

Table 5.17: Covertype dataset with respect to Experiment 3.
Type             Accuracy
OPF              82.32752648±5.527916684
OPF-nomemory     66.340352±6.205296248
OPF-window3      66.63647852±6.483513342
OPF-combined     61.75935952±6.122989272
OPF-weighted     63.36128134±6.217485077

Figure 5.21 depicts the results concerning the Electricity dataset. The combined and weighted versions obtained similar performances; an oscillatory behavior relative to the OPF concept-drift versions without ensemble learning can also be observed. In this dataset, the proposed approaches outperformed the others in almost every batch, thus proving quite robust to concept drift. In a few situations, however, OPF-nomemory obtained the highest accuracies. Table 5.18 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.21: Experimental results concerning the Electricity dataset with respect to Experiment 3.

Table 5.18: Electricity dataset with respect to Experiment 3.
Type             Accuracy
OPF              56.94443552±6.002938206
OPF-nomemory     67.06707179±4.373578762
OPF-window3      66.95818376±4.505452085
OPF-combined     66.99822697±4.525034793
OPF-weighted     67.18636838±4.584960193

Figure 5.22 depicts the results concerning the Hyperplane dataset. Once again, the proposed approaches (combined and weighted versions) obtained the best results, with similar performances in some batches, being also slightly more accurate than OPF-window3. In other batches, one can observe that OPF-nomemory obtained highly accurate results. OPF with a window of size 3 did not perform well with respect to the other approaches, although it outperformed OPF with no concept drift handling. Actually, the task of choosing a proper window size is still an open issue in the context of concept drift: smaller windows may not contain enough information about the training data, whereas larger windows may carry old and outdated information with respect to the new data coming through the stream.
Table 5.19 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.22: Experimental results concerning the Hyperplane dataset with respect to Experiment 3.

Table 5.19: Hyperplane dataset with respect to Experiment 3.
Type             Accuracy
OPF              62.48453148±7.664021067
OPF-nomemory     69.01541517±7.265204095
OPF-window3      63.79282938±7.071207897
OPF-combined     66.09129634±8.855442297
OPF-weighted     66.08478797±8.848643462

Figure 5.23 depicts the results concerning the Poker dataset. The combined and weighted versions obtained similar performances, with the weighted version being slightly more accurate. In this situation, a window of size 3 to handle concept drift seems to be a good alternative, since it allowed the best recognition rates for some batches. Standard OPF also performed well in some situations, mainly in batches where the concept-drift-driven approaches did not obtain satisfactory results. Table 5.20 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.23: Experimental results concerning the Poker dataset with respect to Experiment 3.

Table 5.20: Poker dataset with respect to Experiment 3.
Type             Accuracy
OPF              70.29847297±8.205813114
OPF-nomemory     70.17197476±6.946167608
OPF-window3      72.42684324±6.755000663
OPF-combined     67.65484331±6.826856184
OPF-weighted     69.24347752±6.749471883

Figure 5.24 depicts the results concerning the SEA dataset. The combined and weighted variants obtained similar performances. As one can observe, these approaches obtained the best performance among all, except in batch #24, where OPF-nomemory was slightly more accurate. Table 5.21 presents the mean accuracy and standard deviation concerning all batches.

Figure 5.24: Experimental results concerning the SEA dataset with respect to Experiment 3.

Table 5.21: SEA dataset with respect to Experiment 3.
Type             Accuracy
OPF              74.39154759±1.346465467
OPF-nomemory     76.24827041±2.128896094
OPF-window3      75.83703666±2.551433832
OPF-combined     80.88808886±2.995032574
OPF-weighted     80.86339066±3.020802532

Figures 5.25 and 5.26 depict the results concerning the Usenet1 and Usenet2 datasets, respectively. The combined and weighted versions obtained similar performances in both datasets, as one can also observe in the previous experiments concerning these datasets. Tables 5.22 and 5.23 present the mean accuracy and standard deviation concerning all batches.

Figure 5.25: Experimental results concerning the Usenet1 dataset with respect to Experiment 3.

Table 5.22: Usenet1 dataset with respect to Experiment 3.
Type             Accuracy
OPF              54.92385783±11.19221173
OPF-nomemory     59.35022066±14.63104797
OPF-window3      59.95950859±14.13373717
OPF-combined     58.92475714±14.84891883
OPF-weighted     59.12452362±14.91739836

Figure 5.26: Experimental results concerning the Usenet2 dataset with respect to Experiment 3.

Table 5.23: Usenet2 dataset with respect to Experiment 3.
Type             Accuracy
OPF              50.30906162±1.37347363
OPF-nomemory     56.69870524±10.57647197
OPF-window3      57.96049969±13.54929432
OPF-combined     54.47438328±9.70819266
OPF-weighted     54.84932934±9.738073086

5.3.4 Discussion

In regard to Experiments 1 and 2, most datasets yielded similar recognition rates; we did not observe significantly better performance of the ensemble-based approaches over the individual folds in Experiment 2, except in some cases. Still, the proposed approaches outperformed the individual folds in almost all datasets. Experiment 1 also showed that the proposed approaches are well suited to handle concept drift, and that using information from different learners can improve the recognition rates. For some datasets, we observed that combining information from different concept drift handling approaches (Experiment 1) is more beneficial than using information from a single handling approach.
We can verify in Experiment 2 that training on each fold is faster than training on the non-split version (OPF-nomemory, OPF-fullmemory, and OPF-window3). Although the total amount of training data is roughly the same, since OPF has quadratic training complexity, the learning process is faster when performed on smaller training sets (PONTI; PAPA, 2011). Experiment 3 allowed us to combine the three last batches without mixing different concept drift approaches. We have observed that such a methodology yields results somewhat close to the ones obtained with OPF with a window of size 3 for concept drift handling, while being faster, since we now train on three individual batches rather than on a single merged one.

5.4 Partial Considerations

In this work, we dealt with the problem of concept drift concerning the Optimum-Path Forest with ensemble learning. To the best of our knowledge, no study to date has considered the aforementioned method to handle changes in a data stream.

The experiments over real and synthetic datasets compared the standard OPF, OPF-fullmemory, OPF-nomemory, and OPF-window3 approaches, as well as three distinct proposed ensemble-based versions to address concept drift. We have shown that the OPF ensemble is suitable to work under these dynamic scenarios, since its recognition rates were considerably better compared to traditional OPF and OPF with concept drift handling but no ensemble learning.

We have observed that, in Experiment 1, the combined ensemble approach obtained the highest accuracy on the SEA dataset, considerably improving its performance compared to traditional OPF and the OPF versions with concept drift handling but no ensemble learning. We can also verify that, in Experiment 2, the combined methods achieve better accuracy than each fold individually.
Additionally, the main advantage of Experiment 3 over OPF-window3 is its training time, since training on smaller data is faster than on larger data, i.e., training three individual batches is faster than training three merged batches.

In regard to future work, we intend to dynamically learn and adjust the window size, as well as to implement OPF on data streams with incremental learning, i.e., posing constraints on the training time. The idea is then not to retrain the whole classifier when a new batch arrives, but only to update the trained classifier.

5.5 OPF with dynamic window

This section presents the results of OPF with accuracy-based drift detection against traditional OPF and OPF in its no-memory, full-memory, and fixed-window3 versions. The main objective of this work is to verify the performance of OPF with a simple drift detector mechanism to deal with concept drift detection. We employed a dynamic sliding window on public datasets, with the results compared against traditional OPF and its no-memory, full-memory, and fixed-window3 versions.

OPF verifies the accuracy of the last batch and uses it as a threshold: if the accuracy drops below the threshold, a concept drift is detected. While the accuracy remains above the threshold, no concept drift is assumed, and the training set accumulates all previous batches to maintain the stability of the data. If a concept drift occurs, i.e., accuracy drops below the (empirically defined) threshold, the current training data is discarded and the training phase employs only the newest batch.

The datasets used in this work are Electricity (Subsection 2.3.1.2), Hyperplane, SEA, Usenet1, and Usenet2 (Subsection 2.3.1.1). In order to evaluate the techniques, we divided each dataset into 30 equally-sized batches to simulate a stream of data.
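The accuracy-based detector with a dynamic training window can be sketched as follows. This is a minimal illustration under our own naming (`DynamicWindowDetector` is not part of LibOPF), with the classifier itself abstracted away: only the per-batch accuracy feedback drives the window.

```python
class DynamicWindowDetector:
    """Accuracy-threshold drift detector with a dynamic training window.

    While accuracy stays above the (empirically chosen) threshold, every
    incoming batch is appended to the training window for stability, as
    in the full-memory behavior. When accuracy drops below the threshold,
    a drift is flagged, the old window is discarded, and training
    restarts from the newest batch, as in the no-memory behavior.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.window = []          # batches currently used for training
        self.drift_batches = []   # batch indices where drift was flagged

    def update(self, index, batch, accuracy):
        if accuracy < self.threshold:
            # drift detected: discard old data, keep only the newest batch
            self.drift_batches.append(index)
            self.window = [batch]
        else:
            # stable: accumulate batches
            self.window.append(batch)
        return self.window
```

For instance, with the threshold of 63 used for the Electricity dataset, a batch whose accuracy falls to 60 would reset the window to that single batch, while subsequent batches above the threshold would accumulate again.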
Figures 5.27 to 5.31 depict the accuracy rates of all aforementioned techniques considering the datasets employed in this work. Let us have a look at Figure 5.27, which presents the results for the Electricity dataset. Traditional OPF has lower accuracy than the other OPF versions. We also observe that the highest accuracy values alternate among the OPF versions. Drift was detected in batches #9, #17, #21, and #25.

Figure 5.27: Experimental results concerning the Electricity dataset with threshold 63.

Figure 5.28 depicts the results over the Hyperplane dataset stream. OPF with dynamic window has behavior similar to OPF with fixed-window3. The dynamic window did better in batches #8, #13, #17, #20, #23, #27, and #30, whereas fixed-window3 did better in batches #6, #7, #12, #16, #21, #26, and #28. OPF-nomemory achieves higher accuracy in some batches. Drift was detected in batches #6, #11, #12, #15, #16, #18, #19, #21, #22, #25, #26, and #28.

Figure 5.29 depicts the results over the SEA dataset stream. OPF with dynamic window behaves similarly to OPF-fullmemory in accuracy until batch #17, as well as in the data chosen for the training phase. After batch #17, the OPF dynamic window detected a drift and discarded the old data; its accuracy then behaves more similarly to fixed-window3. Drift was detected in batches #16, #17, #23, and #24.

Figure 5.28: Experimental results concerning the Hyperplane dataset with threshold 64.

Figure 5.29: Experimental results concerning the SEA dataset with threshold 73.

Figure 5.30 depicts the results over the Usenet1 dataset stream. OPF with dynamic window starts with behavior similar to OPF-fullmemory until batch #8. After that, its accuracy alternates between the accuracy values of OPF-fullmemory and OPF-fixed-window3. Drift was detected in batches #8, #13, #14, #19, and #28.
Figure 5.30: Experimental results concerning the Usenet1 dataset with threshold 45.

Figure 5.31 depicts the results over the Usenet2 dataset stream. OPF with dynamic window starts with behavior similar to OPF-fullmemory until batch #6. After that, its accuracy alternates between the accuracy values of OPF-fullmemory and OPF-fixed-window3. Drift was detected in batches #5, #7, #13, #22, and #25.

Figure 5.31: Experimental results concerning the Usenet2 dataset with threshold 45.

In this work, we dealt with the problem of concept drift concerning the OPF classifier with a dynamic window. The experiments over real and synthetic datasets compared the standard OPF, OPF-fullmemory, OPF-nomemory, and OPF-window3 approaches, as well as OPF-dynamic-window, to address concept drift. We have shown that OPF-dynamic-window achieves accuracy similar to OPF-fullmemory or OPF-window3, depending on how the window behaves over the stream of data. When accuracy drops below an empirical threshold, OPF-dynamic-window employs only the most recent batch for training, the same data used in the OPF-nomemory training phase. While accuracy remains above the threshold, the batches are added to the training set to increase stability.

Chapter 6

INCREMENTAL OPFknn

This chapter presents the work by Iwashita et al. (IWASHITA et al., 2021), which proposes OPFknn with incremental learning. The idea of an incremental algorithm comes from the need to add new instances to the classifier without retraining on the entire data, which non-stationary environments and streaming data may demand. These environments require the learner to be constantly updated, since repeating the entire learning process might be prohibitive. Thus, adjusting the model to the new data is a better choice, particularly for real-time response applications.
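The incremental paradigm can be contrasted with batch retraining through a generic sketch. This is not the IOPFknn algorithm itself, only an illustrative nearest-neighbor learner (the class name `IncrementalNN` and its `partial_fit` interface are our own assumptions) whose update absorbs new instances without revisiting previously stored data.

```python
import math

class IncrementalNN:
    """Minimal 1-NN learner with incremental updates.

    partial_fit adds new labeled samples in O(len(X)) time without
    touching previously stored data, whereas a batch learner would
    rebuild its whole model from scratch on every new arrival.
    """

    def __init__(self):
        self.samples = []  # list of (feature_vector, label) pairs

    def partial_fit(self, X, y):
        # incremental step: just absorb the new instances
        self.samples.extend(zip(X, y))

    def predict(self, x):
        # label of the stored sample closest to x (Euclidean distance)
        _, label = min(self.samples, key=lambda s: math.dist(s[0], x))
        return label
```

The design choice here mirrors the motivation above: the cost of an update depends only on the size of the new batch, not on everything seen so far, which is what makes the approach attractive for real-time response applications.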
We named our algorithm Incremental OPFknn (IOPFknn) and employed it on several datasets, including non-technical losses environments. Non-technical losses stand for the energy that is consumed but not billed, usually referred to as commercial losses, and affect the energy grid as a whole. Frauds in energy consumption are prevalent in developing countries and may harm the quality of energy and further improvements in social programs that could benefit from tax revenues. Therefore, the automatic identification of such illegal users is important to maintain decent services for the population. Thus, machine learning techniques must overcome this issue by mining information from fraudsters and legal users for further decision-making, hence requiring the learner to be constantly updated.

6.1 Introduction

Non-technical losses, also known as commercial losses, stand primarily for energy theft in power distribution systems, but are not limited to it. Non-payment by customers and measurement errors in accounting and record-keeping are also frequent events that can characterize such illegal activities. Underdeveloped countries are the most affected, since socioeconomic factors are crucial in shaping electricity theft behaviors (RAZAVI; FLEURY, 2019).

Research on computer-assisted non-technical losses is usually grouped into two scenarios: (i) identification or detection and (ii) categorization. The former approaches aim at understanding the behavior of illegal profiles to identify them among hundreds of thousands of consumers. On the other hand, categorization techniques are concerned with learning the primary information used to describe users, i.e., with selecting the subset of features that maximizes the identification rates.

Buzau et al. (BUZAU et al., 2019) used data from smart meters to feed supervised machine learning techniques for non-technical losses detection.
The authors argued that using labeled information is challenging, since one must often rely on imbalanced datasets. Therefore, the work employed undersampling techniques over the majority class, i.e., legal consumers, before performing pattern recognition.

Ramos et al. (RAMOS et al., 2011) introduced the Optimum-Path Forest (OPF) (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012) classifier in the context of non-technical losses identification. OPF works under the graph-partition assumption for pattern recognition, i.e., the rationale behind it is to partition a graph induced by the training samples into clusters of samples from the very same class. The authors reported results that outperformed some state-of-the-art techniques, such as Support Vector Machines (SVM). Later on, Fernandes et al. (FERNANDES et al., 2019) proposed a probabilistic OPF classifier able to produce soft outputs, i.e., given the profile of some consumer, OPF could output its probability of being legal or not. The main idea was to provide a framework that could monitor users over time and then learn whether they are likely to become fraudsters.

From another perspective, Martins et al. (MARTINS et al., 2019) developed a non-intrusive energy meter for non-technical losses identification in low-voltage AC installations. The authors argued that the proposed approach does not require on-site calibration, besides being a safe and easy device to install in service drop lines. Kim et al. (KIM et al., 2019) devised an intermediate monitor-meter framework that can divide the grid into small logical sections and then provide a more in-depth analysis of the power flow for further non-technical losses inspection. The work reported real-time detection rates of up to 95%. Last but not least, Ramos et al. (RAMOS et al., 2016) proposed the Binary Black Hole Algorithm to characterize the profile of illegal consumers in Brazil.
The authors modeled the issue as a metaheuristic-based optimization problem, such that the accuracy of the OPF classifier was the fitness function used to find the most representative subset of features.

Most of the works mentioned above benefited from the machine learning framework, in which techniques learn by experience. Such algorithms are useful in problems like data mining, poorly understood domains, and environments where the learner needs to adapt dynamically to changing conditions (MITCHELL, 1997). The learners build models from sample inputs and can be categorized according to the available feedback: in supervised learning, the algorithm predicts a result and has access to the correct outcome; in unsupervised learning, there is no information about the outputs; and in semi-supervised learning, there is incomplete information about the correct outcome (RUSSELL; NORVIG, 1995).

Incremental learning is a paradigm in which, as new instances appear, the method adjusts the model according to them. If the entire learning process has to be repeated whenever a new instance arrives, it may waste time and computational resources. Therefore, adjusting the learned model according to the new instances may be a better decision, particularly for real-time response applications (GENG; SMITH-MILES, 2009). Other examples include time-dependent data generation (e.g., time-series data) and non-stationary environments, i.e., when the target concept changes over time (concept drift) and the learner should be able to self-adapt (GENG; SMITH-MILES, 2009; IWASHITA; PAPA, 2019).

As aforementioned, the Optimum-Path Forest stands for a pattern recognition framework that partitions instances (graph nodes) into a collection of optimum-path trees (OPTs). Each OPT has a key sample (prototype), which is the root of the tree, and the prototypes compete among themselves to conquer the remaining instances.
Currently, the OPF classifier has three versions: (i) supervised (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012; PAPA; FERNANDES; FALCÃO, 2017), (ii) unsupervised (ROCHA; CAPPABIANCO; FALCÃO, 2009), and (iii) semi-supervised (AMORIM et al., 2016). The supervised OPF classifier has two approaches: one that uses a complete graph (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012) and another that uses a k-nn graph (OPFknn) (PAPA; FERNANDES; FALCÃO, 2017). Recently, Iwashita and Papa (IWASHITA; PAPA, 2018; IWASHITA; ALBUQUERQUE; PAPA, 2019) proposed the use of OPF classifiers in concept drift environments, and other works were developed to speed up the training algorithm by taking into account the structural advantages of the data (PAPA et al., 2012; PONTI; ROSSI, 2013; PONTI; PAPA, 2011; IWASHITA et al., 2014). Ponti and Riva (PONTI; RIVA, 2017) proposed an incremental version of the OPF with the complete graph (OPF Incremental, OPFI), in which the traditional OPF trains the initial model structure and the OPFI then updates the model with the new instances.

In this work, we propose an incremental version of the OPF with the k-nn graph (incremental OPFknn) in order to avoid retraining on all instances when including new samples in the training dataset. We performed experiments comparing the proposed approach against the traditional OPFknn, showing that it can be as effective as the standard version while being considerably faster, both in the context of non-technical losses identification and in general-purpose problems. Finally, we are not aware of previous works that aimed at modeling the problem of non-technical losses as an incremental learning task.
In a nutshell, the main contributions of this work are two-fold:

• to propose an incremental version of the OPFknn classifier; and

• to introduce the incremental learning paradigm in the context of non-technical losses identification.

The remainder of this work is organized as follows. Section 6.2 presents the theoretical background regarding OPFknn, and Section 6.3 discusses the proposed approach based on the incremental learning paradigm. Section 6.4 presents the datasets and methodology, while Section 6.5 discusses the experiments. Finally, Section 6.6 states conclusions and future works.

6.2 Theoretical Background

In the OPF with k-neighborhood (OPFknn), one connects each node to its k-nearest neighbors for the further calculation of a density value. The rationale is that high-density nodes become the maxima of a probability density function (pdf), thus conquering the remaining nodes and forming an optimum-path forest, i.e., a collection of optimum-path trees rooted at each maximum (prototype). Later on, in the classification phase, a test sample is included in that optimum-path forest by finding its k-nearest neighbors and further computing its density value. The training node that conquers the test sample defines its class.

The training and classification processes can be modeled as a reward-competition problem, in which samples try to conquer others by offering them optimum-path costs based on arc weights and a path-cost function. Different OPF formulations employ distinct path-cost functions, adjacency relations, and methodologies to estimate prototypes. The next sections explain the OPFknn training and classification phases in more detail.

6.2.1 Training Step

Let Z = {x_1, x_2, ..., x_m} be a dataset such that x_i ∈ ℜⁿ. Besides, let us assume that Z = Z1 ∪ Z2, in which Z1 and Z2 stand for the training and test sets, respectively. Additionally, we suppose that Z1 ∩ Z2 = ∅.
Based on these considerations, one can derive a weighted graph Gtr = (Z1, A_k, w), such that A_k denotes an adjacency relation that connects each sample to its k-nearest neighbors, and w : Z1 × Z1 → ℜ⁺ is a function that computes arc weights between nodes. Figure 6.1 depicts a 2-nearest-neighbors weighted graph derived from a training set with nine nodes. (Since symmetry is not guaranteed in the k-nearest-neighbors graph, some nodes have degree greater than two.)

Figure 6.1: Toy example: a 2-nearest-neighbors graph derived from a given training set. Edges are weighted by the distance between their corresponding nodes.

Given the graph Gtr, the next step computes the probability density function ρ of each node based on its neighborhood. The rationale behind computing such a value is that the higher the density of some sample, the more likely it is to become a prototype. Unlike OPF with a complete graph, where the prototypes are placed near the decision boundary (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012), OPFknn estimates prototypes in the highest-density regions (PAPA; FERNANDES; FALCÃO, 2017). The pdf can be computed as follows:

\[
\rho(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}\,|A^{*}_{k}(x_i)|} \sum_{\forall x_j \in A^{*}_{k}(x_i)} \exp\left(\frac{-d^2(x_i, x_j)}{2\sigma^2}\right), \quad (6.1)
\]

where d(x_i, x_j) denotes the distance between nodes x_i and x_j such that x_i, x_j ∈ Z1 and i ≠ j, σ = d_f/3, and d_f is the maximum arc weight in the training set. Besides, A*_k(x_i) denotes the adjacency relation for sample x_i considering the k*-nearest neighbors, in which k* ∈ [1, kmax]. Although Papa and Falcão (PAPA; FALCÃO, 2009) proposed to find k* as the value that maximizes the classification accuracy over a validation set Gval ⊂ Gtr, in this work we considered k* = kmax. Figure 6.2 displays the training graph from Figure 6.1 after density computation, in which the sample with the highest density value is chosen as the prototype, i.e., the dashed sample 'F'.
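As an illustration, Equation 6.1 can be sketched in NumPy as follows. This is a minimal sketch under our own naming conventions (the `knn_density` function and its interface are ours, not from LibOPF), assuming Euclidean arc weights and k* = kmax:

```python
import numpy as np

def knn_density(X, k):
    """Estimate the pdf value of Equation 6.1 for every training sample.

    X: (m, n) array of training samples; k: neighborhood size (k* = kmax).
    Returns (rho, neighbors): density values and k-nn index arrays.
    """
    # Pairwise Euclidean distances between all training samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # k-nearest neighbors of each node
    knn_d = np.take_along_axis(d, neighbors, axis=1)
    d_f = knn_d.max()                            # maximum arc weight in the graph
    sigma = d_f / 3.0                            # sigma = d_f / 3, as in the text
    # rho(x_i) = 1 / (sqrt(2*pi*sigma^2) * |A*_k|) * sum_j exp(-d^2 / (2*sigma^2))
    rho = np.exp(-knn_d**2 / (2 * sigma**2)).sum(axis=1)
    rho /= np.sqrt(2 * np.pi * sigma**2) * k
    return rho, neighbors
```

Samples inside a dense cluster obtain a higher ρ than isolated samples, which is precisely why the prototypes end up at the pdf maxima.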
Figure 6.2: Density computation using Equation 6.1 concerning the graph depicted in Figure 6.1. The numbers nearby the nodes stand for their density values.

After the pdf computation, a competition process among prototypes takes place using the path-cost function C, which can be computed as follows:

\[
C(x_i) =
\begin{cases}
\rho(x_i) & \text{if } x_i \in \mathcal{R},\\
\rho(x_i) - \delta & \text{otherwise,}
\end{cases}
\qquad
C(\tau_{x_j} \cdot (x_j, x_i)) = \min\{C(\tau_{x_j}), \rho(x_i)\}, \quad (6.2)
\]

where ρ(x_i) − δ is employed to avoid plateaus near the maxima of the pdf, δ is a small constant, R ⊆ Z1 stands for the set of prototype (root) samples, τ_{x_j} denotes a path with terminus at sample x_j, and (x_j, x_i) ∈ A_{k*} corresponds to an edge connecting samples x_j and x_i.

The main idea of the competition process is, for every training node, to find the sample that maximizes Equation 6.2. In the end, one has an optimum-path forest with trees rooted at the prototype samples, as depicted in Figure 6.3. The collection of these trees gives the classifier its name.

Figure 6.3: Optimum-path forest generated after the competition process using the densities shown in Figure 6.2 and arc weights computed in Figure 6.1. For the sake of explanation, we employed δ = 0.01 in this example. Prototypes are highlighted, and each conquered sample is assigned the same label (color) as its corresponding prototype (i.e., the root of the tree).

A more detailed description of OPFknn is given in Algorithm 1. Lines 1-3 initialize the variables and insert all samples in the priority queue Q. The main loop in Lines 5-17 performs the core of the algorithm. It first removes a sample q from Q with maximum path value C(q) in Line 6. If q has not been conquered by any other sample, then P(q) = nil (Line 7) and q is a prototype of the connectivity map (a maximum of the pdf).
Since q ∈ R, its connectivity value is reset to ρ(q) (Line 8), according to the first part of Equation 6.2, and q is assigned its true label λ(q) for optimum-path propagation to the rest of its tree.

Algorithm 1: OPFknn algorithm
Input: λ-labeled graph Gtr = (Z1, A*_k) and δ value.
Output: Predecessor map P, path-cost map C, and label map L.
Auxiliary: Priority queue Q, density map ρ, and variable cst.
1  for all x_i ∈ Gtr do
2      Compute ρ(x_i) using Equation 6.1;
3      P(x_i) ← nil, C(x_i) ← ρ(x_i) − δ, Q ← x_i;
4  end
5  while Q ≠ ∅ do
6      Remove from Q a sample q such that C(q) is maximum;
7      if P(q) = nil then
8          L(q) ← λ(q), and C(q) ← ρ(q);
9      end
10     for all u ∈ A_k(q) such that C(u) < C(q) do
11         cst ← min{C(q), ρ(u)};
12         if cst > C(u) then
13             L(u) ← L(q), P(u) ← q, C(u) ← cst;
14             Update position of u in Q;
15         end
16     end
17 end
18 return [P, C, L]

The inner loop in Lines 10-16 evaluates every adjacent sample u of q to which q can offer a better connectivity value (i.e., C(u) < C(q)). If the path τ_q · (q, u) offers a higher cost to u (Lines 11-12), then the current path τ_u is replaced by the new path τ_q · (q, u), and the maps C(u), L(u), and P(u) are updated accordingly (Lines 13-14).

6.2.2 Classification Step

The optimum-path forest computed in the training phase is then used to classify each test sample in the classification module (PAPA; FERNANDES; FALCÃO, 2017) as follows: each sample x_i ∈ Z2 is connected to its k*-nearest neighbors in the optimum-path forest computed earlier, and then the training sample that satisfies the criterion below is found:

\[
C(x_i) = \max_{x_j \in A_{k^*}(x_i)} \min\{C(x_j), \rho(x_i)\}, \quad (6.3)
\]

such that i ≠ j. Notice that x_i is labeled with the same label as the sample that conquers it. After this process, sample x_i is removed from the graph.
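Algorithm 1 and the classification rule of Equation 6.3 can be sketched in Python as follows. This is a simplified illustration under our own function names, assuming `rho` and `neighbors` come from a density step such as Equation 6.1; a binary heap with lazy deletion stands in for the original priority queue:

```python
import heapq

def opf_knn_train(rho, neighbors, labels, delta=0.01):
    """Competition process of Algorithm 1 (sketch).

    rho: density per node; neighbors: k-nn index lists; labels: true labels.
    Returns predecessor map P, cost map C, and label map L.
    """
    m = len(rho)
    C = [r - delta for r in rho]        # initial costs: rho(x_i) - delta
    P = [None] * m                      # predecessor map (None = prototype)
    L = list(labels)                    # label map, propagated along paths
    done = [False] * m
    # Max-priority queue ordered by cost (heapq is a min-heap, so negate).
    Q = [(-C[i], i) for i in range(m)]
    heapq.heapify(Q)
    while Q:
        c, q = heapq.heappop(Q)
        if done[q] or -c != C[q]:       # stale entry (lazy deletion)
            continue
        done[q] = True
        if P[q] is None:                # never conquered: q is a prototype
            C[q] = rho[q]               # (its label stays its own true label)
        for u in neighbors[q]:
            if done[u]:
                continue
            cst = min(C[q], rho[u])     # cost the path through q offers to u
            if cst > C[u]:
                C[u], P[u], L[u] = cst, q, L[q]
                heapq.heappush(Q, (-cst, u))
    return P, C, L

def opf_knn_classify(rho_test, nbr_idx, C, L):
    """Equation 6.3: the neighbor offering max min{C(x_j), rho(x_i)}
    conquers the test sample and defines its label."""
    best = max(nbr_idx, key=lambda j: min(C[j], rho_test))
    return L[best]
```

On a toy graph with two density maxima, the two densest nodes become prototypes and conquer their respective neighborhoods, mirroring Figure 6.3.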
6.3 Proposed Incremental Learning Approach

In this section, we present the proposed approach based on incremental learning for the Optimum-Path Forest classifier, hereinafter called IOPFknn. As aforementioned, real-world problems usually suffer from two main drawbacks: (i) large datasets and (ii) dynamic environments. Both issues are somehow connected: every time the statistics of the test set change, the model must be retrained. However, when the training set is sufficiently large, it becomes unfeasible to update the model on the fly.

The issue addressed in this work, i.e., computer-aided non-technical losses identification, fits both situations: energy companies now keep track of hundreds of thousands of users who, at the same time, keep improving themselves and finding other ways to hack the system. Therefore, a classifier trained recently may not be accurate enough to identify potential new fraudsters in the grid. The proposed IOPFknn is capable of adding new training samples and self-updating without the need to retrain over the whole set. Standard OPF requires an entirely new training step even if only one sample comes into the stage, and the same happens with traditional pattern recognition techniques as well.

Let x′ ∈ ℜⁿ be a new node that is going to be added to the training set. In this case, we must consider two important circumstances: (1) the kmax-nearest neighbors of x′ and (2) the training nodes that have x′ in their k*-neighborhood, i.e., we need to find all training nodes affected by this new node x′. We use k* = kmax since we assume the training set grows sufficiently large, so it does not make sense to search for k* within the interval [1, kmax].
If only a few samples are added to the training set, we can assume kmax remains suitable. Notice that, when the number of samples to be added to the training set is considerably higher than kmax, one may need to consider retraining the classifier over the whole training set using standard OPFknn. Figure 6.4 illustrates a new sample x′ being added to the training set.

The proposed approach must address two distinct situations: (1) sample x′ does not fall into the neighborhood of any training sample, or (2) the other way around, i.e., sample x′ does fall into the neighborhood of some training sample. Let us first analyze situation (1), which comprises two different possibilities, stated by Propositions 6.3.1 and 6.3.2.

Figure 6.4: A new sample x′ is added to the optimum-path forest pre-computed by standard OPFknn, and its kmax-nearest neighbors are found. We assume k* = 2 in this example.

Proposition 6.3.1. Let Y = {y_1, y_2, ..., y_kmax} be a set of samples such that y_i ∈ A_kmax(x′). If ρ(x′) ≤ ρ(y_i), ∀y_i ∈ Y, then x′ ∈ R, i.e., x′ will become a prototype and will not conquer any sample in Y.

Proof. Let us pick any sample y_i ∈ Y, and let C_x′(y_i) be the cost that sample x′ offers to node y_i. When x′ pops out from the priority queue Q, since it has no predecessors, i.e., P(x′) = nil, then C(x′) = ρ(x′) (Line 8 in Algorithm 1). Besides, given that ρ(x′) ≤ ρ(y_i), C_x′(y_i) = min{C(x′), ρ(y_i)} = ρ(x′). Since each training sample has its cost updated at least once (all samples are inserted into the priority queue Q in Line 3 of Algorithm 1) and the minimum cost a sample can be assigned is its density, we can state that C(y_i) ≥ ρ(y_i) ≥ C_x′(y_i).
Also, assuming the OPF employs the FIFO (first-in-first-out) policy, i.e., a sample that offers the minimum cost first has priority over the conquered sample, x′ will not conquer any sample within its neighborhood even if it offers the same optimum-path cost. Therefore, x′ will become a prototype.

Proposition 6.3.2. Let Y = {y_1, y_2, ..., y_kmax} be a set of samples such that y_i ∈ A_kmax(x′). Suppose that ∃y_i ∈ Y such that ρ(y_i) ≤ ρ(x′). In this case, we have that x′ ∈ R, i.e., x′ will become a prototype and will not conquer any sample in Y.

Proof. Let y_i ∈ Y be a sample such that ρ(y_i) ≤ ρ(x′), i = 1, 2, ..., kmax. When x′ is removed from the priority queue Q, it has no predecessors, since it does not fall into the neighborhood of other samples. Therefore, C(x′) = ρ(x′). Moreover, since ρ(x′) ≥ ρ(y_i), x′ would be removed from Q prior to y_i, i.e., there is a possibility that x′ could conquer y_i. Since C_x′(y_i) = min{C(x′), ρ(y_i)} = ρ(y_i) and C(y_i) ≥ ρ(y_i), the FIFO policy does not allow y_i to be conquered by x′. Therefore, x′ will not conquer any sample in its neighborhood, thus becoming a prototype.

We then consider situation (2), i.e., when x′ falls into the neighborhood of some training sample. In this case, we must analyze a few distinct scenarios, as discussed below.

Proposition 6.3.3. Let Y = {y_1, y_2, ..., y_M} be a set of samples such that x′ ∈ A_k*(y_i). Besides, let z* be the predecessor of y_i in the optimum-path forest, i.e., P(y_i) = z* (Figure 6.5). In this case, if ρ(y_i) < ρ′(y_i), where ρ′(y_i) denotes the density of y_i recomputed after the insertion of x′, the costs along the tree rooted at z* must be reevaluated. Moreover, if C_y_i(x′) > ρ(x′), then x′ will be conquered by the sample y_i ∈ Y that offers the maximum cost; otherwise, i.e., if C_y_i(x′) ≤ ρ(x′), x′ will become a prototype.

Proof. If C_y_i(x′) ≤ ρ(x′), x′ will not be conquered by any sample in Y and will become a prototype. Besides, if C_y_i(x′) > ρ(x′), then Equation 6.3 holds:

\[
C_{y_i}(x') = \max_{y_i} \min\{C(y_i), \rho(x')\}, \quad (6.4)
\]

i.e., x′ is conquered by y_i.
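The decision taken when a new sample x′ arrives can be summarized as in the sketch below. This is our own simplification of the propositions above (the function name and interface are ours); it covers only the fate of x′ itself and ignores the density and cost reevaluation of affected neighbors discussed in Propositions 6.3.3 and 6.3.6:

```python
def insert_sample(rho_new, nbr_costs, nbr_labels):
    """Decide the fate of a new sample x' (simplified sketch of Section 6.3).

    rho_new: density of x' over its kmax-nearest neighbors;
    nbr_costs / nbr_labels: costs and labels of the training samples whose
    k*-neighborhood now contains x'.
    Returns (cost, label, is_prototype); label None means x' keeps its own.
    """
    # Situation (1): x' falls in nobody's neighborhood, so x' becomes a
    # prototype (Propositions 6.3.1 and 6.3.2).
    if not nbr_costs:
        return rho_new, None, True
    # Situation (2): each neighbor y_i offers min{C(y_i), rho(x')}; the best
    # offer conquers x' (cf. Equations 6.4 and 6.5).
    offers = [min(c, rho_new) for c in nbr_costs]
    best = max(range(len(offers)), key=offers.__getitem__)
    return offers[best], nbr_labels[best], False
```

The key property exploited here is locality: only the neighborhood of x′ participates in the decision, so no full retraining is required.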
Proposition 6.3.6. Let x* be a sample that belonged to the neighborhood of another node x̄ that now contains x′ in its neighborhood. In such a situation, one needs to consider whether x* will become a prototype or be conquered by some other sample.

Proof. Let Y = {y_1, y_2, ..., y_N} be a set of samples such that x* ∈ A_k*(y_i), and let y*_i ∈ Y be the sample that satisfies the equation below:

\[
C_{y_i}(x^{*}) = \max_{y_i} \min\{C(y_i), \rho(x^{*})\}. \quad (6.5)
\]

If C_{y*_i}(x*) ≤ ρ(x*), then x* will not be conquered by any sample in Y, thus becoming a prototype. On the other hand, if C_{y*_i}(x*) > ρ(x*), then x* will be conquered by y*_i.

6.4 Methodology

The experiments were divided into two rounds: (i) the first aimed at evaluating the proposed approach on general-purpose datasets, while (ii) the second assessed the robustness of IOPFknn on two datasets concerning NTL identification. All experiments compared the proposed IOPFknn against its naive counterpart, OPFknn, in terms of training efficiency and effectiveness over the test set. We expected IOPFknn and OPFknn to be similar in terms of recognition rates, with the former being faster during training.

The experimental protocol was conducted as follows. Let D = D1 ∪ D2 ∪ D3 be a dataset such that D1, D2, and D3 stand for the training, incremental, and test sets, respectively. Besides, D1 ∪ D2 accounts for 50% of the entire dataset D, whereas D3 accounts for the remaining 50% (these percentages were set empirically). The idea is to use D1 and D2 to assess the efficiency of IOPFknn when performing online training, while D3 is used for effectiveness purposes only. Traditional OPFknn uses D1 ∪ D2 as a single training set, while IOPFknn is evaluated as follows: it is first trained on D1 and then updated with the samples from D2. Besides, the kmax value was set as a percentage of D1 and chosen empirically according to the dataset.
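The protocol can be sketched as follows. The function names, the uniform random split, and the exact perturbation scheme are our assumptions for illustration; `triple_with_noise` corresponds to the ±α augmentation applied to the NTL datasets, described further below:

```python
import numpy as np

def protocol_split(n_samples, d1_frac, kmax_pct, seed=0):
    """Split indices into D1 (initial training), D2 (incremental), and D3
    (test): D1 u D2 holds 50% of the data and D3 the other 50%; D1 gets
    `d1_frac` of the first half (e.g., 0.1 for the 10-90 configuration).
    kmax is taken as `kmax_pct` percent of |D1|, as in Table 6.1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    train_inc, d3 = idx[:half], idx[half:]        # (D1 u D2) and D3
    cut = int(d1_frac * half)
    d1, d2 = train_inc[:cut], train_inc[cut:]     # initial and incremental sets
    kmax = max(1, int(len(d1) * kmax_pct / 100.0))
    return d1, d2, d3, kmax

def triple_with_noise(X, alpha=0.05):
    """Triple a dataset by perturbing every feature by +/- alpha,
    yielding (x, x + alpha, x - alpha)."""
    X = np.asarray(X, dtype=float)
    return np.vstack([X, X + alpha, X - alpha])
```

Repeating `protocol_split` with 15 different seeds reproduces the repeated random sampling used for the statistical analysis reported later.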
Table 6.1 presents a brief description of each dataset and the kmax percentage used. The first two datasets account for consumer profiles from commercial and industrial customers obtained from an electric power company in Brazil.

Table 6.1: Dataset description and kmax percentages.

Dataset           # of samples   # of features   kmax (D1 percentage)
Commercial        14,856         4               1%
Industrial        9,546          4               1%
Hyperplane        90,000         10              0.1%
Cifar10           60,000         1,024           1%
Mnist             70,000         784             1%
Forest Covertype  581,012        54              0.01%
Electricity       45,312         8               1%
Poker Hand        829,201        10              0.01%

The original datasets contain 8 features (demand billed, demand contracted, demand measured or maximum demand, reactive energy, power transformer, power factor, installed power, and load factor), but we kept only the four most discriminative ones for each dataset:

• Commercial: demand measured, power factor, installed power, and load factor.

• Industrial: demand contracted, demand measured, installed power, and load factor.

To accomplish such a purpose, we partitioned the datasets into training, validation, and test sets containing 50%, 30%, and 20% of the original dataset, respectively. The idea is to employ both training and validation sets to find the subset of features that maximizes the accuracy over the test set, with the accuracy being the fitness function. We used the well-known Particle Swarm Optimization (KENNEDY; EBERHART, 1995): each agent is initialized with random binary positions, and the original dataset is mapped to a new one containing the features selected in this first sampling. In addition, the fitness function of each agent is set to the standard OPF (PAPA; FALCÃO; SUZUKI, 2009; PAPA et al., 2012) recognition rate over the validation set after training.
The final subset is the one that maximizes the accuracy over the validation set, and the accuracy over the test set is then assessed using this final subset of selected features. Notice the fitness function employed here is the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009), which is capable of handling unbalanced classes. We also tripled the size of these datasets by introducing a small noise on each feature x, i.e., x = x ± α, with α = 0.05 in this work.

The remaining datasets are used to assess the performance of the proposed approach in general-purpose domains. We partitioned D1 ∪ D2 into nine different configurations, i.e., 10% for the training set D1 and 90% for the incremental set D2 (i.e., 10-90), 20% for the training set D1 and 80% for the incremental set D2 (i.e., 20-80), and so on. Besides, each configuration was randomly generated 15 times to allow statistical analysis and to report the mean accuracies and execution times.

The main idea of this experimental setup is to evaluate whether updating the training model with the incremental set via the proposed IOPFknn is preferable to the traditional OPFknn training over the entire training set, i.e., D1 ∪ D2. Concerning the OPFknn implementation, we employed the open-source library LibOPF (PAPA; SUZUKI; FALCÃO). Additionally, we considered the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009), which takes into account unbalanced datasets, usually the case in the context of NTL identification.

6.5 Experiments

In this section, we discuss the experiments conducted to show the robustness of the proposed approach.

6.5.1 Non-Technical Losses Datasets

In the first experiment, we assessed the robustness of IOPFknn on the Commercial dataset using different setups, i.e., incremental sets of distinct sizes.
The rationale is to verify whether there is a specific limit on the incremental set size that indicates when IOPFknn is preferable to OPFknn. Figure 6.6 depicts this initial experiment.

The first point to observe concerns the recognition rates over the test set. As aforementioned, we always used 50% of the entire dataset to build the training and incremental sets. Therefore, one would expect the accuracies reported in Figure 6.6 to be similar to each other regardless of the setup. However, as described in Section 6.3, we set k* = kmax, where the value of kmax varies according to the training set percentage. Therefore, a configuration of 10%-90% uses a smaller value for kmax than a configuration of 90%-10%, thus constraining the pdf estimation to smaller neighborhoods. This explains why we obtain different accuracies for the very same training set size. Also, one can observe that the accuracies are quite similar between IOPFknn and OPFknn, thus showing the correctness of the idea presented in Section 6.3. A further statistical evaluation using the Wilcoxon signed-rank test (WILCOXON, 1945) with a significance of 5% revealed that four out of nine configurations obtained similar accuracies; besides, the ones not considered alike differ by up to 3% only.

Figure 6.6: Experiment concerning the Commercial dataset with different setups: recognition rates are computed over the test set.

Figure 6.7 depicts the training time considering the different configurations. Since we were motivated to provide a similar but faster solution, these results are the most important ones. Clearly, the training times provided by IOPFknn are consistently shorter than the ones spent by OPFknn (63% faster, on average).

Figure 6.7: Experiment concerning the Commercial dataset with different setups: training time [s].
The behavior of OPFknn is as expected, i.e., the training time should not differ much among the configurations, since the training set size is the same. Regarding IOPFknn, the expected behavior, i.e., the smaller the incremental set, the less effective the proposed technique, holds until the configuration 60%-40%. After that, IOPFknn becomes more effective, which can be explained by the complexity of some additional operations required, i.e., some modifications to the standard OPFknn to deal with online learning.

The next experiment concerns the Industrial dataset, which comprises profiles of large companies and industries. Usually, these are the consumers the electric power company is most concerned with, since they account for the largest thefts. Figure 6.8 depicts the recognition rates under different configurations for this dataset.

Figure 6.8: Experiment concerning the Industrial dataset with different setups: recognition rates are computed over the test set.

Similar to the previous experiment, the recognition rates decreased as the training set percentage increased. The rationale behind this scenario concerns the kmax parameter, which forces the pdf computation over larger neighborhoods (i.e., the rightmost columns). Basically, we are forcing fewer and bigger clusters, which may not reflect the real distribution of the data. The Wilcoxon test outcome tells us that three out of nine configurations are similar between IOPFknn and OPFknn, and the main differences concerning the recognition rates are up to 2% only.

Figure 6.9 displays the mean training time concerning the Industrial dataset. A behavior similar to that of the Commercial dataset was observed, i.e., IOPFknn was faster than standard OPFknn for all configurations (33% faster, on average).

Figure 6.9: Experiment concerning the Industrial dataset with different setups: training time [s].

In real life, one would use only one configuration for both efficiency and effectiveness; the main idea of these experiments is to stress the robustness of the proposed approach under different scenarios. The most important observations come from two main points: (i) IOPFknn and OPFknn obtained very close recognition rates in all configurations, and (ii) the proposed approach figured as the fastest one, thus corroborating the primary motivation of this work, i.e., to deal with dynamic scenarios that require online training.

6.5.2 General-Purpose Datasets

In this section, we present the experimental results concerning general-purpose datasets. We are motivated by the fact that the proposed approach can be applied to a broader range of applications, not limited to non-technical losses identification, despite the latter being our primary concern in this research.

Table 6.2 presents the mean recognition rates over the general-purpose datasets. For the sake of clarity, we present and discuss only the results concerning the configuration 50%-50%. We chose this configuration since it features training and incremental sets of the same size, i.e., we are not favoring either set. Also, this situation highlights an important aspect, i.e., the situations in which the proposed approach and standard OPFknn obtained similar and different recognition rates.

Concerning the Hyperplane and Cifar10 datasets, both approaches obtained quite close recognition rates, while for the Mnist, Electricity, and Poker Hand datasets standard OPFknn outperformed IOPFknn. Besides, on the Forest Covertype dataset, IOPFknn outperformed standard OPFknn. The main problem related to medium-to-large datasets with sufficiently large training and incremental sets concerns the possibility of several distinct optimum-path forests.
Table 6.2: Recognition rates using the configuration 50-50 concerning the general-purpose datasets.

Dataset           Configuration   OPFknn   IOPFknn
Hyperplane        50-50           58.72%   57.33%
Cifar10           50-50           55.59%   54.79%
Mnist             50-50           97.00%   91.26%
Forest Covertype  50-50           84.09%   87.04%
Electricity       50-50           69.86%   63.41%
Poker Hand        50-50           67.15%   64.42%

Although the total cost of those optimum-path forests is the same, the tie-zones (i.e., plateaus with the same density) may comprise samples from different classes. Therefore, depending on the order these samples are processed (i.e., the order they are added to the priority queue Q in Algorithm 1), we may have samples from different classes, but with the same cost, conquering their neighbors. Such behavior influences the final recognition rate.

The problem mentioned above has been the subject of research using different approaches, as stated by Fernandes and Papa (FERNANDES; PAPA, 2019), who proposed an approach to alleviate the tie-zone problem in the context of standard OPF (i.e., the one that makes use of the complete graph). Although we could concentrate on a similar approach, we understand that it is not the focus of this work. If we choose another configuration for the Mnist dataset, e.g., 90%-10%, the difference between OPFknn and IOPFknn is only 1.73%; likewise, on the Poker Hand dataset with 90%-10%, the difference between OPFknn and IOPFknn is only 0.33%. Therefore, we argue that it is highly likely that one can find a configuration that minimizes the difference between these approaches. Notice we are playing with the training and incremental sets only, i.e., the test set does not take part in the incremental learning step.

Now, we concentrate on the computational load for the sake of efficiency comparison. Once again, the idea is to evaluate whether the trade-off between accuracy and efficiency makes the proposed approach worthwhile. Table 6.3 presents the mean execution time for training concerning OPFknn and IOPFknn.
However, we changed the configuration to 90%-10%, which means we are using 90% of the data for training purposes and only 10% to compose the incremental set. Even in that less favorable scenario, IOPFknn performed faster than its counterpart approach. We say "less favorable" because we are using far fewer incremental samples, i.e., we expect that the larger the incremental set, the faster IOPFknn will be.

Table 6.3: Training time [s] using the 90-10 configuration concerning the general-purpose datasets.

Dataset            Configuration    OPFknn   IOPFknn
Hyperplane         90-10             69.86     17.47
Cifar10            90-10            763.61    495.83
Mnist              90-10            810.71    610.98
Forest Covertype   90-10           5807.53   1012.13
Electricity        90-10             21.85     16.64
Poker Hand         90-10           6003.23   1549.73

6.5.3 Discussion

For most datasets in the experiments, we use a kmax value of 1% of D1, except for Hyperplane, which uses a percentage of 0.1%, and Forest Covertype and Poker Hand, which use a percentage of 0.01%. We chose 1% for datasets of moderate size and 0.1% or 0.01% for large datasets, since the bottleneck of OPFknn lies in the chosen value of kmax. Thus, the percentage that defines kmax can be chosen according to the size of the dataset. The choice of kmax remains an open problem: it is a parameter the user must determine. If running time is not a concern, kmax can be set to a larger value, up to the size of the training set. However, if computational time and cost are a concern, we recommend a lower value for kmax.

Although we do not guarantee an exact match between IOPFknn and OPFknn in terms of recognition rates, due to tie-zones and other characteristics of the algorithm, we showed that the proposed approach obtains very close (sometimes statistically similar) results while being considerably faster for training purposes.
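As a rough sketch of this percentage rule, one could pick kmax as follows; note that the dataset-size cut-offs below are illustrative assumptions, not values prescribed in this work:

```python
def choose_kmax(n_training: int) -> int:
    """Illustrative heuristic: pick kmax as a fraction of the training-set
    size, following the percentages used in the experiments (1% for
    moderate datasets, 0.1% and 0.01% for larger ones). The size
    thresholds are assumptions for the sake of the sketch."""
    if n_training <= 100_000:
        fraction = 0.01     # 1% for moderate-size datasets
    elif n_training <= 1_000_000:
        fraction = 0.001    # 0.1% for large datasets
    else:
        fraction = 0.0001   # 0.01% for very large datasets
    return max(1, int(n_training * fraction))
```

For example, a training set of 10,000 samples would use kmax = 100 under this sketch, while a two-million-sample set would use kmax = 200.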
6.6 Partial Considerations

Incremental learning aims at dealing with dynamic scenarios while keeping the model updated and robust to changes in the data flow. Non-technical losses are a well-known issue continuously faced by electric power companies, mainly in developing countries. In such a scenario, consumers try to hack the grid in order to pay lower bills, and their profile changes over time as well. Dealing with these thefts requires a classifier that can adjust itself quickly, thus adapting to the new consumers' behavior.

In this work, we proposed IOPFknn, a variant of the OPFknn classifier based on the incremental learning framework, i.e., when a new sample (or a subset of samples) becomes available for training, there is no need to train the model over the entire training set again: IOPFknn can adjust itself using the new samples only. Experiments on NTL and general-purpose datasets support this statement. Concerning future works, we intend to elaborate on solutions that can deal with tie-zones effectively, but without paying the price of a drop in efficiency.

Chapter 7

UNSUPERVISED OPF

In this chapter, we investigate the behavior of unsupervised OPF in non-stationary environments. The unsupervised OPF algorithm was slightly modified for these experiments: since unsupervised OPF does not work with "true labels," we adapted it to compute the accuracy of the clustering results. We call the unsupervised OPF "OPF Cluster," and for the "OPF Supervised" we apply the complete-graph OPF version.

We performed two stages of experiments. In the first stage, inspired by the methodology used in Chapter 4, we follow the accuracy of supervised and unsupervised OPF under different approaches: traditional, with sliding windows of size one and size three, and full-memory.
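These training regimes can be summarized by a small, hypothetical driver function; `train_fn` and `eval_fn` are placeholders for the classifier used (OPF in our experiments), and all names here are illustrative:

```python
def evaluate_stream(batches, train_fn, eval_fn, mode="traditional", window=3):
    """Sketch of the batch-evaluation regimes: 'traditional' trains once
    on the first batch and never retrains; 'window' retrains on the last
    `window` batches (window=1 is the no-memory setting); 'full' retrains
    on all batches seen so far (full-memory)."""
    accuracies = []
    model = train_fn([batches[0]])          # always start from the first batch
    for t in range(1, len(batches)):
        accuracies.append(eval_fn(model, batches[t]))   # test-then-train
        if mode == "window":
            model = train_fn(batches[max(0, t - window + 1): t + 1])
        elif mode == "full":
            model = train_fn(batches[: t + 1])
        # 'traditional': keep the model trained on the first batch
    return accuracies
```

With a toy `train_fn` that simply concatenates its training batches, the three modes differ only in how much history the model sees at each step.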
In the second stage, we used the accuracy decay to signal a concept drift and evaluated the detection with different threshold values. The datasets used in the experiments were Hyperplane, SEA, Usenet1, and Usenet2.

In the first stage, we performed four experiments. The first follows the "traditional" way: we train only on the first batch and classify the rest of the data stream. In the second and third experiments, we used a sliding window to traverse the data and train a classifier with the most recent information, with a window of one batch and of three batches, respectively. In the fourth experiment, we used all the information previously received to train a classifier and then performed the classification.

Figure 7.1 depicts the accuracy rates of the first experiment considering the datasets mentioned above. We can see that the classifier trained on the first batch with supervised OPF achieves higher accuracy over the data stream than unsupervised OPF in the Hyperplane and SEA datasets. We also observe that, in the Usenet datasets, the accuracy of both classifiers is similar.

Figure 7.2 depicts the accuracy rates of the second experiment, training with a window of one batch.
Figure 7.1: Accuracy over the dataset stream in the first experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2. Some concept drifts are highlighted.

As in the first experiment, the supervised OPF has higher accuracy over the stream data in the Hyperplane dataset and is slightly superior in the SEA dataset. We also observe that the accuracy of both classifiers is similar in the Usenet datasets, with the unsupervised OPF slightly better in some batches of the Usenet2 dataset.

Figure 7.2: Accuracy over the dataset stream in the second experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2. Some concept drifts are highlighted.

Figure 7.3 depicts the accuracy rates of the third experiment, with a training window of three batches.
Again, the supervised OPF has higher accuracy over the stream data in the Hyperplane dataset and at the beginning of the SEA dataset. In the Usenet datasets, the accuracy of both classifiers is similar.

Figure 7.3: Accuracy over the dataset stream in the third experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2. Some concept drifts are highlighted.

Figure 7.4 depicts the accuracy rates of the fourth experiment, using all the previously seen data to train a classifier. In the Hyperplane dataset, the supervised OPF has higher accuracy at the beginning, and from the middle of the stream onwards both classifiers behave similarly. In the SEA dataset, the classifiers are similar, except for the first eight batches, where supervised OPF performs better. The accuracy of unsupervised OPF in the Usenet1 dataset is slightly better at some moments, and in the Usenet2 dataset the better performance alternates between the two classifiers.
Figure 7.4: Accuracy over the dataset stream in the fourth experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2. Some concept drifts are highlighted.

In the second stage, we performed the same four experiments as in the previous stage, but here measuring what we call "accuracy in drift detection." We estimated this accuracy as the number of drifts correctly flagged divided by the real number of drifts occurring in the dataset. Drift detection is determined through a threshold: when the drop in accuracy reaches a given value (1%, 5%, 20%, 30%, 40%, or 50%), we assume a drift has occurred.

Figure 7.5 depicts the drift-detection accuracy of the first experiment considering the datasets mentioned above. We can see that larger threshold values prevent drifts from being flagged, contributing to the lack of accuracy in detecting drift. In the Hyperplane, SEA, and Usenet2 datasets, thresholds above 5% are already too high. In Hyperplane, both classifiers performed similarly, and in the SEA and Usenet2 datasets, unsupervised OPF detects drifts slightly better when the threshold is set to 1%. In the Usenet1 dataset, supervised OPF performs better with the 1% and 5% thresholds, whereas unsupervised OPF behaves the same from 1% up to a 30% threshold.
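The detection rule and the "accuracy in drift detection" measure described above can be sketched as follows; treating the drop as an absolute difference between consecutive batch accuracies is an assumption of this sketch:

```python
def detect_drifts(accuracies, threshold):
    """Flag a drift at batch t when accuracy drops by at least
    `threshold` (as a fraction) relative to the previous batch."""
    flagged = []
    for t in range(1, len(accuracies)):
        if accuracies[t - 1] - accuracies[t] >= threshold:
            flagged.append(t)
    return flagged

def detection_accuracy(flagged, true_drifts):
    """'Accuracy in drift detection': drifts correctly flagged divided
    by the real number of drifts occurring in the stream."""
    hits = len(set(flagged) & set(true_drifts))
    return hits / len(true_drifts) if true_drifts else 0.0
```

For instance, with batch accuracies [0.90, 0.85, 0.60, 0.62] and a 20% threshold, only the drop at the third batch is flagged; a 5% threshold flags the second batch as well, illustrating why small thresholds catch more (but possibly spurious) drifts.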
Figure 7.5: Drift-detection accuracy over the dataset stream in the first experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2.

Figure 7.6 depicts the drift-detection accuracy of the second experiment. In the SEA dataset, unsupervised OPF was unable to detect drift, whereas in Hyperplane the 20% threshold was also already too high. In the Usenet1 dataset, both classifiers behaved similarly, but for thresholds from 20% to 30% the unsupervised OPF kept a stable performance. In Usenet2, unsupervised OPF performed better than supervised OPF for thresholds between 5% and 20%.

Figure 7.7 depicts the drift-detection accuracy of the third experiment. Again, in the SEA and Hyperplane datasets the 20% threshold was already too high, hence both supervised and unsupervised OPF were unable to detect drift. In the Usenet1 and Usenet2 datasets, unsupervised OPF behaves very well up to the 20% threshold.

Figure 7.8 depicts the drift-detection accuracy of the fourth experiment. In the Hyperplane dataset, unsupervised OPF was unable to detect drift with thresholds above 5%, whereas for supervised OPF this limit was 20%; when the threshold varies from 1% to 5%, supervised OPF performed better than the unsupervised version. In the SEA dataset, both behave equally.
In the Usenet1 dataset, thresholds from 20% to 50% did not alter the behavior of unsupervised OPF. In the Usenet2 dataset, unsupervised OPF has better accuracy than supervised OPF at the 5% and 30% thresholds.

Figure 7.6: Drift-detection accuracy over the dataset stream in the second experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2.

Figure 7.7: Drift-detection accuracy over the dataset stream in the third experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2.

We noticed that the performance of unsupervised OPF was slightly worse than that of supervised OPF. This result was expected, since supervised OPF benefits from the label information in the training phase. Regarding drift-detection accuracy, as observed in the experimental results, unsupervised OPF performed better in some situations, worse in others, and sometimes behaved the same as supervised OPF. We therefore consider that unsupervised OPF can also be applied in environments with concept drift.
Figure 7.8: Drift-detection accuracy over the dataset stream in the fourth experiment: a) Hyperplane b) SEA c) Usenet1 d) Usenet2.

Although not presented in this work, we also performed experiments employing measures of dissimilarity between distributions to detect concept drift. Among the chosen measures, we applied the Kullback-Leibler divergence (which, despite not being a distance metric, is widely used in the concept drift literature for unsupervised environments) and the Hellinger distance to detect changes in the data distributions (GOLDENBERG; WEBB, 2019). However, we did not obtain relevant results and did not include them in the present work.

We verified the need for further studies on unsupervised OPF in environments where we have no information about the labels. To determine suitable mechanisms for drift detection in these environments, as future work we will examine other distance measures applicable to unsupervised data. In situations where the available data is not labeled, we have to use some clustering method. As a concept drift example, at time t we can have n clusters, and at time t+1 we have n′ clusters, with n ≠ n′. Identifying whether a new class has appeared is a difficult problem of scale (e.g., when a small cluster is close to an existing one), and detecting whether there is a concept drift (new class appearance, or concept evolution) becomes a problem. The unsupervised OPF does not rely on a user-defined number of clusters.
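The two dissimilarity measures mentioned above can be sketched directly from their definitions over discrete distributions (e.g., normalized feature histograms of two batches); the smoothing constant `eps` is an implementation assumption to keep the logarithm well-defined:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions;
    it is a proper metric bounded in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q); asymmetric and not a
    metric, but widely used to quantify drift. `eps` avoids log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
```

A drift detector based on these measures would compare the histogram of the current batch against a reference histogram and signal drift when the dissimilarity exceeds a threshold, analogously to the accuracy-drop rule used earlier.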
A positive point of unsupervised OPF (in problems in which the number of clusters is unknown) is that it finds the number of clusters at runtime. As future work, we intend to fill this gap by proposing an algorithm that finds a fixed number of clusters with unsupervised OPF and by validating this approach on the concept drift problem. Another direction to be explored is the following: suppose that at time t we have n clusters, and at time t+1 we have n′ = n+1 clusters. A more detailed analysis of this "new cluster" will tell us whether it characterizes a concept drift, i.e., whether the new cluster is significant. The new cluster may not contain a significant number of samples, being an irrelevant cluster. To handle this situation with unsupervised OPF, we can propose two main approaches: (i) removing the irrelevant cluster, or (ii) merging it with the closest cluster or with a more strongly connected cluster, creating an expanded cluster. By "closest" we mean the conventional distance between the prototypes of each cluster; by "strongly connected" we mean the cluster that would offer the highest sum of costs to the merged irrelevant cluster (recall that unsupervised OPF aims to maximize the cost of each sample).

Chapter 8

CONCLUSIONS

In this Ph.D. thesis, we proposed to apply supervised and unsupervised approaches based on the Optimum-Path Forest algorithm in dynamic environments to deal with concept drift. We also proposed an incremental learning approach concerning supervised OPF, named Incremental OPFknn (IOPFknn). In Chapter 2, we presented a bibliographical review concerning concept drift. In Chapter 3, we introduced the Optimum-Path Forest classifier in its supervised (complete graph and k-nn graph) and unsupervised versions. This Ph.D. thesis thus attempted to answer the following questions: i) can OPF be efficient in handling non-stationary environments?
ii) can the OPF classifier be robust enough to address the concept drift problem? iii) can we provide an effective OPF implementation to handle dynamic environments?

Regarding an efficient OPF implementation to handle non-stationary environments, in Chapter 4 we presented an evaluation of the OPF classifier in a concept drift environment using full-memory, no-memory, and window approaches against other classifiers. In this experiment, we divided the database into a stream of batches received by the algorithm. The algorithm trained on the first batch, and the following ones were classified and later used for training, a study similar to that of Klinkenberg and Joachims (KLINKENBERG; JOACHIMS, 2000). As mentioned above, we employed supervised OPF in three ways: full-memory (Figure 2.2), no-memory (Figure 2.3), and a sliding window with fixed size (Figure 2.4). We compared standard OPF, OPFknn, and SVM, each with three distinct versions to address concept drift, on synthetic datasets. The experiments showed that OPF is suitable for these dynamic scenarios, since its recognition rates were considerably better when adapted to address concept drift. The OPF versions have lower training time than the SVM versions, and in some datasets they also have lower testing time, which is an advantage. We also assessed the OPF classifier with a drift detection method based on accuracy in Section 5.5. In this study, we adopted the OPF classifier with a drift detection method based on accuracy values together with a window of dynamic size. The selection of a "good" window size in fixed sliding-window methods is a trade-off between fast adaptability to the data (small window) and good generalization in phases with no concept drift (large window) (KLINKENBERG; JOACHIMS, 2000).
The method uses an empirical threshold to detect changes in the accuracy values: the window size is decreased to one batch when the accuracy drops below this threshold; if there is no drift, the new batch is added to the training set to increase stability. The experiments over real and synthetic datasets compared the standard OPF, OPF-fullmemory, OPF-nomemory, and OPF-window3 approaches against OPF-dynamic-window to address concept drift. We showed that OPF-dynamic-window achieves accuracy similar to OPF-fullmemory or OPF-window3, depending on how the window behaves along the data stream.

Concerning whether OPF classifiers are robust enough to address the concept drift problem, in Chapter 5 we evaluated an ensemble of OPF classifiers on the concept drift problem. We used a committee of classifiers that makes decisions using the previous approaches and different decision rules to choose the final result. The three voting mechanisms were "combined," "weighted," and "major." The "combined" voting mechanism chooses the most voted result among the base classifiers' outputs as the classification result. The "weighted" voting mechanism weighs each base classifier with a different voting relevance based on its previous classification accuracy. The "major" voting mechanism gives additional weight to the base classifier with the highest accuracy on the previous batch.

The experiments with this method had three parts. In the first experiment, we combined the results of the full-memory, no-memory, and window approaches using the three voting mechanisms to obtain the committee's decision. In the second experiment, we used modified versions of the previous approach: for each of the full-memory, no-memory, and window methods, we divided the data into three parts instead of using only one training set. In other words, in the no-memory version, the most recent batch is divided into three parts, each trained with OPF, and the classification is given by the combination methods above.
Similarly, the full-memory version gathers all previously seen data into a large dataset and divides it into three parts; we trained each one with OPF, and the combination mechanisms made the decisions. In the window version, a sliding window of size three keeps the data and divides it into three parts, and the results are combined using one of the three voting mechanisms. In the third experiment, each batch trains one classifier: for each new batch to be classified, the committee considers the last three trained models, and the results are combined using the three approaches described above. The experiments over real and synthetic datasets compared the standard OPF, OPF-fullmemory, OPF-nomemory, and OPF-window3 approaches against the three proposed ensemble-based versions to address concept drift. We showed that the OPF ensemble is suitable for these dynamic scenarios, since its recognition rates were considerably better than those of traditional OPF and of OPF with concept drift handling but no ensemble learning. In the second experiment, we also verified that the combined methods have better accuracy than the results obtained on each fold individually. We further observed that the main advantage of the third experiment over OPF-window3 is its training time, since training on smaller data is faster, i.e., training three individual batches is faster than training three merged batches.

Regarding whether we can provide an effective OPF implementation to handle dynamic environments, in Chapter 6 we proposed an OPFknn approach with incremental learning. Learning one sample at a time can be useful in large and continuous data-stream environments. In environments that require the learner to be updated constantly, since repeating the entire learning process might be prohibitive, adjusting the model to the new data proves to be a better choice, particularly for real-time response applications.
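A minimal sketch of the three voting mechanisms described earlier ("combined," "weighted," and "major") is given below; the exact weighting scheme used in our experiments may differ, so the weights and the bonus value here are illustrative assumptions:

```python
from collections import Counter

def combined_vote(predictions):
    """'Combined': plain majority vote among base-classifier outputs."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """'Weighted': each base classifier votes with a weight derived
    from its accuracy on earlier batches."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

def major_vote(predictions, prev_accuracies, bonus=1.0):
    """'Major': like a majority vote, but the classifier that was most
    accurate on the previous batch receives an additional weight
    (`bonus` is an assumed value for this sketch)."""
    weights = [1.0] * len(predictions)
    weights[prev_accuracies.index(max(prev_accuracies))] += bonus
    return weighted_vote(predictions, weights)
```

Note that a sufficiently large bonus lets the historically best classifier override the plain majority, which is the intended behavior of the "major" rule.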
Whenever traditional supervised OPF executes the training step, it trains on the entire training data, regardless of the number of new samples added. This can be costly when new instances are frequently added. Therefore, an incremental OPF that adds new instances to the optimum-path forest and updates the costs of the trees without retraining on the entire database is very convenient. We named our method Incremental OPFknn (IOPFknn); with it, the user can interact with the classifier by constantly changing the labeled training dataset. We tested the method on data provided by energy companies, which release data regularly and thus require the learner to be updated constantly. Although we do not guarantee an exact match between IOPFknn and OPFknn in terms of recognition rates, due to tie-zones and other characteristics of the algorithm, we showed that the proposed approach obtains very close (sometimes statistically similar) results while being considerably faster for training purposes.

We conducted experiments with unsupervised OPF in non-stationary environments in Chapter 7. We compared unsupervised OPF with supervised OPF to show that the former can also operate in these types of environments. We also performed experiments using distance measures to detect drift in unsupervised environments; however, we did not reach highly relevant results and did not present them in this work.

Concerning future works, we intend to extend the IOPFknn work and elaborate solutions that can deal with tie-zones effectively but without paying the price of a drop in efficiency. Unsupervised environments are a challenging area in terms of concept drift. Some authors use distance measures, statistical distances, and dissimilarity measures, among others, to quantify and detect concept drift.
Thus, as future works, we intend to further study other metrics used in the literature, and we also propose methods to deal with concept evolution. Essentially, our proposals involve semi-supervised and unsupervised applications.

REFERENCES

ABDULSALAM, H.; SKILLICORN, D.; MARTIN, P. Classification using streaming random forests. IEEE Transactions on Knowledge and Data Engineering, v. 23, n. 1, p. 22–36, 2011.

AMORIM, W. P. et al. Improving semi-supervised learning through optimum connectivity. Pattern Recognition, v. 60, n. Supplement C, p. 72–85, 2016.

ARABMAKKI, E.; KANTARDZIC, M. Som-based partial labeling of imbalanced data stream. Neurocomputing, v. 262, p. 120–133, 2017.

BERTINI, J.; NICOLETTI, M.; ZHAO, L. Ensemble of complete p-partite graph classifiers for non-stationary environments. In: IEEE Congress on Evolutionary Computation. [S.l.: s.n.], 2013. p. 1802–1809.

BIFET, A.; GAVALDÀ, R. Learning from time-changing data with adaptive windowing. In: SIAM International Conference on Data Mining. [S.l.: s.n.], 2007.

BUZAU, M. M. et al. Detection of non-technical losses using smart meter data and supervised learning. IEEE Transactions on Smart Grid, v. 10, n. 3, p. 2661–2670, 2019.

CHEN, C. H. et al. A GA-based approach for mining membership functions and concept-drift patterns. In: IEEE Congress on Evolutionary Computation (CEC). [S.l.: s.n.], 2015. p. 2961–2965. ISSN 1089-778X.

CHEN, S.; HE, H. SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining. In: International Joint Conference on Neural Networks (IJCNN). [S.l.: s.n.], 2009. p. 522–529. ISSN 1098-7576.

DÍAZ, A. O. et al. Fast adapting ensemble: A new algorithm for mining data streams with concept drift. The Scientific World Journal, v. 2015, 2015.

DITZLER, G.; POLIKAR, R. An ensemble based incremental learning framework for concept drift and class imbalance. In: The 2010 International Joint Conference on Neural Networks (IJCNN).
[S.l.: s.n.], 2010. p. 1–8. ISSN 1098-7576.

DITZLER, G.; POLIKAR, R. Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, Los Alamitos, CA, USA, v. 25, n. 10, p. 2283–2301, 2013. ISSN 1041-4347.

DITZLER, G.; POLIKAR, R.; CHAWLA, N. An incremental learning algorithm for non-stationary environments and class imbalance. In: 20th International Conference on Pattern Recognition (ICPR). [S.l.: s.n.], 2010. p. 2997–3000. ISSN 1051-4651.

ELWELL, R.; POLIKAR, R. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, v. 22, n. 10, p. 1517–1531, Oct 2011. ISSN 1045-9227.

ESCOVEDO, T. et al. Learning under concept drift using a neuro-evolutionary ensemble. International Journal of Computational Intelligence and Applications, v. 12, n. 04, p. 1340002, 2013.

FALCÃO, A.; STOLFI, J.; LOTUFO, R. The image foresting transform: theory, algorithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 26, n. 1, p. 19–29, 2004.

FAN, W. Systematic data selection to mine concept-drifting data streams. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [S.l.]: ACM, 2004. p. 128–137.

FARID, D. et al. An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications, v. 40, n. 15, p. 5895–5906, 2013.

FERNANDES, S. E. N.; PAPA, J. P. Pruning optimum-path forest ensembles using quaternion-based optimization. In: 2017 International Joint Conference on Neural Networks (IJCNN). [S.l.: s.n.], 2017. p. 984–991.

FERNANDES, S. E. N.; PAPA, J. P. Improving optimum-path forest learning using bag-of-classifiers and confidence measures. Pattern Analysis and Applications, Springer London, p. 703–716, 2019.

FERNANDES, S. E. N. et al. A probabilistic optimum-path forest classifier for non-technical losses detection.
IEEE Transactions on Smart Grid, v. 10, n. 3, p. 3226–3235, 2019.

FERNANDES, S. E. N. et al. Pruning optimum-path forest ensembles using metaheuristic optimization for land-cover classification. International Journal of Remote Sensing, v. 38, p. 5736–5762, 2017.

GAMA, J. et al. Learning with drift detection. In: SBIA Brazilian Symposium on Artificial Intelligence. [S.l.]: Springer Verlag, 2004. v. 3171, p. 286–295.

GAMA, J. et al. A survey on concept drift adaptation. ACM Computing Surveys, ACM, v. 46, n. 4, p. 44:1–44:37, 2014.

GAO, J. et al. A general framework for mining concept-drifting data streams with skewed distributions. In: Proceedings of the SIAM International Conference on Data Mining (SDM '07). [S.l.: s.n.], 2007.

GAO, Y. et al. Saccos: A semi-supervised framework for emerging class detection and concept drift adaption over data streams. IEEE Transactions on Knowledge and Data Engineering, p. 1–1, 2020.

GARCÍA, M. et al. Early drift detection method. In: Fourth International Workshop on Knowledge Discovery from Data Streams. [S.l.: s.n.], 2006. p. 77–86.

GENG, X.; SMITH-MILES, K. Incremental learning. In: Encyclopedia of Biometrics. Boston, MA: Springer US, 2009. p. 731–735.

GOLDENBERG, I.; WEBB, G. I. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowledge & Information Systems, v. 60, n. 2, p. 591–615, 2019. ISSN 02191377.

HAREL, M. et al. Concept drift detection through resampling. In: Proceedings of the 31st International Conference on Machine Learning. [S.l.]: JMLR Workshop and Conference Proceedings, 2014. p. 1009–1017.

HARRIES, M. SPLICE-2 Comparative Evaluation: Electricity Pricing. [S.l.], 1999.

HEGEDUS, I.; ORMÁNDI, R.; JELASITY, M. Massively distributed concept drift handling in large networks. Advances in Complex Systems, v. 16, n. 04n05, p. 1350021, 2013.

HEUSINGER, M.; RAAB, C.; SCHLEIF, F. Passive concept drift handling via variations of learning vector quantization.
Neural Computing and Applications, 2020. ISSN 1433-3058. Available from Internet: .

IWASHITA, A. et al. A path- and label-cost propagation approach to speedup the training of the optimum-path forest classifier. Pattern Recognition Letters, Elsevier Science Inc., v. 40, p. 121–127, 2014.

IWASHITA, A. et al. An incremental optimum-path forest classifier and its application to non-technical losses identification. Computers and Electrical Engineering (submitted). 2021.

IWASHITA, A. S.; ALBUQUERQUE, V. H. C. de; PAPA, J. P. Learning concept drift with ensembles of optimum-path forest-based classifiers. Future Generation Computer Systems, v. 95, p. 198–211, 2019.

IWASHITA, A. S.; PAPA, J. P. Learning concept drift with optimum-path forest. In: 23rd Iberoamerican Congress on Pattern Recognition (CIARP). [s.n.], 2018. p. 13–15. Late breaking works. Available from Internet: .

IWASHITA, A. S.; PAPA, J. P. An overview on concept drift learning. IEEE Access, v. 7, p. 1532–1547, 2019.

KATAKIS, I.; TSOUMAKAS, G.; VLAHAVAS, I. An ensemble of classifiers for coping with recurring contexts in data streams. In: Proceedings of the 2008 Conference on ECAI 2008: 18th European Conference on Artificial Intelligence. [S.l.]: IOS Press, 2008. p. 763–764.

KATAKIS, I.; TSOUMAKAS, G.; VLAHAVAS, I. Incremental Clustering for the Classification of Concept-Drifting Data Streams. 2008.

KATAKIS, I.; TSOUMAKAS, G.; VLAHAVAS, I. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, v. 22, n. 3, p. 371–391, 2010. ISSN 0219-3116.

KENNEDY, J.; EBERHART, R. Particle swarm optimization. In: Proceedings of International Conference on Neural Networks. [S.l.: s.n.], 1995. v. 4, p. 1942–1948.

KHAMASSI, I. et al. Self-adaptive windowing approach for handling complex concept drift. Cognitive Computation, v. 7, n. 6, p. 772–790, 2015.

KIM, J. Y. et al.
Detection for non-technical loss by smart energy theft with intermediate monitor meter in smart grid. IEEE Access, v. 7, p. 129043–129053, 2019. KLINKENBERG, R. Using labeled and unlabeled data to learn drifting concepts. In: Workshop notes of IJCAI-01 Workshop on Learning from Temporal and Spatial Data. [S.l.]: AAAI Press, 2001. p. 16–24. KLINKENBERG, R.; JOACHIMS, T. Detecting concept drift with support vector machines. In: Proceedings of the Seventeenth International Conference on Machine Learning. [S.l.]: Morgan Kaufmann, 2000. p. 487–494. KLINKENBERG, R.; RENZ, I. Adaptive information filtering: Learning in the presence of concept drifts. In: Workshop Notes of the ICML/AAAI-98 Workshop Learning for Text Categorization. [S.l.]: AAAI Press, 1998. p. 33–40. KOLTER, J.; MALOOF, M. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, JMLR.org, v. 8, p. 2755–2790, 2007. KUBAT, M.; WIDMER, G. Adapting to drift in continuous domains. In: Proceedings of the 8th European Conference on Machine Learning. [S.l.]: Springer, 1995. p. 307–310. KUNCHEVA, L. I. Classifier ensembles for detecting concept change in streaming data: Overview and perspectives. In: Proceedings of the 2nd Workshop SUEMA 2008 (ECAI 2008). [S.l.: s.n.], 2008. p. 5–10. KURLEJ, B.; WOZNIAK, M. Active learning approach to concept drift problem. Logic Journal of IGPL, v. 20, n. 3, p. 550–559, 2012. LI, P.; WU, X.; HU, X. Learning from concept drifting data streams with unlabeled data. Neurocomputing, Elsevier Science Publishers B. V., v. 92, p. 145–155, 2012. ISSN 0925-2312. LI, P. et al. Learning concept-drifting data streams with random ensemble decision trees. Neurocomputing, v. 166, p. 68 – 83, 2015. LICHMAN, M. UCI Machine Learning Repository. 2013. Available from Internet: . LIU, A.; LU, J.; ZHANG, G. Concept drift detection via equal intensity k-means space partitioning. 
IEEE Transactions on Cybernetics, Institute of Electrical and Electronics Engineers (IEEE), p. 1–14, 2020. ISSN 2168-2275. Available from Internet: . LIU, S. et al. Concept drift detection for data stream learning based on angle optimized global embedding and principal component analysis in sensor networks. Computers & Electrical Engineering, v. 58, p. 327 – 336, 2017. LOEFFEL, P. X.; MARSALA, C.; DETYNIECKI, M. Classification with a reject option under concept drift: The droplets algorithm. In: IEEE International Conference on Data Science and Advanced Analytics (DSAA). [S.l.: s.n.], 2015. p. 1–9. LU, N.; ZHANG, G.; LU, J. Concept drift detection via competence models. Artificial Intelligence, v. 209, p. 11 – 28, 2014. References 120 MARTINS, A. V. et al. Non-intrusive energy meter for non-technical losses identification. IEEE Transactions on Instrumentation and Measurement, p. 1–1, 2019. MASUD, M. et al. Detecting recurring and novel classes in concept-drifting data streams. In: IEEE 11th International Conference on Data Mining. [S.l.: s.n.], 2011. p. 1176–1181. MASUD, M. et al. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering, v. 23, n. 6, p. 859–874, 2011. MINKU, L.; YAO, X. DDD: A new ensemble approach for dealing with concept drift. IEEE Transactions on Knowledge and Data Engineering, v. 24, n. 4, p. 619–633, April 2012. ISSN 1041-4347. MIRZA, B.; LIN, Z.; LIU, N. Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift. Neurocomputing, v. 149, Part A, p. 316 – 329, 2015. MITCHELL, T. M. Machine Learning. 1. ed. New York, NY, USA: McGraw-Hill, Inc., 1997. ISBN 0070428077, 9780070428072. MITCHELL, T. M. et al. Experience with a learning personal assistant. Communications of the ACM, ACM, New York, NY, USA, v. 37, n. 7, p. 80–91, July 1994. ISSN 0001-0782. PAPA, J.; FALCA˜O, A.; SUZUKI, C. 
Supervised pattern classification based on optimum-path forest. International Journal of Imaging Systems and Technology, Wiley-Interscience, v. 19, n. 2, p. 120–131, 2009. PAPA, J. et al. Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition, v. 45, n. 1, p. 512 – 520, 2012. PAPA, J. P.; FALCA˜O, A. X. A new variant of the optimum-path forest classifier. In: Proceeding of the 4th International Symposium on Advances in Visual Computing. Berlin, Heidelberg: Springer-Verlag, 2008. p. 935–944. ISBN 978-3-540-89638-8. PAPA, J. P.; FALCA˜O, A. X. A new variant of the optimum-path forest classifier. In: Advances in Visual Computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. v. 5358, p. 935–944. PAPA, J. P.; FALCA˜O, A. X. A learning algorithm for the optimum-path forest classifier. In: TORSELLO, A.; ESCOLANO, F.; BRUN, L. (Ed.). Graph-Based Representations in Pattern Recognition. [S.l.]: Springer Berlin Heidelberg, 2009, (Lecture Notes in Computer Science, v. 5534). p. 195–204. ISBN 978-3-642-02123-7. PAPA, J. P. et al. Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition, Elsevier Science Inc., New York, NY, USA, v. 45, n. 1, p. 512–520, 2012. PAPA, J. P.; FALCA˜O, A. X.; SUZUKI, C. T. M. Supervised pattern classification based on optimum-path forest. International Journal of Imaging Systems and Technology, John Wiley & Sons, Inc., New York, NY, USA, v. 19, n. 2, p. 120–131, 2009. ISSN 0899-9457. PAPA, J. P.; FERNANDES, S. E. N.; FALCA˜O, A. X. Optimum-path forest based on k-connectivity: Theory and applications. Pattern Recognition Letters, v. 87, n. 1, p. 117–126, 2017. References 121 PAPA, J. P.; SUZUKI, C. T. N.; X, A. Libopf: A library for the design of optimum-path forest classifiers. Software version 2.1 available at http://www.ic.unicamp.br/ afal- cao/libopf/index.html. PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. 
Journal of Machine Learning Research, v. 12, p. 2825–2830, 2011. PONTI, M.; RIVA, M. An incremental linear-time learning algorithm for the optimum-path forest classifier. Information Processing Letters, v. 126, p. 1 – 6, 2017. PONTI, M. P.; PAPA, J. P. Improving accuracy and speed of optimum-path forest classifier using combination of disjoint training subsets. In: SANSONE, C.; KITTLER, J.; ROLI, F. (Ed.). Multiple Classifier Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. p. 237–248. ISBN 978-3-642-21557-5. PONTI, M. P.; ROSSI, I. Ensembles of optimum-path forest classifiers using input data manipulation and undersampling. In: ZHOU, Z.; ROLI, F.; KITTLER, J. (Ed.). Multiple Classifier Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. p. 236–246. RAMOS, C. C. O. et al. On the study of commercial losses in brazil: A binary black hole algorithm for theft characterization. IEEE Transactions on Smart Grid, v. 9, n. 2, p. 676–683, 2016. RAMOS, C. C. O. et al. A new approach for nontechnical losses detection based on optimum-path forest. IEEE Transactions on Power Systems, v. 26, n. 1, p. 181–189, 2011. RAZA, H.; PRASAD, G.; YUHUA, L. Adaptive learning with covariate shift-detection for non-stationary environments. In: 14th UK Workshop on Computational Intelligence (UKCI). [S.l.: s.n.], 2014. p. 1–8. RAZAVI, R.; FLEURY, M. Socio-economic predictors of electricity theft in developing countries: An indian case study. Energy for Sustainable Development, v. 49, p. 1 – 10, 2019. RIBEIRO, P. B.; PAPA, J. P.; ROMERO, R. A. F. An ensemble-based approach for breast mass classification in mammography images. In: SPIE Medical Imaging. [S.l.: s.n.], 2017. p. 101342N–1–101342N–8. ROCHA, L.; CAPPABIANCO, F.; FALCA˜O, A. Data clustering as an optimum-path forest problem with applications in image analysis. International Journal of Imaging Systems and Technology, Wiley Periodicals, v. 19, n. 2, p. 50–68, 2009. RODRIGUES, D. et al. 
Eeg-based person identification through binary flower pollination algorithm. Expert Systems with Applications, v. 62, p. 81–90, 2016. RUSSELL, S. J.; NORVIG, P. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1995. ISBN 0-13-103805-2. SALGANICOFF, M. Density-adaptive learning and forgetting. In: Proceedings of the Tenth International Conference on Machine Learning. [S.l.: s.n.], 1993. p. 276–283. SCHLIMMER, J.; GRANGER, R. Beyond incremental processing: Tracking concept drift. In: KEHLER, T.; ROSENSCHEIN, S. (Ed.). Proceedings of the Fifth National Conference on Artificial Intelligence. [S.l.: s.n.], 1986. p. 502–507. References 122 SCHOLZ, M.; KLINKENBERG, R. An ensemble classifier for drifting concepts. In: Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams. [S.l.: s.n.], 2005. p. 53–64. SETHI, T. S.; KANTARDZIC, M. On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications, v. 82, p. 77 – 99, 2017. SILVA, B. da et al. Dealing with non-stationary environments using context detection. In: Proceedings of the 23rd International Conference on Machine Learning. [S.l.]: ACM, 2006. p. 217–224. SILVA, J. de A.; HRUSCHKA, E. R.; GAMA, J. An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications, v. 67, p. 228 – 238, 2017. SPINOSA, E. et al. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In: Proceedings of the 2008 ACM Symposium on Applied Computing. [S.l.]: ACM, 2008. (SAC’08), p. 976–980. ISBN 978-1-59593-753-7. STANLEY, K. Learning Concept Drift with a Committee of Decision Trees. [S.l.], 2003. STREET, W.; KIM, Y. A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [S.l.]: ACM, 2001. p. 377–382. 
SUN, J.; LI, H. Dynamic financial distress prediction using instance selection for the disposal of concept drift. Expert Systems with Applications, v. 38, n. 3, p. 2566–2576, 2011.
SUSNJAK, T.; BARCZAK, A.; HAWICK, K. Adaptive cascade of boosted ensembles for face detection in concept drift. Neural Computing and Applications, Springer-Verlag, v. 21, n. 4, p. 671–682, 2012. ISSN 0941-0643.
TENNANT, M. et al. Scalable real-time classification of data streams with concept drift. Future Generation Computer Systems, v. 75, p. 187–199, 2017.
TSYMBAL, A. The Problem of Concept Drift: Definitions and Related Work. [S.l.], 2004.
VIVEKANANDAN, P.; NEDUNCHEZHIAN, R. Mining data streams with concept drifts using genetic algorithm. Artificial Intelligence Review, Kluwer Academic Publishers, v. 36, n. 3, p. 163–178, 2011.
WADEWALE, K.; DESAI, S. Survey on method of drift detection and classification for time varying data set. International Research Journal of Engineering and Technology, v. 2, n. 9, p. 709–713, 2015.
WANG, H.; ABRAHAM, Z. Concept drift detection for imbalanced stream data. The Computing Research Repository (CoRR), abs/1504.01044, 2015.
WIDMER, G.; KUBAT, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, Kluwer Academic Publishers, v. 23, n. 1, p. 69–101, 1996.
WIDYANTORO, D. Exploiting unlabeled data in concept drift learning. Jurnal Informatika, v. 8, n. 1, p. 54–62, 2007. ISSN 1411-0105.
WILCOXON, F. Individual comparisons by ranking methods. Biometrics Bulletin, v. 1, n. 6, p. 80–83, 1945.
XU, S.; WANG, J. Dynamic extreme learning machine for data stream classification. Neurocomputing, v. 238, p. 433–449, 2017.
YALCIN, A.; ERDEM, Z.; GURGEN, F. Ensemble based incremental SVM classifiers for changing environments. In: 22nd International Symposium on Computer and Information Sciences (ISCIS). [S.l.: s.n.], 2007. p. 1–5.
YANG, Y.; WU, X.; ZHU, X. Combining Proactive and Reactive Predictions for Data Streams. New York, NY, USA: ACM, 2005. p. 710–715. (KDD ’05).
ZAREMOODI, P.; BEIGY, H.; SIAHROUDI, S. K. Novel class detection in data streams using local patterns and neighborhood graph. Neurocomputing, v. 158, p. 234–245, 2015.
ZAREMOODI, P.; SIAHROUDI, S. K.; BEIGY, H. A support vector based approach for classification beyond the learned label space in data streams. In: Proceedings of the 31st Annual ACM Symposium on Applied Computing. New York, NY, USA: ACM, 2016. (SAC ’16), p. 910–915.
ZHANG, Y. et al. Three-layer concept drifting detection in text data streams. Neurocomputing, v. 260, p. 393–403, 2017.
ZLIOBAITE, I. Learning under concept drift: an overview. The Computing Research Repository (CoRR), 2010.

GLOSSARY

ADAGoLF – Age-based Drift handling Gossip Learning Framework
ADWIN – ADaptive WINdowing
ALCSD – Adaptive Learning with Covariate Shift-Detection
CCP – Conceptual Clustering and Prediction
CDC – Concept Drift Committee
CDDGoLF – Concept Drift Detection Gossip Learning Framework
CDGFM – Concept Drift Genetic-Fuzzy Mining
CPp-AbDG – Complete P-partite Attribute-based Decision Graph
CV – Conceptual Vectors
DARLING – Density-Adaptive Reinforcement Learning
DDD – Diversity for Dealing with Drifts
DDM – Drift Detection Method
DELM – Dynamic Extreme Learning Machine
DWM – Dynamic Weighted Majority
ECSMiner – Enhanced Classifier for data Streams with novel class Miner
EDDM – Early Drift Detection Method
EDIST2 – Error Distance Approach for Drift Detection and Monitoring
EDTC – Ensemble Decision Trees for Concept drifting data streams
EI-kMeans – Equal Intensity k-means space partitioning
EM – Ensemble Model
ESOS-ELM – Ensemble of Subset Online Sequential Extreme Learning Machine
FAE – Fast Adapting Ensemble
FDCD – Financial Distress Concept Drift
FDP – Financial Distress Prediction
FEAC-Stream – Fast Evolutionary Algorithm for Clustering data Streams
FLORA – FLOating Rough Approximation
FRANN – Floating Rough Approximation in Neural Networks
GLVQ – Generalized Learning Vector Quantization
KBS – Knowledge-Based Sampling
KDD – Knowledge Discovery and Data mining
LFR – Linear Four Rates
LOCE – LOcal Classifier Ensemble
Learn++.CDS – Learn++ with the Synthetic Minority class Oversampling TEchnique
Learn++.NIE – Learn++ and Nonstationary and Imbalanced Environments
Learn++.NSE – Learn++ and NonStationary Environments
Learn++.SMOTE – Learn++ and SMOTE
MC-NN – Micro-Cluster Nearest Neighbour
MD3 – Margin Density Drift Detection
MST – Minimum Spanning Tree
NEVE – Neuro-EVolutionary Ensemble
NTL – Non-Technical Loss
OLINDDA – OnLIne Novelty and Drift Detection Algorithm
OPFknn – Optimum-Path Forest with k-neighborhood
OPFI – Optimum-Path Forest Incremental
OPF – Optimum-Path Forest
OPT – Optimum-Path Trees
RBF – Radial Basis Functions
RL-CD – Reinforcement Learning with Context Detection
RLS-SOM – Reduced labeled Samples-Self Organizing Map
RSLVQ – Robust Soft Learning Vector Quantization
RePro – REactive plus PROactive
SACCOS – Semi-supervised Adaptive ClassifiCation Over data Stream
SCANR – Stream Classifier And Novel and Recurring class detector
SEA – Streaming Ensemble Algorithm
SERA – SElectively Recursive Approach
SUN – Semi-supervised classification algorithm for data streams with concept drifts and UNlabeled data
SVMLearn++ – SVM and Learn++
SVM – Support Vector Machines
SVSCLASS – Support Vector-based Stream CLASSifier
UCB – Uncorrelated Bagging
k-NN – k-Nearest Neighbor

Appendix A

LEARNING CONCEPT DRIFT WITH ENSEMBLES OF OPTIMUM-PATH FOREST-BASED CLASSIFIERS

This appendix extends the work presented by Iwashita et al. (IWASHITA; ALBUQUERQUE; PAPA, 2019). We performed the same experiments described in Chapter 5 with the synthetic datasets: Hyperplane and SEA.
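As a concrete illustration of the synthetic streams, the sketch below generates batches of the SEA concepts as described by Street and Kim (2001): three features drawn uniformly from [0, 10], of which only the first two are relevant, a label given by f1 + f2 ≤ θ, and 10% label noise. The batch size and the threshold schedule are illustrative assumptions, not the exact configuration used in Chapter 5.

```python
import numpy as np

def sea_batch(n, theta, noise=0.1, rng=None):
    """Generate one batch of the SEA concepts stream (STREET; KIM, 2001).

    Each sample has three features drawn uniformly from [0, 10]; only the
    first two are relevant. The label is 1 when f1 + f2 <= theta, and a
    `noise` fraction of the labels is flipped. Concept drift is simulated
    by changing theta between blocks of batches.
    """
    rng = rng if rng is not None else np.random.default_rng()
    X = rng.uniform(0.0, 10.0, size=(n, 3))
    y = (X[:, 0] + X[:, 1] <= theta).astype(int)
    flip = rng.random(n) < noise            # flip a `noise` fraction of labels
    y[flip] = 1 - y[flip]
    return X, y

# Four concepts with the thresholds used in the original SEA setup.
thresholds = [8.0, 9.0, 7.0, 9.5]
batches = [sea_batch(500, t, rng=np.random.default_rng(i))
           for i, t in enumerate(thresholds)]
```

Changing θ between blocks of batches produces the abrupt drifts that the stream classifiers must adapt to.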
Regarding the real-world dataset, we used the Forest Covertype, which contains 581,012 samples and 54 features and has been used to benchmark many stream-based classifiers. Although the idea is to assess the OPF robustness under concept drift, we also included a naïve SVM with a Radial Basis Function kernel optimized through cross-validation for comparison purposes. The "traditional SVM" parameters were optimized as follows: we divided the first batch into 50% to compose the training set and the remaining 50% to compose the validating set (used to optimize the kernel parameters). We also considered the following proportions for the training and validating sets: 60%–40%, 70%–30%, 80%–20%, and 90%–10%. This procedure was performed for all datasets, and we used the parameters that maximized SVM accuracy over the validating set among the aforementioned configurations (i.e., training and validating set percentages). The other SVM versions use a grid-search parameter estimation methodology with 90%–10% of training and validating sets, respectively. Regarding the SVM implementation, we used scikit-learn (PEDREGOSA et al., 2011). Finally, we employed the accuracy measure proposed by Papa et al. (PAPA; FALCÃO; SUZUKI, 2009).

The figures below contain the results presented in Chapter 5, but in this appendix, we add the results of the SVM versions. Figures A.1 to A.3 show the results evaluated in Experiment 1 and described in Section 5.2.

Figure A.1 depicts the results concerning the Covertype dataset. We can observe that standard SVM obtained the best results among the SVM versions. Table A.1 presents the mean accuracy and standard deviation concerning all batches.

Figure A.1: Experimental results concerning Covertype dataset with respect to Experiment 1.

Table A.1: Covertype dataset with respect to Experiment 1.
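The validation-based parameter search described above can be sketched with scikit-learn as follows; the parameter grid and the split seed are illustrative assumptions, not the exact values used in the experiments.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

def tune_rbf_svm(X_batch, y_batch, train_fraction=0.5, seed=0):
    """Split one batch into training/validating sets and return the
    RBF-kernel parameters that maximize validation accuracy."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_batch, y_batch, train_size=train_fraction,
        stratify=y_batch, random_state=seed)
    # Illustrative grid over the two RBF-kernel hyperparameters.
    grid = ParameterGrid({"C": [0.1, 1.0, 10.0, 100.0],
                          "gamma": [1e-3, 1e-2, 1e-1, 1.0]})
    best_params, best_acc = None, -1.0
    for params in grid:
        clf = SVC(kernel="rbf", **params).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)       # validation accuracy
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params, best_acc

# The search is repeated for several split proportions, as in the text,
# and the overall best configuration is kept for the remaining batches.
fractions = [0.5, 0.6, 0.7, 0.8, 0.9]
```

The winning configuration found over the first batch is then reused when classifying the subsequent batches of the stream.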
Type                Accuracy
OPF                 82.32752648 ± 5.527916684
OPF-nomemory        66.340352 ± 6.205296248
OPF-fullmemory      81.51895655 ± 6.360053157
OPF-window3         66.63647852 ± 6.483513342
OPF-combined        68.55474907 ± 6.447446854
OPF-weighted        70.03032828 ± 6.163191174
OPF-major           81.3674181 ± 6.822249211
SVM                 78.7832119 ± 5.549011807
SVM-nomemory        66.69459997 ± 7.552454182
SVM-fullmemory      67.99529252 ± 7.904770131
SVM-window3         65.19059759 ± 7.751632536
SVM-combined        66.92436969 ± 7.48059647
SVM-weighted        67.21876483 ± 7.472223996
SVM-major           67.27513031 ± 7.29145594

Figure A.2 depicts the results concerning the Hyperplane dataset. The SVM versions seem to perform better than traditional SVM in batches #6 to #10, but they present a drop in performance in batch #11, whereas traditional SVM improves. SVM-major achieves interesting accuracy values in some batches. Table A.2 presents the mean accuracy and standard deviation concerning all batches.

Figure A.3 depicts the results concerning the SEA dataset. Among the SVM versions, the results seem to be more accurate than traditional SVM, with a few batches as exceptions. Table A.3 presents the mean accuracy and standard deviation concerning all batches.

Figures A.4 to A.6 show the results evaluated in Experiment 2. Figure A.4 depicts the results concerning the Covertype dataset in full-memory management. We can see that the SVM versions perform better than SVM-fullmemory. Table A.4 presents the mean accuracy and standard deviation concerning all batches in full-memory management.

Figure A.2: Experimental results concerning Hyperplane dataset with respect to Experiment 1.

Table A.2: Hyperplane dataset with respect to Experiment 1.
Type                Accuracy
OPF                 62.48453148 ± 7.664021067
OPF-nomemory        69.01541517 ± 7.265204095
OPF-fullmemory      60.77733197 ± 6.950052568
OPF-window3         63.79282938 ± 7.071207897
OPF-combined        66.90718755 ± 6.359478356
OPF-weighted        66.90718755 ± 6.359478356
OPF-major           67.042868 ± 7.594141989
SVM                 66.68400903 ± 10.72104972
SVM-nomemory        77.9653311 ± 11.55842702
SVM-fullmemory      67.393422 ± 10.28898787
SVM-window3         70.15282272 ± 11.27674476
SVM-combined        72.91148562 ± 10.0056331
SVM-weighted        72.91148562 ± 10.0056331
SVM-major           75.71327959 ± 11.22732262

Figure A.3: Experimental results concerning SEA dataset with respect to Experiment 1.

In the Hyperplane dataset (Figure A.5), the combined and weighted SVM versions obtained similar performances, being more accurate than traditional SVM except in three batches. Table A.5 presents the mean accuracy and standard deviation concerning all batches in no-memory management.

Table A.3: SEA dataset with respect to Experiment 1.

Type                Accuracy
OPF                 74.39154759 ± 1.346465467
OPF-nomemory        76.24827041 ± 2.128896094
OPF-fullmemory      74.69366883 ± 1.47469361
OPF-window3         75.83703666 ± 2.551433832
OPF-combined        78.6340661 ± 2.305279021
OPF-weighted        78.6340661 ± 2.305279021
OPF-major           76.34419907 ± 2.180538025
SVM                 81.50678338 ± 2.119882634
SVM-nomemory        84.387488 ± 3.026719665
SVM-fullmemory      82.1416731 ± 2.194377926
SVM-window3         83.80219638 ± 3.334783039
SVM-combined        84.00122831 ± 2.664180205
SVM-weighted        84.00122831 ± 2.664180205
SVM-major           84.5856531 ± 2.932968912

Figure A.4: Experimental results concerning Covertype dataset with respect to Experiment 2 in full-memory management.

Table A.4: Covertype dataset with respect to Experiment 2 in full-memory management.
Type                Accuracy
OPF                 82.32752648 ± 5.527916684
OPF-fullmemory      81.51895655 ± 6.360053157
OPF-fold1           77.05269845 ± 6.149721752
OPF-fold2           77.53515266 ± 6.216085779
OPF-fold3           77.32545021 ± 6.333944787
OPF-combined        78.4888871 ± 6.553147785
OPF-weighted        78.58643359 ± 6.601479619
OPF-major           77.65701845 ± 6.154886295
SVM                 78.7832119 ± 5.549011807
SVM-fullmemory      67.99529252 ± 7.904770131
SVM-fold1           75.4320561 ± 7.40372963
SVM-fold2           76.36477648 ± 7.629993757
SVM-fold3           69.56091117 ± 6.700559242
SVM-combined        75.80669193 ± 7.455094698
SVM-weighted        75.98013331 ± 7.575125596
SVM-major           76.57805514 ± 7.514583676

Figure A.5: Experimental results concerning Hyperplane dataset with respect to Experiment 2 in no-memory management.

Table A.5: Hyperplane dataset with respect to Experiment 2 in no-memory management.

Type                Accuracy
OPF                 62.48453148 ± 7.664021067
OPF-nomemory        69.01541517 ± 7.265204095
OPF-fold1           67.92163028 ± 6.590813201
OPF-fold2           68.15354062 ± 6.862876752
OPF-fold3           67.82526386 ± 7.16983859
OPF-combined        72.09896266 ± 8.563747093
OPF-weighted        72.09896266 ± 8.563747093
OPF-major           68.2630471 ± 7.031192091
SVM                 66.68400903 ± 10.72104972
SVM-nomemory        77.9653311 ± 11.55842702
SVM-fold1           77.74081683 ± 11.2355244
SVM-fold2           77.69870045 ± 11.24403346
SVM-fold3           77.86011476 ± 11.30176483
SVM-combined        78.04638724 ± 11.37294124
SVM-weighted        78.04638724 ± 11.37294124
SVM-major           77.6620691 ± 11.23552217

In the SEA dataset (Figure A.6), the SVM versions and traditional SVM performed similarly in various batches. The SVM versions seem to be more accurate than traditional SVM in batches #10 to #15, whereas traditional SVM was a little more accurate in a few batches. Table A.6 presents the mean accuracy and standard deviation concerning all batches in window-3 management.

Figures A.7 to A.9 show the results evaluated in Experiment 3. Figure A.7 presents the results over the Covertype dataset.
Figure A.6: Experimental results concerning SEA dataset with respect to Experiment 2 in window-3 management.

Table A.6: SEA dataset with respect to Experiment 2 in window-3 management.

Type                Accuracy
OPF                 74.39154759 ± 1.346465467
OPF-window3         75.83703666 ± 2.551433832
OPF-fold1           75.21760424 ± 2.51672435
OPF-fold2           75.60642824 ± 2.16436392
OPF-fold3           75.68674859 ± 2.52686664
OPF-combined        81.34387231 ± 2.938457534
OPF-weighted        81.34387231 ± 2.938457534
OPF-major           75.76676728 ± 2.795645255
SVM                 81.50678338 ± 2.119882634
SVM-window3         83.80219638 ± 3.334783039
SVM-fold1           83.19033838 ± 3.540041707
SVM-fold2           83.471199 ± 3.476069957
SVM-fold3           83.46943507 ± 3.163171141
SVM-combined        83.46486041 ± 3.393117611
SVM-weighted        83.46486041 ± 3.393117611
SVM-major           83.59102159 ± 3.371619002

Once again, traditional SVM obtained the best recognition rates among the SVM versions. Table A.7 presents the mean accuracy and standard deviation concerning all batches.

Figure A.7: Experimental results concerning Covertype dataset with respect to Experiment 3.

Table A.7: Covertype dataset with respect to Experiment 3.

Type                Accuracy
OPF                 82.32752648 ± 5.527916684
OPF-nomemory        66.340352 ± 6.205296248
OPF-window3         66.63647852 ± 6.483513342
OPF-combined        61.75935952 ± 6.122989272
OPF-weighted        63.36128134 ± 6.217485077
SVM                 78.7832119 ± 5.549011807
SVM-nomemory        66.69459997 ± 7.552454182
SVM-window3         65.19059759 ± 7.751632536
SVM-combined        61.29515555 ± 6.557138516
SVM-weighted        63.40883021 ± 6.754922019

Figure A.8 depicts the results concerning the Hyperplane dataset. The combined and weighted versions of SVM obtained similar performances, being sometimes more accurate than traditional SVM. Table A.8 presents the mean accuracy and standard deviation concerning all batches.

Figure A.8: Experimental results concerning Hyperplane dataset with respect to Experiment 3.

Table A.8: Hyperplane dataset with respect to Experiment 3.

Type                Accuracy
OPF                 62.48453148 ± 7.664021067
OPF-nomemory        69.01541517 ± 7.265204095
OPF-window3         63.79282938 ± 7.071207897
OPF-combined        66.09129634 ± 8.855442297
OPF-weighted        66.08478797 ± 8.848643462
SVM                 66.68400903 ± 10.72104972
SVM-nomemory        77.9653311 ± 11.55842702
SVM-window3         70.15282272 ± 11.27674476
SVM-combined        69.20823466 ± 13.07394588
SVM-weighted        69.29923652 ± 13.20099646

Figure A.9 depicts the results concerning the SEA dataset. The SVM versions and traditional SVM performed similarly in various batches. The SVM versions seem to be more accurate than traditional SVM in batches #10 to #15 and #26 to #30, whereas traditional SVM was a little more accurate in a few batches. Table A.9 presents the mean accuracy and standard deviation concerning all batches.

Figure A.9: Experimental results concerning SEA dataset with respect to Experiment 3.

Table A.9: SEA dataset with respect to Experiment 3.

Type                Accuracy
OPF                 74.39154759 ± 1.346465467
OPF-nomemory        76.24827041 ± 2.128896094
OPF-window3         75.83703666 ± 2.551433832
OPF-combined        80.88808886 ± 2.995032574
OPF-weighted        80.86339066 ± 3.020802532
SVM                 81.50678338 ± 2.119882634
SVM-nomemory        84.387488 ± 3.026719665
SVM-window3         83.80219638 ± 3.334783039
SVM-combined        83.77519379 ± 3.641911321
SVM-weighted        83.72836431 ± 3.619883153