Análise de escalabilidade de aplicações Hadoop/Mapreduce por meio de simulação
Rocha, Fabiano da Guia
MetadataMostrar registro completo
During the last years we have witnessed a significant growing in the amount of data processed in a daily basis by companies, universities, and other institutions. Many use cases report processing of data volumes of petabytes in thousands of cores by a single application. MapReduce is a programming model, and a framework for the execution of applications which manipulate large data volumes in machines composed of thousands of processors/cores. Currently, Hadoop is the most widely adopted free implementation of MapReduce. Although there are reports in the literature about the use of MapReduce applications on platforms with more than one hundred cores, the scalability is not stressed and much remain to be studied in this field. One of the main challenges in the scalability study of MapReduce applications is the large number of configuration parameters of Hadoop. There are reports in the literature that mention more than 190 configuration parameters, 25 of which are known to impact the application performance in a significant way. In this work we study the scalability of MapReduce applications running on Hadoop. Due to the limited number of processors/cores available, we adopted a combined approach involving both experimentation and simulation. The experimentation has been carried out in a local cluster of 32 nodes, and for the simulation we have used MRSG (MapReduce Over SimGrid). In a first set of experiments, we identify the most impacting parameters in the performance and scalability of the applications. Then, we present a method for calibrating the simulator. With the calibrated simulator, we evaluated the scalability of one well-optimized application on larger clusters, with up to 10 thousands of nodes.