Data preparation pipeline recommendation via meta-learning

Zagatti, Fernando Rezende

View/

Master's dissertation of Fernando Rezende Zagatti (939.5Kb)

Advisor's authorization letter (69.86Kb)

Date

2021-05-26

Author

Zagatti, Fernando Rezende

Abstract

Data preparation is an essential stage in the machine learning pipeline, aiming to convert noisy and disordered data into refined data compatible with the algorithms. However, data preparation is time-consuming and requires specialized knowledge. In this scenario, automating data preparation and decreasing the effort made by data scientists at this stage is a scientific challenge of great practical relevance. Each dataset has its particular characteristics and can be interpreted in different ways. Despite the relevance of this stage, current automated machine learning (AutoML) platforms either disregard data preparation or rely on simple hardcoded pipelines for it. To fill this gap, we present a meta-learning-based recommendation system for data preparation. Our system recommends five pipelines, ranked by their relevance, so it is useful for users with varied experience levels. Using the top recommendation to simulate an entirely automatic choice of data preparation pipeline, we demonstrate that our proposal improves the performance of an AutoML system that is otherwise unable to find a classification model due to the noisy data. In addition, our method's accuracy rates are similar to those achieved by a reinforcement-learning-based algorithm with the same goal, but our method is up to two orders of magnitude faster. Moreover, we demonstrate our method in a real-world application and evaluate its benefits and limitations in this scenario.

URI

https://repositorio.ufscar.br/handle/ufscar/14790

Collections

Theses and dissertations

The item has the following associated license files:

Creative Commons

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil