Application of Machine Learning Techniques to the Identification of Genetic Markers for Alzheimer's Disease
Abstract
A single nucleotide polymorphism (SNP) is a variation at a single position in the nucleotide sequence of DNA. As a form of genetic variation, SNPs are highly relevant to the study of human health. They can be used to predict individuals' responses to certain medications and to search for genes related to hereditary diseases within a family group, and they have also been associated with more complex diseases such as cardiovascular disease, diabetes, cancer, and Alzheimer's Disease (AD). Supervised machine learning makes it possible to study the relationship between SNPs and complex diseases, with each SNP serving as an input variable to the algorithms. The aim of this project was therefore to investigate the relationship between SNPs and AD using machine learning algorithms. For this purpose, datasets of individuals with their respective SNPs and diagnoses (Normal, Mild Cognitive Impairment, or Alzheimer's Disease) were used. The study presents an innovative approach to identifying genetic markers associated with AD that combines machine learning techniques and genomic analysis in four steps. The first step consists of data preprocessing and normalization. The second step applies Genome-Wide Association Studies (GWAS) to all datasets generated in the first step. The third step applies machine learning methods to the most promising dataset identified in the previous steps. The fourth step is a comparative analysis of the results obtained in the GWAS and machine learning stages. The results revealed a comprehensive set of SNPs associated with AD, including both SNPs previously reported in the scientific literature and promising new candidates. They also highlighted the importance of data handling during quality control, which had a significant impact on the results obtained.
The machine learning models used in this study each produced a distinct profile of most significant SNPs, which reflects the complexity and heterogeneity of AD. This variation in coefficients and feature importances across models underscores the need for an integrated and multifaceted approach to AD genetics research. In summary, this study demonstrates the potential of machine learning and genomic analysis for advancing knowledge about AD. The results provide new insights into the emerging field of AD genomics and open new perspectives for more effective therapeutic and diagnostic strategies. The findings represent a significant step towards a deeper understanding of AD genetics and its impact on human health.
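The abstract does not specify how the GWAS and machine learning results were compared, so the following is only a minimal sketch of the idea behind the second and fourth steps, under stated assumptions. It scores each SNP with a basic allelic chi-square test (one common single-SNP association test, not necessarily the one used in the study) and then checks which SNPs appear among the top-ranked markers of both the association ranking and a model-derived ranking. The SNP names, the toy genotypes, and the `model_ranking` list are all hypothetical.

```python
# Illustrative sketch only: toy data and hypothetical SNP names,
# not the pipeline used in the study.

def allelic_chi2(genotypes, is_case):
    """Chi-square statistic (1 degree of freedom) for a 2x2 allele-count table.

    genotypes: minor-allele count per individual (0, 1 or 2).
    is_case:   True for cases, False for controls, in the same order.
    """
    n_cases = sum(1 for flag in is_case if flag)
    n_controls = len(is_case) - n_cases
    # Each individual carries two alleles, so allele totals are 2 * group size.
    case_minor = sum(g for g, flag in zip(genotypes, is_case) if flag)
    ctrl_minor = sum(g for g, flag in zip(genotypes, is_case) if not flag)
    case_major = 2 * n_cases - case_minor
    ctrl_major = 2 * n_controls - ctrl_minor
    total = case_minor + case_major + ctrl_minor + ctrl_major
    denom = ((case_minor + case_major) * (ctrl_minor + ctrl_major)
             * (case_minor + ctrl_minor) * (case_major + ctrl_major))
    if denom == 0:
        return 0.0  # monomorphic SNP or an empty group: no evidence either way
    return total * (case_minor * ctrl_major - case_major * ctrl_minor) ** 2 / denom


def rank_snps(genotype_table, is_case):
    """Return SNP names sorted from strongest to weakest association."""
    scores = {snp: allelic_chi2(g, is_case) for snp, g in genotype_table.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_overlap(ranking_a, ranking_b, k):
    """SNPs that appear in the top k of both rankings."""
    return set(ranking_a[:k]) & set(ranking_b[:k])


# Toy cohort: three cases followed by three controls.
is_case = [True, True, True, False, False, False]
genotype_table = {
    "snp_A": [2, 2, 1, 0, 0, 1],  # minor allele enriched in cases
    "snp_B": [1, 1, 1, 1, 1, 1],  # identical in both groups
    "snp_C": [0, 1, 2, 1, 0, 1],  # weak difference
}

gwas_ranking = rank_snps(genotype_table, is_case)

# Hypothetical ranking, standing in for SNPs ordered by a model's
# coefficients or feature importances.
model_ranking = ["snp_A", "snp_B", "snp_C"]

shared = top_k_overlap(gwas_ranking, model_ranking, k=2)
print(gwas_ranking, shared)
```

In a real analysis the association statistics would normally come from a dedicated GWAS tool with proper quality control and multiple-testing correction; the overlap step is what stays the same regardless of how each ranking was produced.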