SNP selection using random forests
Abstract
Single Nucleotide Polymorphisms (SNPs) are single-base variations in the nucleotide sequence between different individuals or between homologous sequences within a single organism. A large part of all genetic variation occurs as SNPs. Many of these variations occur in plants, where they influence traits directly linked to the productivity of crops such as rice. Rice is one of the main foods in human nutrition: it is the staple food for more than half of the world population and is produced mostly by Asian countries, but also widely in Brazil. In addition to being the largest producer among Western countries, Brazil is also the largest per capita consumer of rice. Rice is part of the Genetic Improvement Program of the Brazilian Agricultural Research Corporation (Embrapa), which aims to improve rice crops so that they match the consumption preferences of the Brazilian market. Selecting the SNPs that are strongly related to the amylose content of rice is one of the open problems in Embrapa's Genetic Improvement Program. SNP selection can be modeled computationally with Machine Learning, a subarea of Artificial Intelligence, which makes the analysis faster and less costly. The objective of this research is therefore to develop a method capable of performing the SNP selection task: given a characteristic of an organism, the method must find the SNPs related to that characteristic. As a test case, the method is applied to the SNPs of the genomic content of different rice crops, in order to find out which SNPs have the greatest impact on their amylose content. The developed method proved to be efficient in solving the SNP selection problem. The analysis of its results highlighted an SNP that was validated experimentally by Embrapa as important for the amylose content.
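The abstract does not detail the method itself, so the following is only a minimal sketch of how SNP selection with a random forest could look, not the procedure developed in this work. It assumes scikit-learn and NumPy, genotypes coded as 0/1/2 allele counts, a quantitative trait handled as a regression target, and impurity-based feature importance as the ranking criterion. The function name `rank_snps`, the SNP labels, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_snps(genotypes, trait, snp_names, n_trees=200, seed=0):
    """Rank SNPs by random-forest impurity importance (illustrative only)."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(genotypes, trait)
    importances = forest.feature_importances_
    # Sort SNP indices from most to least important
    order = np.argsort(importances)[::-1]
    return [(snp_names[i], float(importances[i])) for i in order]


# Synthetic data, for illustration only: 200 samples x 50 SNPs coded 0/1/2
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 50))
snp_names = [f"snp_{i}" for i in range(50)]

# Hypothetical trait driven mainly by snp_7, plus Gaussian noise
trait = 3.0 * genotypes[:, 7] + rng.normal(0.0, 0.5, size=200)

ranking = rank_snps(genotypes, trait, snp_names)
top_snp, top_score = ranking[0]
print(top_snp, round(top_score, 3))
```

On real genotype data the ranking would normally be checked against held-out samples or a permutation-based importance measure, since impurity-based scores can be inflated by correlated SNPs.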