Machine Learning in Genome-Wide Association Studies

Machine learning applications in genetics and genomics | Credit: Nature Reviews Genetics


Genome-wide association studies (GWAS) are used to discover genetic variants that contribute to common human diseases. GWAS rose to prominence by identifying thousands of genetic loci linked to a wide range of diseases. Even so, the discovered loci have been most successful in explaining Mendelian diseases, while accounting for only a fraction of the heritability of complex diseases. Explaining complex diseases probably requires interactions among several genetic and environmental factors. These nonlinear, non-additive effects, known as epistasis, are poorly captured by the conventional one-variant-at-a-time tests used in GWAS. Instead, more sophisticated machine learning algorithms that can identify and characterize interactions across the genome are required.
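To make the epistasis point concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions: two hypothetical binary variants (`snp_a`, `snp_b`), a trait driven purely by their XOR-style interaction, and simple correlation standing in for a single-variant association test. None of these names or settings come from the papers discussed here.

```python
import numpy as np

# Toy data (assumed for illustration): carrier status for two variants.
rng = np.random.default_rng(42)
n_samples = 5000
snp_a = rng.integers(0, 2, size=n_samples)
snp_b = rng.integers(0, 2, size=n_samples)

# The trait depends only on the interaction: it is raised when exactly
# one of the two variants is present, plus Gaussian noise.
phenotype = (snp_a ^ snp_b) + rng.normal(0.0, 0.5, size=n_samples)


def abs_corr(x, y):
    """Absolute Pearson correlation, a stand-in for a one-variant test."""
    return abs(np.corrcoef(x, y)[0, 1])


marginal_a = abs_corr(snp_a, phenotype)       # variant A tested alone
marginal_b = abs_corr(snp_b, phenotype)       # variant B tested alone
joint = abs_corr(snp_a ^ snp_b, phenotype)    # interaction tested directly

print(f"A alone: {marginal_a:.3f}, B alone: {marginal_b:.3f}, A xor B: {joint:.3f}")
```

In this setup each variant looks close to uninformative when tested alone, while the interaction term is clearly associated with the trait. Real epistasis is rarely this clean, but the example shows why one-variant-at-a-time scans can miss interacting variants.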

This special issue features six papers that discuss how researchers are using machine learning techniques to uncover interacting genomic variants in GWAS.

Liu et al. developed a deep learning framework that uses convolutional neural networks to predict quantitative traits from single nucleotide polymorphisms (SNPs) and uses saliency maps to examine genotype contributions to the trait. The authors tested the proposed method on simulated and real-world soybean datasets. The findings indicated that deep learning can bypass the imputation of missing data and predict quantitative traits more accurately than widely used existing techniques. The authors state that their method can identify important SNPs and SNP combinations in GWAS data more quickly and effectively.

Another paper proposed circLGB, a machine learning-based method to identify and classify circular RNAs (circRNAs). The researchers combined three novel sequence-derived features with two well-known features; the feature set included adenosine-to-inosine (A-to-I) deamination, A-to-I density, and internal ribosome entry site. circLGB uses a LightGBM classifier with feature selection to classify circRNAs. To investigate how and why some circRNAs have microRNA binding sites, and to find patterns in the number of those sites, the authors used a further machine learning method, circMRT, to uncover regulatory patterns. circMRT assembled several types of information into feature sets, including sequence-based, graph, genomic context, and regulatory information. The results indicated that the proposed algorithms outperform existing approaches.

In a review paper, Nicholls et al. explored how well ML approaches perform in genetic risk analysis, considering three aspects: the particular methods, the input features, and the performance of the resulting models.
To further the GWAS endeavor, the authors focused on the prioritization of complex disease-associated loci and considered how ML could help reach the GWAS end-game, with its wide-ranging translational impact.

ENhanced Permutation testing through multiple Pruning (ENPP; Leem et al., 2015) eliminates unnecessary information and features during permutation testing. According to the authors' simulation study, ENPP can eliminate approximately 50% of the features after a single permutation round, and 98% of the features by the 100th round. The unpruned permutation technique used over 80% of the compute time, whereas only 7.4% was needed with the new method. The authors also applied the technique to a real data set of 300 K SNPs to identify associations with a non-normally distributed phenotype.

To identify complex multivariate effects in GWAS, Arabnejad et al. created Nearest-neighbor Projected-Distance Regression (NPDR), a machine learning method. Its regression formalism provides statistical significance testing with control for multiple testing while remaining computationally efficient. It also includes a framework for handling population structure within the regression, which the authors applied to GWAS data from systemic lupus erythematosus (SLE). Additionally, the authors evaluated NPDR on various datasets, including unbalanced datasets and outcomes with different distributions. Epistatic effects, as well as other factors, were shown to contribute to SLE, a complex disease.

In another study, the authors examined whether 300 K unique stomach tissue-related SNPs are associated with gastric cancer (GC) risk. They used Sherlock integrative analysis together with a sequence kernel association combination test to examine the cumulative impact of expression-associated SNPs (eSNPs) in a gene-based analysis. At the SNP level, they discovered two new GC risk variants.
At the gene level, two previously unreported genes were identified as novel GC susceptibility genes. Gene-based analyses showed that both genes are expressed at substantially higher levels in tumor tissue than in normal tissue. Several genes co-expressed with the new genes in normal stomach tissue were enriched in many cancer-related pathways.
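
Returning to the permutation-pruning idea summarized for ENPP earlier, the general principle can be sketched in a few lines. This is not the published ENPP procedure. It is a simplified illustration that assumes a plain covariance-style score per feature and one particular pruning rule: drop a feature as soon as its permutation p-value can no longer fall below the threshold. The function name `pruned_permutation_test` and all parameters and simulated data are hypothetical.

```python
import numpy as np


def pruned_permutation_test(X, y, n_perm=200, alpha=0.05, seed=0):
    """Per-feature permutation test that drops hopeless features early.

    Hypothetical helper for illustration; not the published ENPP method.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    X_centered = X - X.mean(axis=0)

    def statistic(y_vec, cols):
        # Absolute covariance-like score between selected features and y.
        return np.abs(X_centered[:, cols].T @ (y_vec - y_vec.mean()))

    observed = statistic(y, np.arange(n_features))
    exceed = np.zeros(n_features, dtype=int)  # permuted score >= observed
    active = np.arange(n_features)            # features still being tested
    evaluations = 0                           # feature-by-round score count

    for _ in range(n_perm):
        if active.size == 0:
            break
        permuted = statistic(rng.permutation(y), active)
        exceed[active] += permuted >= observed[active]
        evaluations += active.size
        # The final p-value is (1 + exceed) / (1 + n_perm) and exceed can
        # only grow, so once this bound is above alpha the feature can
        # never become significant and is pruned.
        still_possible = (1 + exceed[active]) <= alpha * (n_perm + 1)
        active = active[still_possible]

    return {"survivors": active, "exceed": exceed, "evaluations": evaluations}


# Simulated genotypes (0/1/2 coding) where only feature 0 affects the trait.
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(300, 50)).astype(float)
trait = genotypes[:, 0] + rng.normal(0.0, 1.0, size=300)

result = pruned_permutation_test(genotypes, trait)
print("surviving features:", result["survivors"])
print("score evaluations:", result["evaluations"], "of", 50 * 200, "unpruned")
```

Because most null features are discarded after a handful of rounds, the number of score evaluations falls well below what a full unpruned run would need, which is the kind of saving the ENPP summary describes.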
