Kernel Methods for large-scale Genomics Data Analysis

1 Kernel Methods for large-scale Genomics Data Analysis
Wang et al.

2 Background
Machine learning (ML) has shown promise as a tool for dealing with the challenges of data growth in genomics. ML methods can be used to learn how a very large number of genetic variants (SNPs) are associated with complex phenotypes (diseases, disorders, etc.). This study highlights the potential roles that ML, particularly kernel methods, will have in modern genomics.

3 Kernels for Genomic Data
Kernel methods are based on mathematical functions that smooth data. They allow us to use linear classifiers to solve non-linear problems by transforming data that are not linearly separable into a space where a linear separator can be found.

4 Kernels for Genomic Data contd.
Some advantages of kernel methods over traditional regression methods are: allowance for high-dimensional genomic data; allowance for nonlinear relations between outcomes and the genomic data; and flexibility to include structural information.

5 Kernels for Genomic Data contd.
A key component of a kernel method is the kernel function. The function converts the information for a pair of subjects into a quantitative measure of their similarity with respect to the genetic data. For GWAS, the weighted linear kernel is popular.


7 Kernels for Genomic Data contd.
For the weighted kernel function, SNPs are coded as G, which takes the value 0, 1, or 2 according to the number of copies of the minor allele, essentially encoding whether the subject is homozygous or heterozygous at that SNP. For q SNPs, the weighted kernel function for subjects i and j can be expressed as: $K_{ij} = \sum_{k=1}^{q} w_k G_{ik} G_{jk}$

8 Kernels for Genomic Data contd.
$w_k$ weights each SNP and is based on the standard error of the estimated minor allele frequency (MAF) $p_k$: $w_k = 1/\sqrt{p_k(1 - p_k)}$. Since common alleles can be carried by many subjects by chance alone, giving greater weight to shared rare variants can strengthen the relationship between the kernel matrix and the phenotype. Other types of weights can be used, and higher-order polynomial functions can be used to capture higher-order genetic interactions (interactions involving 3 or more markers that contribute to complex traits such as diseases). Ultimately, kernels other than the weighted linear kernel can also be used: define the kernel according to whatever notion of “similarity” fits your application.
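The weighted linear kernel above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the genotype matrix `G` is made-up toy data, the helper names `maf` and `weighted_linear_kernel` are hypothetical, and the weights are assumed to take the inverse-square-root form $w_k = 1/\sqrt{p_k(1 - p_k)}$; any other weighting could be passed in instead.

```python
from math import sqrt

def maf(genotypes):
    """Estimate each SNP's allele frequency from 0/1/2 minor-allele counts."""
    n = len(genotypes)       # number of subjects
    q = len(genotypes[0])    # number of SNPs
    # Each subject carries 2 alleles per SNP, hence the 2 * n denominator.
    return [sum(row[k] for row in genotypes) / (2 * n) for k in range(q)]

def weighted_linear_kernel(genotypes, weights):
    """K[i][j] = sum over k of w_k * G[i][k] * G[j][k]."""
    n = len(genotypes)
    return [
        [
            sum(w * gi * gj
                for w, gi, gj in zip(weights, genotypes[i], genotypes[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Toy data (assumption): 4 subjects x 3 SNPs, coded as minor-allele counts.
G = [
    [0, 1, 2],
    [1, 1, 0],
    [0, 0, 1],
    [1, 2, 1],
]

p = maf(G)
# Assumed weight form; monomorphic SNPs (p == 0 or p == 1) would need to be
# removed first, since they make the denominator zero.
w = [1 / sqrt(pk * (1 - pk)) for pk in p]
K = weighted_linear_kernel(G, w)

for row in K:
    print([round(v, 3) for v in row])
```

The rarer first SNP (estimated frequency 0.25) receives a larger weight than the two SNPs at frequency 0.5, so subjects who share it contribute more to the similarity score.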

9 Building Predictive Models
The goal is to predict phenotypes for different individuals from known genomic data (supervised ML). A common practice is to build the prediction model from the top-ranked markers from GWAS plus a few experimentally known susceptibility markers (“cherry picking”). This strategy performs poorly in most cases because the top genetic variants explain only a small amount of phenotypic variation and genetic studies suffer from low replicability. However, this is how researchers running GWAS have worked around the scalability issues.

10 Building Predictive Models contd.
Another strategy is to train the model using all the available markers as well as all other available information, such as epigenetic markers (markers that characterize phenotypes that are not dictated by DNA). In this paper, a kernel method solution is shown that applies this strategy efficiently by performing feature selection/weighting and prediction in a unified framework.

11 Building Predictive Models contd.
For disease risk prediction, support vector machines (SVMs) may be used. SVMs are a well-developed method that seeks the optimal hyperplane separating the data into 2 classes while maximizing the margin. How can an SVM be used when the data are not linearly separable? See the next slide.

12 Kernel Trick
Nonlinear classification is attained by using the kernel trick: mapping the non-linearly separable data set into a higher-dimensional space where we can find a hyperplane that separates the samples.
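A small worked example may make the kernel trick concrete. The sketch below uses the degree-2 polynomial kernel as an illustrative choice (it is not a kernel singled out by the slides), and the names `phi`, `poly2_kernel` and `radius_sq_from_features` are hypothetical. It shows that the kernel value equals an inner product in an explicit higher-dimensional feature space, and that points separated only by a circle in 2-D become separable by a linear threshold in that space.

```python
from math import sqrt

def dot(a, b):
    """Ordinary inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Same inner product, computed without ever building phi(x)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)    # kernel evaluated in the original 2-D space
print(explicit, implicit)        # the two agree up to rounding error

# Toy classes (assumption): inside vs. outside the unit circle. No straight
# line separates them in 2-D, but in feature space the linear function
# f[0] + f[2] (the squared radius) splits them at the threshold 1.
inner = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, 1.5)]

def radius_sq_from_features(f):
    return f[0] + f[2]

print([radius_sq_from_features(phi(pt)) < 1.0 for pt in inner])
print([radius_sq_from_features(phi(pt)) > 1.0 for pt in outer])
```

In practice the feature map is never computed explicitly; a kernel classifier only needs the pairwise kernel values.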



15 Building Predictive Models contd.
SVMs are advantageous for high-dimensional genomic data: they can deal with all markers without any pre-pruning or selection, and they account for complex relationships among markers. However, SVMs are black-box approaches that only provide a classification, and it is difficult to extract more information from them. Their application in genomic prediction is very limited despite promising results from various studies.

16 Building Predictive Models contd.
Another potential classifier is kernel logistic regression (KLR). Naturally, it is a “kernelized” version of logistic regression that offers more desirable features than SVMs. KLR offers a natural estimate of probability and adapts to other probabilistic approaches. It is also easy to extend to multiclass prediction. The loss functions of KLR (logistic loss) and SVM (hinge loss) are actually very similar, and the methods have similar expected performance; the significant differences lie in their applications. However, the original KLR does not scale well to large data sets. New fast and sparsity-driven versions have been developed but have not yet been applied to genomic prediction.
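To illustrate how close the two classifiers are, the sketch below compares the standard hinge loss minimized by the SVM with the standard logistic loss minimized by KLR, both written as functions of the margin y·f(x) with labels y in {-1, +1}. The function names are hypothetical, and `klr_probability` shows the probability estimate that the logistic model provides and the SVM does not.

```python
from math import exp, log

def hinge_loss(margin):
    """SVM loss: zero once the margin y * f(x) reaches 1."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """KLR loss (negative log-likelihood): small but never exactly zero."""
    return log(1.0 + exp(-margin))

def klr_probability(f):
    """KLR estimate of P(y = +1 | x) from the decision value f(x)."""
    return 1.0 / (1.0 + exp(-f))

# Both losses penalize misclassified points (negative margin) roughly
# linearly and flatten out for confidently correct points.
for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  "
          f"logistic={logistic_loss(m):.3f}")

print(klr_probability(0.0))  # a point on the decision boundary
```

Because the hinge loss is exactly zero beyond the margin, the SVM solution depends only on the support vectors, whereas the logistic loss gives every sample some influence; this is one reason the original KLR is less sparse.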

17 Building Predictive Models contd.
Many strategies can be adopted to improve whole-genome risk prediction. One strategy is to exploit the block structure underlying genomic data. Using kernel-based methods does not necessarily improve prediction drastically, because performance in real data analysis is always limited by sample size and by the information content embedded in the data.

18 The genomic data space can be seen as 3 layers: the original data space (X), the transformed feature space created by the kernel functions, and finally the reduced kernel space created by kernel approximation. Many of the methods in the paper efficiently explore and fit models in the H space (kernel counterparts of many well-known ML methods). The figure shows the types of ML methods applicable in each data space. It also shows that both the genomic data and the underlying structure are taken as input.

19 Multiple Kernel Learning (MKL)
Instead of selecting a single fixed kernel, multiple kernel learning (MKL) uses multiple candidate kernels to map the data into the feature space. MKL achieves better performance by finding optimal weights for each base kernel.

20 Multiple Kernel Learning (MKL) contd.
MKL seeks to build a composite kernel as a linear combination of different kernels, with model complexity controlled by regularization. Applications of MKL to genomic data are currently limited but will increase alongside the growth of large-scale genomics data.
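The composite kernel can be sketched as a weighted sum of precomputed kernel matrices. This is only the combination step: the weights below are fixed by hand for illustration, whereas MKL would learn them together with the classifier under regularization. The helper `combine_kernels` and the toy matrices `K_snp` and `K_expr` are hypothetical. Weights are restricted to be nonnegative, since a nonnegative weighted sum of valid kernels is again a valid kernel.

```python
def combine_kernels(kernels, weights):
    """Return sum_m weights[m] * kernels[m] for same-sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative to keep a valid kernel")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy base kernels (assumption) for the same 2 subjects, e.g. one built from
# SNP data and one from another data type.
K_snp = [[2.0, 1.0],
         [1.0, 2.0]]
K_expr = [[1.0, 0.5],
          [0.5, 1.0]]

# Hand-picked weights (assumption); MKL would estimate these from the data.
K = combine_kernels([K_snp, K_expr], [0.7, 0.3])
print(K)
```

The combined matrix can then be passed to any kernel method that accepts a precomputed kernel.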

21 Genomic Data Fusion
A closely related concept to MKL is kernel-based data fusion. Both data fusion and MKL are facilitated by the closure property of kernels (a sum of kernels, or a weighted sum with nonnegative weights, is another valid kernel). Kernel fusion methods allow data of different types (gene expression, DNA methylation, CNV, etc.) and structures to be integrated in function prediction: kernel matrices generated from the different data sources are combined into one global kernel. These methods, as well as other ML strategies, provide novel tools for gene function prediction and annotation. Data fusion has been defined here in a broad sense, but other studies have outlined its applications in bioinformatics settings (with unsupervised as well as supervised methodology).

22 Structured Association Mapping
Structured association mapping leverages structural information, unlike traditional test-statistics-based or PCA-based methods ( 𝑆 2 𝑀 2 𝑅). It uses the various kinds of structural information present in the genome (phenome and transcriptome) to improve the accuracy of identifying causal variants. Refer to the kernel diagram again.

23 Structured Association Mapping contd.
An important source of genome structural information is genome annotations (known binding sites, exon regions, etc.). These data can be treated as prior knowledge about SNPs and used to search for disease susceptibility markers. For example, SNPs in highly conserved regions are more likely to have true associations, since conserved regions are functionally important. How best to use this prior knowledge will become increasingly important as genomic annotations improve in quality and quantity.

24 Discussion
Kernel machine learning methods offer potential tools for large-scale and high-dimensional data analysis, for genomics in particular. Kernel ML methods can be integrated with classical ML techniques. Future work involves improving scalability with respect to sample size, dimensionality and data heterogeneity. The kernel trick allows efficient search in the higher-dimensional space; however, a main limitation of kernel methods is the high cost of learning, which is at least quadratic in the number of samples. With genomic data growing, this calls for additional research into approximation methods. By data heterogeneity, I mean that it is difficult to predefine an optimal kernel function for a specific application given the complex data structures and types in genomics, which is why MKL and ensemble-type learning schemes deserve more consideration.

25 Reference
Wang, Xuefeng, Eric P. Xing, and Daniel J. Schaid. “Kernel Methods for Large-Scale Genomic Data Analysis.” Briefings in Bioinformatics 16, no. 2 (March 2015): 183–92.

