Workshop on Methods for Genomic Selection (El Batán, July 15, 2013) Paulino Pérez & Gustavo de los Campos
Objectives To go over, in a very short session, a set of examples illustrating how to implement various types of genome-enabled prediction methods using the BGLR package. BGLR is a recently developed package that implements various types of Bayesian parametric and semi-parametric methods. The focus will be on examples; theory and a deeper treatment will be offered in a short course (last week of September).
Outline
1. Brief introduction to whole-genome regression & roadmap.
2. Ridge Regression and the Genomic BLUP (G-BLUP).
3. Bayesian Methods ('The Bayesian Alphabet').
4. Kernel Regression.
Classical Quantitative Genetic Model: y_i = g_i + e_i (Phenotype = Genetic Value + Environmental effect). Our error terms will involve both true environmental effects and approximation errors emerging from model misspecification and from imperfect LD between markers and QTL ('error in predictor variables').
The two most important challenges. Complexity: how do we incorporate in our models the complexity of a genetic mechanism that may involve interactions between alleles at multiple genes (non-linearity) as well as interactions with environmental conditions? Coping with the curse of dimensionality: in the models we will consider, the number of unknowns (e.g., marker effects) can vastly exceed the sample size. This induces high sampling variance of estimates and, consequently, large MSE. How do we confront this?
Confronting Complexity: Elements of Model Specification
- How many markers? Which markers?
- What type of interactions? Dominance? Epistasis (type, order)?
- What about non-parametric approaches?
Confronting the ‘Curse of Dimensionality’ In the regressions we will consider, the number of parameters vastly exceeds the number of data points. In this context, standard estimation procedures (OLS, ML) cannot be used (often the solution is not unique, and when it is, estimates have large sampling variance). Therefore, in all cases we will consider regularized regressions, which involve shrinkage of estimates, variable selection, or a combination of both.
The Bias-Variance Tradeoff [Figure: sampling distributions of estimates illustrating the tradeoff between variance and squared bias; MSE = variance + squared bias.]
Roadmap
1. Linear methods
   - Effects of shrinkage: a case study based on Ridge Regression.
   - Genomic Best Linear Unbiased Predictor (G-BLUP).
   - Methods for really large p (e.g., 1 million markers).
   - The Bayesian Alphabet (a collection of methods that perform different types of shrinkage of estimates).
2. Reproducing Kernel Hilbert Spaces Regressions (RKHS)
   - Choice of bandwidth parameter.
   - Kernel Averaging.
1. Parametric Methods
Whole-Genome Regression Methods [1] Parametric methods fall into two groups:
- Penalized: Ridge Regression (shrinkage); LASSO (shrinkage & selection); Elastic Net.
- Bayesian: Bayesian Ridge Regression (shrinkage); Bayes A; Bayes B/C (selection & shrinkage); Bayesian LASSO.
[1]: Meuwissen, Hayes & Goddard (2001)
2. Ridge Regression & The Genomic BLUP (G-BLUP)
Penalized Regressions OLS maximizes goodness of fit to the training data (min RSS, equivalent to maximizing R²). Problem: when p is large relative to n, estimates have large sampling variance and, consequently, large mean squared error. Penalized regressions instead minimize RSS(β) + λ J(β), where λ is the regularization parameter and J(β) is a penalty on model complexity.
Commonly Used Penalties (Bridge Regression): J(β) = Σ_j |β_j|^γ; γ = 2 gives Ridge Regression and γ = 1 gives the LASSO.
Ridge Regression: β̂ = argmin { Σ_i (y_i − x_i'β)² + λ Σ_j β_j² }, with λ the regularization parameter and Σ_j β_j² the penalty on model complexity.
Example 1. How does λ affect: shrinkage of estimates; goodness of fit (e.g., residual sum of squares); model complexity (e.g., DF); prediction accuracy? A minimal R sketch of this kind of experiment follows.
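A minimal R sketch (simulated data; this is not the workshop script) of the experiment in Example 1: it computes ridge estimates over a grid of λ and reports the effective degrees of freedom and the residual sum of squares in training and testing sets.

# Simulated-data sketch: effect of lambda in ridge regression.
set.seed(123)
n <- 150; p <- 300
X <- matrix(rnorm(n * p), ncol = p)
b <- c(rnorm(20), rep(0, p - 20))                # only 20 markers have effects
y <- as.vector(X %*% b + rnorm(n))

trn <- 1:100; tst <- 101:n                       # training / testing split
XtX <- crossprod(X[trn, ])
Xty <- crossprod(X[trn, ], y[trn])
d <- svd(X[trn, ], nu = 0, nv = 0)$d             # singular values, used for DF

for (lambda in c(1, 10, 100, 1000)) {
  bHat <- solve(XtX + diag(lambda, p), Xty)      # shrunken estimates
  DF <- sum(d^2 / (d^2 + lambda))                # effective degrees of freedom
  rssTRN <- sum((y[trn] - X[trn, ] %*% bHat)^2)  # fit to training data
  rssTST <- sum((y[tst] - X[tst, ] %*% bHat)^2)  # fit to testing data
  cat(sprintf('lambda=%6.0f  DF=%6.1f  RSS(trn)=%8.1f  RSS(tst)=%8.1f\n',
              lambda, DF, rssTRN, rssTST))
}

As λ grows, DF and fit to the training data decrease, while fit to the testing data typically improves up to a point and then degrades.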
Results (Example 1)
Results (DF)
Results (estimates)
Results (shrinkage of estimates with RR)
Results (fit to training data)
Results (fit to testing data)
Ridge Regression & G-BLUP. Ridge regression of phenotypes on markers is equivalent to G-BLUP with a genomic relationship matrix proportional to XX' and λ = σ²_e/σ²_β.
Example 1.
Computation of Genomic Relationship Matrix with large numbers of markers
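A hedged sketch of one way to compute G when p is very large: accumulate G = WW'/p over column blocks, so all markers never need to be in memory at once. Here X is simulated in memory for illustration; with, say, 1 million markers, each block would be read from disk instead.

# Build G = W W' / p block-wise, W = centered and scaled marker matrix.
set.seed(1)
n <- 200; p <- 5000; blockSize <- 1000
X <- matrix(rbinom(n * p, size = 2, prob = 0.3), ncol = p)  # 0/1/2 genotypes

G <- matrix(0, n, n)
for (start in seq(1, p, by = blockSize)) {
  cols <- start:min(start + blockSize - 1, p)
  W <- scale(X[, cols])        # center and scale the current block
  W[is.na(W)] <- 0             # guard: monomorphic markers give NaN
  G <- G + tcrossprod(W)       # accumulate W_k W_k'
}
G <- G / p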
3. The Bayesian Alphabet
Penalized and Bayesian Regressions
- In penalized regressions, shrinkage is induced by adding to the objective function a penalty on model complexity; the type of shrinkage induced depends on the form of the penalty.
- In Bayesian regressions, shrinkage is induced by the prior assigned to marker effects; the type of shrinkage depends on the form of the prior.
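For the Gaussian case this correspondence is exact; a standard one-line derivation (not from the slides): with y_i = x_i'β + e_i, e_i ~ N(0, σ²_e) and β_j ~ N(0, σ²_β),

-2 log p(β | y) ∝ Σ_i (y_i − x_i'β)² + (σ²_e/σ²_β) Σ_j β_j²,

so the posterior mode is the ridge regression estimate with λ = σ²_e/σ²_β.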
Commonly Used Penalties (Bridge Regression, recap): J(β) = Σ_j |β_j|^γ. Each choice of penalty has a Bayesian counterpart: a prior density on marker effects of the form p(β_j) ∝ exp(−λ|β_j|^γ).
Bayesian Regression Model for Genomic Selection: y_i = μ + Σ_j x_ij β_j + e_i, with e_i ~ N(0, σ²_e); the members of the 'alphabet' differ in the prior assigned to the marker effects β_j.
A grouping of priors: Gaussian (Bayesian Ridge Regression); thick-tailed (scaled-t in BayesA, double exponential in the Bayesian LASSO); and mixtures with a point mass at zero (BayesB, BayesC). See the BGLR sketch below.
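A minimal BGLR sketch fitting the five models compared in the results below; the simulated data and MCMC settings are illustrative (the model labels are BGLR's own):

library(BGLR)
set.seed(456)
n <- 500; p <- 1000
X <- matrix(rbinom(n * p, size = 2, prob = 0.5), ncol = p)
b <- rep(0, p); b[sample(p, 10)] <- rnorm(10)   # 10 'QTL'
y <- as.vector(X %*% b + rnorm(n))

fits <- list()
for (m in c('BRR', 'BayesA', 'BayesB', 'BayesC', 'BL')) {
  ETA <- list(list(X = X, model = m))
  fits[[m]] <- BGLR(y = y, ETA = ETA, nIter = 6000, burnIn = 1000,
                    saveAt = paste0(m, '_'), verbose = FALSE)
}
# Estimated marker effects of, e.g., BayesA: fits[['BayesA']]$ETA[[1]]$b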
Results
Average Prediction Squared Error of Effects

Model   Markers        'QTL'
BRR     1.475324e-05   0.03782839
BA      1.388743e-05   0.03586657
BL      1.513329e-05   0.03705841
BC      4.641837e-05   0.02864067
BB      1.834702e-05   0.03374704
Estimated Marker Effects: BRR
Estimated Marker Effects: BayesA
Estimated Marker Effects: BayesC
Estimated Marker Effects: BayesA vs. BRR
Estimated Marker Effects: BayesA vs. BayesC
Estimated Marker Effects: BayesA vs. BL
Prediction Accuracy of realized genetic values by model
4. Kernel Regression
Framework. Phenotype = Genetic Value + Model Residual: y_i = g_i + e_i.
- Linear models: g_i = x_i'β (Ridge Regression / LASSO; Bayes A, Bayes B, Bayesian LASSO, …).
- Semi-parametric models: g_i = f(x_i) (Reproducing Kernel Hilbert Spaces Regression; Neural Networks, …).
RKHS Regressions (Background). Uses: scatter-plot smoothing (smoothing splines) [1]; spatial smoothing ('kriging') [2]; classification problems (support vector machines) [3]; the animal model; … Regression setting: y_i = f(x_i) + e_i, where the input x_i can be of any nature and f is an unknown function. [1] Wahba (1990) Spline Models for Observational Data. [2] Cressie, N. (1993) Statistics for Spatial Data. [3] Vapnik, V. (1998) Statistical Learning Theory.
RKHS Regressions (Background). Non-parametric representation of functions: f(x_i) = Σ_j α_j K(x_i, x_j). The reproducing kernel K(·,·) must be positive (semi) definite, Σ_i Σ_j α_i α_j K(x_i, x_j) ≥ 0; it defines a correlation function and a RKHS of real-valued functions [1]. [1] Aronszajn, N. (1950) Theory of Reproducing Kernels.
Functions as Gaussian processes: f = (f(x_1), …, f(x_n))' ~ N(0, σ²_f K). Setting K = A (the pedigree-based numerator relationship matrix) yields the animal model [1]. [1] de los Campos, Gianola and Rosa (2008) Journal of Animal Science.
RKHS Regression in BGLR [1]

ETA <- list(list(K = K, model = 'RKHS'))
fm <- BGLR(y = y, ETA = ETA, nIter = ...)

[1]: the algorithm is described in de los Campos et al. (2010) Genetics Research.
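A runnable sketch using the wheat dataset shipped with BGLR (objects wheat.X and wheat.Y); the bandwidth h is illustrative:

library(BGLR)
data(wheat)                # loads wheat.X (markers) and wheat.Y (phenotypes)
y <- wheat.Y[, 1]
X <- scale(wheat.X)

D <- as.matrix(dist(X))^2  # squared Euclidean distances between lines
D <- D / mean(D)           # scale distances so h is unit-free
h <- 0.5                   # illustrative bandwidth
K <- exp(-h * D)           # Gaussian kernel

ETA <- list(list(K = K, model = 'RKHS'))
fm <- BGLR(y = y, ETA = ETA, nIter = 6000, burnIn = 1000, verbose = FALSE)
cor(y, fm$yHat)            # fit to training data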
Choosing the RK based on predictive ability. Strategies: grid of values of θ + CV; fully Bayesian: assign a prior to θ (computationally demanding); Kernel Averaging [1]. [1] de los Campos et al. (2010) WCGALP & Genetics Research.
Histograms of the off-diagonal entries of each of the three kernels (K1, K2, K3) used in the RKHS models for the wheat dataset.
How to Choose the Reproducing Kernel? [1] Pedigree models: K = A. Genomic models: marker-based kinship or a model-derived kernel. Predictive approach: explore a wide variety of kernels => cross-validation => Bayesian methods. [1] Shawe-Taylor and Cristianini (2004)
Example 2
Example 3: Kernel Averaging
Kernel Averaging. Strategies: grid of values of θ + CV; fully Bayesian: assign a prior to θ (computationally demanding); Kernel Averaging [1]. [1] de los Campos et al. (2010) Genetics Research.
Kernel Averaging: model the genetic values as a sum of components, g = Σ_k f_k with f_k ~ N(0, σ²_k K_k); the kernel-specific variances σ²_k act as data-driven weights on the candidate kernels K_k. A BGLR sketch follows.
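In BGLR, Kernel Averaging amounts to including one RKHS term per candidate kernel, each with its own variance; a sketch reusing y and the scaled distance matrix D from the previous example (the bandwidth grid is illustrative):

hGrid <- c(0.1, 0.5, 2.5)                  # illustrative bandwidths
ETA <- lapply(hGrid, function(h) list(K = exp(-h * D), model = 'RKHS'))
fmKA <- BGLR(y = y, ETA = ETA, nIter = 12000, burnIn = 2000, verbose = FALSE)

# Posterior means of the kernel-specific variances; their relative sizes
# indicate how much weight each kernel receives.
sapply(fmKA$ETA, function(term) term$varU)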
Example 4 (100th basis function)
Example 4 (100th basis function, h = …)
Example 4 (KA: trace plot residual variance)
Example 4 (KA: trace plot kernel-variances)
Example 4 (KA: prediction accuracy)