I. Statistical Methods for Genome-Enabled Prediction of Complex Traits

OUTLINE
- The challenges of predicting complex traits
- Ordinary least squares (OLS) estimates
- Curse of dimensionality (biased regression – ridge regression)

Genomic Prediction
- Many important traits and diseases are moderately to highly heritable, suggesting that these traits could be predicted from knowledge of an individual's genotype.
- Modern genotyping and sequencing methods provide a very detailed description of genomes, even at the sequence level.
- In principle, we should be able to use genotypic information for accurate prediction of complex traits and diseases.

The Task: [slide diagram: Genotypes and Phenotypes feed a prediction model that produces Predictions used for Decisions; Phenotype = Genetic value + Model residual]

Confronting Complexity
- How many markers?
- Which markers?
- What type of interactions?
  - Dominance
  - Epistasis (type, order)

THE BASIC GENETIC MODEL
The standard linear genetic model considers that the phenotypic response of the i-th individual ($y_i$) is explained by a factor common to all individuals ($\mu$), a genetic factor specific to that individual ($g_i$), and a residual ($\varepsilon_i$) comprising all other non-genetic factors, among others the environmental effects (temporal or spatial) and the effects described by the experimental design. The linear genetic model for n genotypes (i = 1, 2, ..., n) is then represented as $y_i = \mu + g_i + \varepsilon_i$. In this standard linear genetic model, the genetic factor can be described by using a summation of molecular marker effects or by using pedigree. Meuwissen et al. (2001) were the first to propose an explicit regression of phenotypes on the marker genotypes using the simple parametric regression model $g_i = \sum_{j=1}^{p} x_{ij}\beta_j$, where $\beta_j$ is the regression of $y_i$ on the j-th marker covariate (j = 1, 2, ..., p) and $x_{ij}$ is the number of copies of the bi-allelic marker (coded 0, 1, 2 or -1, 0, 1). In matrix notation, this can be represented as $\mathbf{y} = \mu\mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.
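
A minimal simulation sketch of this model (my own illustration, not from the slides; the sample sizes, allele frequencies, effect sizes, and the roughly 50% heritability are all arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, p = 200, 1000                                   # individuals, markers (p >> n is typical)
allele_freq = rng.uniform(0.1, 0.9, p)
X = rng.binomial(2, allele_freq, size=(n, p)).astype(float)  # genotype codes 0/1/2

mu = 100.0
beta = rng.normal(0.0, 0.05, p)                    # marker effects
g = X @ beta                                       # genetic values g_i = sum_j x_ij * beta_j
e = rng.normal(0.0, np.sqrt(np.var(g)), n)         # residuals sized for ~50% heritability
y = mu + g + e                                     # phenotypes y_i = mu + g_i + e_i
```

The same simulated X and y are reused in the sketches that follow.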

Ordinary Least Squares (OLS) Estimates
Consider the following model: $y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i$, where $y_i$ is the phenotype of the i-th individual, $\mu$ is an effect common to all individuals (an "intercept"), $x_{ij}$ are covariates (e.g., marker genotypes), $\beta_j$ is the effect of the j-th covariate, and $\varepsilon_i$ is a model residual. In matrix notation the model is expressed as $\mathbf{y} = \mu\mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.

OLS estimates
The ordinary least squares estimate of $(\mu, \boldsymbol{\beta})$ is the solution to the following optimization problem:

$(\hat{\mu}, \hat{\boldsymbol{\beta}}) = \arg\min_{\mu, \boldsymbol{\beta}} \sum_{i=1}^{n}\left(y_i - \mu - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$

The argument to be minimized is the residual sum of squares. The solution to the above optimization problem is given by $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (with the intercept absorbed into $\mathbf{X}$ as a column of ones).
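
A small sketch of computing the OLS fit on the simulated data above (an illustration, not code from the slides); note that with p >> n the matrix $\mathbf{X}'\mathbf{X}$ is singular, so lstsq returns a minimum-norm solution rather than a unique OLS estimate:

```python
import numpy as np

W = np.column_stack([np.ones(n), X])           # design matrix: intercept column plus markers
coef, *_ = np.linalg.lstsq(W, y, rcond=None)   # minimizes the residual sum of squares ||y - W b||^2
mu_hat, beta_hat = coef[0], coef[1:]
```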

MULTIPLE REGRESSION: standard assumptions
- The residuals have common variance and are statistically independent (zero covariance).
- The response vector Y is a function of the random errors plus constants; thus Var(Y) = Var(error).
- The predictors are independent (not collinear).

The collinearity problem
- Singularity arises when some linear combination of the columns of X is zero or very close to zero; in that case a unique OLS solution $\hat{\boldsymbol{\beta}}$ does not exist.
- When X is only close to singular (nearly singular), a solution exists but it is very unstable, and the variances of the regression coefficients become very large.
- Interdependent predictors that are closely linked in the system being studied (e.g., molecular markers in linkage disequilibrium) cause near-singularities in X.
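
A quick illustration of this instability (my own example with two nearly identical predictor columns, not from the slides):

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_obs = 100
x1 = rng_demo.normal(size=n_obs)
x2 = x1 + rng_demo.normal(scale=1e-3, size=n_obs)   # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])

XtX = X_demo.T @ X_demo
print(np.linalg.cond(XtX))            # huge condition number: X'X is nearly singular
print(np.diag(np.linalg.inv(XtX)))    # Var(beta_hat) is proportional to these (very large) entries
```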

Penalized regression
- The least squares estimators of the regression coefficients are BLUE (best linear unbiased estimators), i.e., they have minimum variance among all linear unbiased estimators.
- Under collinearity this minimum variance may still be unacceptably large.
- Idea: relax the condition of unbiasedness.
- A measure of the average closeness of an estimator to the parameter being estimated is the mean squared error (MSE) of the estimator: MSE = variance + squared bias.

The challenges of high-dimensional marker data
Two different approaches can be used to confront the challenges posed by p >> n:
1. Subset selection: design an algorithm to select k out of p (k < p) predictors; the final model includes only these k predictors.
2. Shrinkage estimation: use all available predictors and confront the challenges posed by regressions with p > n by means of shrinkage estimation methods.

II. Statistical Methods for Genome-Enabled Prediction of Complex Traits

OUTLINE
- Shrinkage (penalized) linear regression
  - Ridge regression (RR)
  - GBLUP
- Bayesian version of RR and GBLUP

Shrinkage (Penalized) estimates
An approach used to solve the problems emerging in large-p with small-n regressions is to use penalized estimates; these estimates are obtained as the solution to an optimization problem that balances two components: how well the model fits the data and how complex the model is.

General form of the optimization problem

$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ L(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X}) + \lambda J(\boldsymbol{\beta}) \right\}$

where $L(\boldsymbol{\beta}; \mathbf{y}, \mathbf{X})$ is a loss function that measures fitness (lack of fit) of the model to the data, $J(\boldsymbol{\beta})$ measures model complexity (degrees of freedom), and $\lambda \ge 0$ is a regularization parameter controlling the trade-off between fitness and model complexity.

Ridge Regression (RR)
The lack of fit of the model is measured by the residual sum of squares, and model complexity is measured by $\sum_{j \in S}\beta_j^2$, where S defines the set of coefficients to be penalized. Then

$\hat{\boldsymbol{\beta}}^{RR} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}\left(y_i - \mu - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j \in S}\beta_j^2 \right\}$

or, in matrix notation,

$\hat{\boldsymbol{\beta}}^{RR} = \arg\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}'\mathbf{D}\boldsymbol{\beta} \right\}$

Ridge Regression (RR)
The first-order conditions of the above optimization problem are satisfied by the following system of linear equations:

$(\mathbf{X}'\mathbf{X} + \lambda\mathbf{D})\hat{\boldsymbol{\beta}}^{RR} = \mathbf{X}'\mathbf{y}$

In penalized estimation, the regularization parameter $\lambda$ controls the trade-off between model complexity and model goodness of fit. This affects the parameter estimates (their values and their statistical properties), the goodness of fit of the model to the training data set, and the ability of the model to predict unobserved phenotypes.
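
For completeness, a one-line derivation of this system (standard calculus, not spelled out on the slide): setting the gradient of the penalized residual sum of squares to zero,

$\frac{\partial}{\partial\boldsymbol{\beta}}\left[(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}'\mathbf{D}\boldsymbol{\beta}\right] = -2\mathbf{X}'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}) + 2\lambda\mathbf{D}\boldsymbol{\beta} = \mathbf{0} \;\Rightarrow\; (\mathbf{X}'\mathbf{X}+\lambda\mathbf{D})\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}$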

Ridge Regression (RR)
Singular square matrices can be made non-singular by adding a constant to the diagonal of the matrix. If $\mathbf{X}'\mathbf{X}$ is singular, then $\mathbf{X}'\mathbf{X} + k\mathbf{I}$ is non-singular, where k is a small positive constant. This small quantity makes the off-diagonal elements appear relatively less important and thus suppresses the near-singularity. Ridge regression works with the centered and scaled independent variables (Z).

RIDGE REGRESSION (RR) -- SUMMARY
The OLS estimates of the regression coefficients are the solution to the following system of equations:

$(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}}^{OLS} = \mathbf{X}'\mathbf{y}$

The RR estimates have a very similar form; we simply add a constant to the diagonal of the matrix of coefficients, that is,

$(\mathbf{X}'\mathbf{X} + \lambda\mathbf{D})\hat{\boldsymbol{\beta}}^{RR} = \mathbf{X}'\mathbf{y}$

where $\lambda \ge 0$ is a constant and $\mathbf{D}$ is a diagonal matrix with zero in its first diagonal entry ($d_1 = 0$, to avoid shrinking the estimate of the intercept) and ones in the remaining diagonal entries. When $\mathbf{D}$ or $\lambda$ equals zero, the solution to the above problem is OLS. Adding a constant to the diagonal entries of the coefficient matrix, $\mathbf{X}'\mathbf{X}$, makes it non-singular and shrinks the estimates of the regression coefficients other than the intercept towards zero. This induces bias but reduces the variance of the estimates; in large-p with small-n problems this may reduce the MSE of the estimates and may yield more accurate predictions.
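
A minimal sketch of these ridge equations applied to the simulated data above (the value of lambda is an arbitrary assumption made here for illustration; in practice it is chosen by cross-validation or derived from variance components):

```python
import numpy as np

def ridge_solve(X, y, lam):
    """Solve (W'W + lam*D) b = W'y, where D = diag(0, 1, ..., 1) leaves the intercept unshrunk."""
    n, p = X.shape
    W = np.column_stack([np.ones(n), X])     # intercept column plus markers
    D = np.eye(p + 1)
    D[0, 0] = 0.0                            # d_1 = 0: do not penalize the intercept
    coef = np.linalg.solve(W.T @ W + lam * D, W.T @ y)
    return coef[0], coef[1:]

mu_rr, beta_rr = ridge_solve(X, y, lam=50.0)
gebv = X @ beta_rr                           # genomic estimated breeding values
```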

RIDGE REGRESSION (RR-BLUP)
One of the first penalized regression methods used in genomic prediction is Ridge Regression (RR), which is equivalent to the mixed model that yields the Best Linear Unbiased Predictor (BLUP). This marker-based model (also called RR-BLUP) is expressed as

$\mathbf{y} = \mu\mathbf{1} + \mathbf{Z}\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$

where $\mathbf{Z}$ is the design matrix that relates individuals to phenotypic observations (with $n_i$ the number of observations collected for the i-th individual) and $\mathbf{X}$ is the genotype matrix for the bi-allelic markers.

RIDGE REGRESSION (RR-BLUP)
The solution to the optimization problem for RR can be written as

$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}'\mathbf{y}$, with $\lambda = \sigma_e^2/\sigma_\beta^2$

Here the ridge parameter $\lambda$ is the ratio between the residual variance and the marker variance, and it induces shrinkage of the marker effects toward zero. Since $\hat{\boldsymbol{\beta}}$ is the vector of estimated marker effects, the genomic estimated breeding value (GEBV) is $\widehat{\mathbf{g}} = \mathbf{X}\hat{\boldsymbol{\beta}}$.
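
A brief numeric illustration (continuing the simulated data; the two lambda values are arbitrary assumptions) of how a larger ratio of residual to marker variance shrinks the estimated marker effects more strongly:

```python
import numpy as np

for lam in (1.0, 1000.0):                      # small vs. large sigma2_e / sigma2_beta
    b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ (y - y.mean()))
    print(lam, np.linalg.norm(b))              # the norm of the effect estimates decreases as lam grows
```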

RIDGE REGRESSION (RR-BLUP) -- AN EQUIVALENT MODEL (GBLUP)
Let $\mathbf{g} = \mathbf{X}\boldsymbol{\beta}$; then the model is $\mathbf{y} = \mu\mathbf{1} + \mathbf{g} + \boldsymbol{\varepsilon}$, with $\mathbf{g} \sim N(\mathbf{0}, \mathbf{X}\mathbf{X}'\sigma_\beta^2)$ and $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$, where $\sigma_e^2$ is the unknown residual variance parameter estimated from the data. When $\mathbf{X}\mathbf{X}'\sigma_\beta^2$ is written as $\mathbf{G}\sigma_g^2$ (for example, $\mathbf{G} = \mathbf{X}\mathbf{X}'/p$ and $\sigma_g^2 = p\sigma_\beta^2$), the model is named GBLUP, and the G matrix is the genomic-derived relationship matrix. These two models are equivalent, but the last model, parameterized in terms of $\mathbf{g}$, does not provide the marker effects; it is, however, computationally simpler than the previous one.
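
A sketch of the GBLUP formulation on the simulated data (the centered-genotype construction of G and the variance-component values are assumptions made here for illustration):

```python
import numpy as np

Xc = X - X.mean(axis=0)                      # center marker genotypes
G = Xc @ Xc.T / Xc.shape[1]                  # an n x n genomic relationship matrix
sigma2_e, sigma2_g = 1.0, 1.0                # illustrative variance components
yc = y - y.mean()

# BLUP of the genetic values: g_hat = G (G + (sigma2_e/sigma2_g) I)^-1 (y - mean)
g_hat = G @ np.linalg.solve(G + (sigma2_e / sigma2_g) * np.eye(len(y)), yc)
```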

BAYESIAN VERSION OF RR
From a Bayesian perspective, $\hat{\boldsymbol{\beta}}^{RR}$ can be viewed as the conditional posterior mode in a model with a Gaussian likelihood and IID (independent and identically distributed) Gaussian marker effects, that is,

$\mathbf{y} \mid \mu, \boldsymbol{\beta}, \sigma_e^2 \sim N(\mu\mathbf{1} + \mathbf{X}\boldsymbol{\beta},\, \mathbf{I}\sigma_e^2)$, with $\beta_j \mid \sigma_\beta^2 \sim N(0, \sigma_\beta^2)$, or $p(\boldsymbol{\beta} \mid \sigma_\beta^2) = \prod_{j=1}^{p} N(\beta_j \mid 0, \sigma_\beta^2)$.

BAYESIAN VERSION OF RR
Here $\sigma_\beta^2$ is the prior variance of the marker effects. The posterior mean and mode of the above model are equal to the RR estimate with $\lambda = \sigma_e^2/\sigma_\beta^2$, and thus $\hat{\boldsymbol{\beta}}$ is the BLUP of the marker effects, which is used to obtain the predicted genetic values of the individuals (their GEBV).

BAYESIAN VERSION OF RR
The posterior distribution of $\boldsymbol{\beta}$ in the above model is multivariate normal, with mean (and covariance matrix) given by the solution (and the inverse of the coefficient matrix) of the following system:

$\left(\mathbf{X}'\mathbf{X} + \frac{\sigma_e^2}{\sigma_\beta^2}\mathbf{I}\right)\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$

This is just the RR system of equations, and $\hat{\boldsymbol{\beta}}$ is also the Best Linear Unbiased Predictor (BLUP) of $\boldsymbol{\beta}$ given $\mathbf{y}$. Recall that the ratio $\sigma_e^2/\sigma_\beta^2$ is equivalent to $\lambda$ in RR. In a fully Bayesian model we assign priors to each of these variance parameters; this allows inferring these unknowns from the same training data that is used to estimate the marker effects.

BAYESIAN VERSION OF RR
Therefore, from $\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X} + \frac{\sigma_e^2}{\sigma_\beta^2}\mathbf{I}\right)^{-1}\mathbf{X}'\mathbf{y}$, the predicted genetic values (GEBV) obtained using the BLUP of the marker effects are

$\widehat{\mathbf{g}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}\left(\mathbf{X}'\mathbf{X} + \frac{\sigma_e^2}{\sigma_\beta^2}\mathbf{I}\right)^{-1}\mathbf{X}'\mathbf{y}$

How is the GBLUP obtained? Change the variable to $\mathbf{g} = \mathbf{X}\boldsymbol{\beta}$ and get $\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)$, with $\mathbf{G}\sigma_g^2 = \mathbf{X}\mathbf{X}'\sigma_\beta^2$.

BAYESIAN VERSION OF GBLUP
Alternatively, using properties of the multivariate normal distribution,

$\widehat{\mathbf{g}} = \mathbf{G}\sigma_g^2\left(\mathbf{G}\sigma_g^2 + \mathbf{I}\sigma_e^2\right)^{-1}\mathbf{y} = \mathbf{G}\left(\mathbf{G} + \frac{\sigma_e^2}{\sigma_g^2}\mathbf{I}\right)^{-1}\mathbf{y}$

(with $\mathbf{y}$ centered or adjusted for the intercept). Therefore, with p >> n the last equation is computationally more convenient, since it works with an n x n rather than a p x p system. However, the last expression does not yield estimates of the marker effects.
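
A numerical check (continuing the simulated data and the G matrix defined above; the link $\sigma_\beta^2 = \sigma_g^2/p$ follows from the choice $\mathbf{G} = \mathbf{X}_c\mathbf{X}_c'/p$ used there) that the marker-based and G-based expressions give the same GEBVs:

```python
import numpy as np

lam = sigma2_e / (sigma2_g / Xc.shape[1])    # sigma2_e / sigma2_beta, with sigma2_beta = sigma2_g / p
beta_blup = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ yc)
gebv_markers = Xc @ beta_blup                # p x p route: marker effects, then GEBV

gebv_G = G @ np.linalg.solve(G + (sigma2_e / sigma2_g) * np.eye(len(yc)), yc)  # n x n route
print(np.allclose(gebv_markers, gebv_G))     # True: the two routes are equivalent
```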

SUMMARY: BAYESIAN VERSION OF RR AND GBLUP
These three models are equivalent:

$\mathbf{y} = \mu\mathbf{1} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with $\beta_j \sim N(0, \sigma_\beta^2)$;
$\mathbf{y} = \mu\mathbf{1} + \mathbf{g} + \boldsymbol{\varepsilon}$, with $\mathbf{g} \sim N(\mathbf{0}, \mathbf{X}\mathbf{X}'\sigma_\beta^2)$;
$\mathbf{y} = \mu\mathbf{1} + \mathbf{g} + \boldsymbol{\varepsilon}$, with $\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)$.

If the genotypes are centered, then the G matrix can be calculated using $\mathbf{G} = \mathbf{X}_c\mathbf{X}_c' \big/ \left(2\sum_{j=1}^{p} p_j(1-p_j)\right)$, where $p_j$ is the allele frequency of the j-th marker.

HOW TO MAP FROM $\widehat{\mathbf{g}}$ TO $\hat{\boldsymbol{\beta}}$
The genetic values are $\widehat{\mathbf{g}} = \mathbf{X}\hat{\boldsymbol{\beta}}$; then $\hat{\boldsymbol{\beta}} = \mathbf{X}'(\mathbf{X}\mathbf{X}')^{-1}\widehat{\mathbf{g}}$ is an estimate of $\boldsymbol{\beta}$. Here $p_j$ is the allele frequency of the j-th marker. Then the contribution of each marker to the genetic variance is $2p_j(1-p_j)\hat{\beta}_j^2$, and the breeding value at each marker is $x_{ij}\hat{\beta}_j$.
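
A closing sketch under the same assumptions as the earlier code (variable names such as p_freq and var_contrib are my own; the back-solving step uses a pseudo-inverse because the centered $\mathbf{X}_c\mathbf{X}_c'$ is rank-deficient):

```python
import numpy as np

# Map GBLUP genetic values back to marker effects: beta_hat = X'(XX')^+ g_hat
beta_back = Xc.T @ (np.linalg.pinv(Xc @ Xc.T) @ gebv_G)    # recovers the beta_blup computed earlier

p_freq = X.mean(axis=0) / 2.0                               # estimated allele frequencies p_j
var_contrib = 2.0 * p_freq * (1.0 - p_freq) * beta_back**2  # each marker's contribution to genetic variance
marker_bv = Xc * beta_back                                  # breeding value contributed by each marker (n x p)
print(var_contrib[:5])
```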