Aaron Lorenz Department of Agronomy and Horticulture

Presentation transcript:

Genomic selection. Aaron Lorenz, Department of Agronomy and Horticulture

Role of markers in crop improvement. Varies by objective, germplasm, and trait genetic architecture (Bernardo, 2008).

Genomic selection. [Diagram: training population (calibration set) with DNA marker data and phenotypic data → model training → predict and select among selection candidates.] No QTL mapping. No testing for significant markers. I’m not going to give much background on genomic prediction because I think it has become well enough known as a method. I’ll only introduce some terminology. The basic idea is that you genotype a large population of individuals and phenotype them well (this is called a training population, or sometimes a calibration set), combine all marker and phenotypic data into a single statistical model (termed model training), and use the model to predict the genetic value of individuals that have been genotyped but not phenotyped. You get your predictions and you select on them just as you would on phenotypes. Note that there is no QTL mapping and no declaration of significant markers. We’re using all markers.
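
The train, predict, select loop described above can be sketched in a few lines of Python. This is only an illustration on simulated data: the population sizes, the -1/0/1 marker coding, the ridge penalty `lam`, and the choice of ridge regression as the training model are all assumptions made for the example, not details taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 phenotyped training lines, 50 unphenotyped candidates.
n_train, n_cand, n_markers = 200, 50, 1000

# Marker scores coded -1/0/1 (an assumed coding).
X_train = rng.integers(-1, 2, size=(n_train, n_markers)).astype(float)
X_cand = rng.integers(-1, 2, size=(n_cand, n_markers)).astype(float)

# Simulated phenotypes: every marker has a small effect, plus noise.
true_effects = rng.normal(0.0, 0.1, size=n_markers)
y_train = X_train @ true_effects + rng.normal(0.0, 1.0, size=n_train)

# Model training: all markers are fitted jointly with a ridge penalty.
lam = 100.0  # arbitrary illustrative value
mu = y_train.mean()
lhs = X_train.T @ X_train + lam * np.eye(n_markers)
marker_effects = np.linalg.solve(lhs, X_train.T @ (y_train - mu))

# Prediction: candidates are genotyped but not phenotyped.
gebv = mu + X_cand @ marker_effects

# Selection: keep the 10 candidates with the highest predicted values.
selected = np.argsort(gebv)[::-1][:10]
```

No marker is tested for significance at any point; every marker contributes to `gebv` through its estimated effect.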

A genome-wide approach typically provides better predictions. [Figures: prediction accuracy (rA) of GS versus MAS; Lorenzana and Bernardo (2009) and Lorenz (2013).] One very nice thing about taking a genome-wide approach is that it typically works better than a QTL mapping/MAS approach. This study from Rex Bernardo’s lab looked at 36 population-trait combinations, and in nearly every case GS does substantially better than MAS. The figure on the right is from a simulation study of my own showing the advantage in prediction accuracy of GS compared to MAS.

Whittaker et al. (2000) When doing MAS, cannot include all the markers, so must select subset of markers to fit. No entirely satisfactory way of doing this exists. Objective is to evaluate ridge regression. Superior to subset selection when objective is to make predictions.

Whittaker et al. (2000) Find subset of markers Q. Interested in [equation not preserved]. Cannot include all markers in Q: doing so increases the variance of β, and if the number of markers is really large, there are not enough d.f.

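Whittaker et al. (2000) Ridge regression: include all variables, but replace the normal least-squares estimators $\hat{\beta} = (X^T X)^{-1} X^T y$ with $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. Normal estimates are shrunk toward 0. Degree of shrinkage is determined by $\lambda$. Choose $\lambda$ to minimize model error. Addition of the $\lambda I$ term reduces collinearity and prevents the matrix being inverted, $X^T X + \lambda I$, from being singular even when $X^T X$ itself is.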
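
A minimal numerical sketch of the ridge estimator, $(X^T X + \lambda I)^{-1} X^T y$, may help here. The data are random and the function name `ridge_estimates` is hypothetical; the point is only that with more markers than individuals $X^T X$ is rank deficient, the ridge system is still solvable, and a larger $\lambda$ shrinks the estimates further toward zero.

```python
import numpy as np

def ridge_estimates(X, y, lam):
    """Return the ridge solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 20, 50  # more markers (p) than individuals (n)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X'X is p x p but has rank at most n, so it cannot be inverted when p > n.
rank_xtx = np.linalg.matrix_rank(X.T @ X)

# Adding lam*I makes the system solvable; larger lam means more shrinkage.
beta_small = ridge_estimates(X, y, lam=1.0)
beta_large = ridge_estimates(X, y, lam=100.0)
norm_small = np.linalg.norm(beta_small)
norm_large = np.linalg.norm(beta_large)
```

Comparing `norm_small` and `norm_large` shows the shrinkage directly: the overall size of the estimated effects decreases as `lam` increases.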

Whittaker et al. (2000)

MHG 2001 (Meuwissen, Hayes and Goddard, 2001). Objective: “Compare statistical methods for their accuracy in predicting total breeding value of individuals in a situation where a limited number of recorded individuals are genotyped for many markers.” Computer simulation; 2000 individuals; need to estimate 50,000 haplotype effects. The whole story starts with a simulation study in 2001. The authors set out to see how accurately BVs could be predicted assuming very dense marker data were available, much denser than was available at the time, but they were looking forward. The authors noted the arbitrariness of setting a marker effect to its full value or to zero simply because it did or did not surpass some predetermined, and arbitrary, threshold.

MHG 2001 [Figure: r(GEBV:True BV) for the models compared.]

Genomic selection models. LARGE p, smaller n!! Shrinkage models: RR-BLUP, G-BLUP. Dimension reduction methods: partial least squares, principal component regression. Variable selection models: BayesB, BayesCπ, BayesDπ. Kernel and machine learning methods: support vector machine regression. [Table: training population, with columns Line, Yield, Mrk 1, Mrk 2, …, Mrk p and rows Line 1 (yield 76, Mrk 1 = 1), Line 2 (56), Line 3 (45), Line 4 (67), …, Line n (22).] In the day of high-density markers, we probably have many more markers than observations, resulting in the well-known large p, small n problem. This means ordinary least squares cannot be used for estimation, but a variety of other, more sophisticated models can be used. The most popular is RR-BLUP, where marker effects are treated as random effects sampled from a common distribution. That’s all I’ll say about that.

Baseline model. More predictors than observations. Solution: fit predictors as random effects, which constrains the possible effects. What distribution is β being sampled from?

Priors and penalizations (examples)

Double exponential distribution versus normal distribution. These represent two different assumptions about the underlying distribution of QTL effects.
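
The link between these two priors and the corresponding penalties can be written out as follows. This is a standard result stated under assumptions that are not on the slide: residuals are taken to be $N(0, \sigma_e^2)$, marker effects are independent a priori, and $\hat{\beta}$ denotes the posterior mode (Bayesian implementations typically report posterior means instead). The symbols $\sigma_\beta^2$ and $b$ are introduced here for the prior variance and the double exponential scale.

```latex
\begin{aligned}
\text{Normal prior:}\quad
  & \beta_j \sim N(0, \sigma_\beta^2)
  \;\Rightarrow\;
  \hat{\beta} = \arg\min_{\beta}\; \lVert y - X\beta \rVert^2
      + \lambda \sum_j \beta_j^2,
  \qquad \lambda = \frac{\sigma_e^2}{\sigma_\beta^2} \\[4pt]
\text{Double exponential prior:}\quad
  & p(\beta_j) = \frac{1}{2b}\, e^{-|\beta_j|/b}
  \;\Rightarrow\;
  \hat{\beta} = \arg\min_{\beta}\; \lVert y - X\beta \rVert^2
      + \lambda \sum_j |\beta_j|,
  \qquad \lambda = \frac{2\sigma_e^2}{b}
\end{aligned}
```

The normal prior gives the ridge (squared) penalty, which shrinks all effects proportionally; the double exponential prior gives the LASSO (absolute value) penalty, which shrinks small effects hard while penalizing large effects less severely than the squared penalty does.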

de Los Campos et al. (2013) Priors

Marker effect estimates. [Figure panels: large-effect QTL simulated; many small-effect QTL simulated; estimates from BayesCπ and RR-BLUP.] I didn’t think that example was illustrative enough, so I simulated some data. Here, we have a large-effect QTL present. You can see RR-BLUP shrinks this thing way down, whereas BayesCπ, the variable selection method, allows it to have an effect probably closer to reality.

Comparing marker effects between models

G-BLUP. Similar to traditional BLUP with pedigrees. Calculate the genomic relationship matrix. Use genomic relationships in a mixed linear model to predict the breeding value of relatives.
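
As a rough illustration of the "calculate genomic relationship matrix" step, the sketch below uses one commonly used allele-frequency-centered form, $G = ZZ^T / (2 \sum_k p_k (1 - p_k))$. The slides do not say which form was used, so this choice, the 0/1/2 genotype coding, and the function name are assumptions made for the example.

```python
import numpy as np

def genomic_relationship_matrix(M):
    """Genomic relationships from an n x p genotype matrix.

    M holds counts (0, 1, 2) of one allele at each marker (assumed coding).
    """
    freq = M.mean(axis=0) / 2.0                 # allele frequency per marker
    Z = M - 2.0 * freq                          # center each marker column
    scale = 2.0 * np.sum(freq * (1.0 - freq))   # puts G on a pedigree-like scale
    return Z @ Z.T / scale

rng = np.random.default_rng(2)
M = rng.integers(0, 3, size=(6, 500)).astype(float)  # 6 lines, 500 markers
G = genomic_relationship_matrix(M)
```

`G` then takes the place of the pedigree-based relationship matrix in the mixed model, so phenotyped and unphenotyped lines are connected through their marker similarity.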

[Figure labels: training pop., selection candidates.] Relationships between TP and selection candidates are leveraged for prediction.

Equivalency between RR-BLUP and G-BLUP. Follows from MVN distribution properties. Only valid with the normal prior!
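
The equivalence can be checked numerically. If marker effects are $\beta \sim N(0, I\sigma_\beta^2)$, then the genetic values $u = Z\beta$ are multivariate normal with covariance $ZZ^T\sigma_\beta^2$, so predicting through marker effects or through a relationship matrix proportional to $ZZ^T$ gives the same answer. The sketch below is a minimal check on random data; the matrix sizes and `lam` (standing in for $\sigma_e^2/\sigma_\beta^2$) are arbitrary assumptions, and the relationship matrix is left unscaled so that the same `lam` applies on both sides.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 40
Z = rng.integers(-1, 2, size=(n, p)).astype(float)  # marker matrix, -1/0/1 coding
y = rng.normal(size=n)
y = y - y.mean()                                    # phenotypes centered on the mean
lam = 5.0                                           # plays the role of sigma_e^2 / sigma_beta^2

# RR-BLUP: estimate p marker effects, then sum them up for each individual.
marker_effects = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
gebv_rrblup = Z @ marker_effects

# G-BLUP: work with the n x n matrix K = ZZ' and never estimate marker effects.
K = Z @ Z.T
gebv_gblup = K @ np.linalg.solve(K + lam * np.eye(n), y)

max_abs_diff = np.max(np.abs(gebv_rrblup - gebv_gblup))
```

The two vectors of predicted genetic values agree to numerical precision. G-BLUP solves an n x n system instead of a p x p one, which is cheaper when markers far outnumber individuals.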

Predicting prediction accuracy. Daetwyler et al. (2008); Lian et al. (2014). N = training population size; h2 = trait heritability; Me = effective number of loci; r2 = LD between marker and QTL (see Lian et al., 2014).
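
The equation on this slide did not survive in the transcript. As a hedged sketch, the function below implements the basic form of the expected accuracy usually attributed to Daetwyler et al. (2008), $r = \sqrt{N h^2 / (N h^2 + M_e)}$. It deliberately leaves out the marker-QTL LD term ($r^2$) listed on the slide, since how that term enters is not recoverable from the transcript (see Lian et al., 2014). The function name and the example values are hypothetical.

```python
import math

def expected_accuracy(n_train, h2, m_e):
    """Expected prediction accuracy, sqrt(N*h2 / (N*h2 + Me)).

    n_train: training population size (N)
    h2:      trait heritability
    m_e:     effective number of loci (Me)
    The marker-QTL LD term is not modeled here.
    """
    return math.sqrt(n_train * h2 / (n_train * h2 + m_e))

# Illustrative values only: doubling N raises accuracy, with diminishing returns.
acc_base = expected_accuracy(n_train=1000, h2=0.5, m_e=500)
acc_double_n = expected_accuracy(n_train=2000, h2=0.5, m_e=500)
acc_low_h2 = expected_accuracy(n_train=1000, h2=0.2, m_e=500)
```

Under this expression N and h2 only enter as the product N*h2, so a low-heritability trait needs a proportionally larger training population to reach the same expected accuracy.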

Factors affecting prediction accuracy: training population size; trait heritability (influence of G x E, precision of measurements); marker density; effective population size of the breeding population, i.e., genetic diversity of the breeding population; genetic relationship between training population and selection candidates; statistical model.

Effect of relationships: Predicting across populations. 1180 polymorphic markers. [Figure: PCA plot, PC 1 vs. PC 2, showing Subpop 1 and Subpop 2 with training sets and validation sets; programs labeled BuschAg, University of MN, NDSU 6-row.] Here is a typical example. Here we have a PCA plot from marker data of barley lines from three different breeding programs.

Effect of relationships: Presence of relatives in TP. [Figure: prediction accuracy vs. mean relationship of top ten relatives; Clark et al. (2012).]

Models typically similar in accuracy. [Figure: accuracy of RR-BLUP, BayesCπ, and Bayesian LASSO.] Models also equivalent in: Bernardo and Yu (2007) [Maize]; Lorenzana and Bernardo (2009) [Several plant species]; VanRaden et al. (2009) [Holstein]; Hayes (2009) [Holstein]. Despite the different assumptions about genetic architecture made by the different models, and the fact that QTL effects are not of equal size and do have different genetic architectures, including epistasis, the simplest model, RR-BLUP, which assumes all QTL effects have the same variance, often does just as well as the more “realistic” models, especially in empirical studies. The reason for this is probably that LD within domesticated species is extensive, and therefore several markers absorb the effect of a large-effect QTL, making it seem that many markers control a trait, as RR-BLUP assumes.

Why? Extensive LD in plant and animal breeding programs. Perfect situation for G-BLUP: long stretches of genome that are identical by descent mean that relationships calculated with markers are good indicators of relationships at causal polymorphisms. Extensive LD also means it’s hard for variable selection models to zero in on markers in tight LD with causal polymorphisms. Expect variable selection models to be superior when: individuals are unrelated; the TP is very large (millions?); marker density is very high, so that markers are in LD with causal polymorphisms.

Resources and packages
rrBLUP package: cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
Endelman, J.B., and J-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3 2:1045.
BLR (Bayesian Linear Regression) package: http://bglr.r-forge.r-project.org/
Perez et al. 2010. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3:106-116.

References
Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48:1649-1664.
Clark, S.A., J.M. Hickey, H.D. Daetwyler and J.H.J. van der Werf. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44:.
Daetwyler, H.D., B. Villanueva and J.A. Woolliams. 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3:.
de los Campos, G., J.M. Hickey, R. Pong-Wong, H.D. Daetwyler and M.P.L. Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327-+.
Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci.
Lorenz, A.J. 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: A simulation experiment. G3-Genes Genomes Genetics 3:481-491.
Lorenzana, R.E. and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120:151-161.
Meuwissen, T.H., B.J. Hayes and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829.
Whittaker, J.C., R. Thompson and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75:249-252.