Genomic selection. Aaron Lorenz, Department of Agronomy and Horticulture
Role of markers in crop improvement: varies by objective, germplasm, and trait genetic architecture (Bernardo, 2008).
Genomic selection. [Diagram: training population (calibration set) with DNA marker data and phenotypic data; model training; predict and select among selection candidates. No QTL mapping. No testing for significant markers.] I'm not going to give much background on genomic prediction because I think it has become well known enough as a method. I'll only introduce some terminology. The basic idea is that you genotype a large population of individuals and phenotype them well (this is called a training population or sometimes a calibration set), combine all marker and phenotypic data into a single statistical model (termed model training), and use the model to predict the genetic value of individuals that have been genotyped but not phenotyped (the selection candidates). You get your predictions and you select on them just as you would on phenotypes. Note that there is no QTL mapping and no declaration of significant markers. We're using all markers.
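The workflow described above (train on genotyped and phenotyped lines, predict genotyped-only candidates, select on the predictions) can be sketched as follows. This is only an illustration on simulated data: the population sizes, the -1/0/1 marker coding, the shrinkage value `lam`, and the choice of ridge regression as the model are all assumptions made for the example, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 phenotyped training lines, 50 unphenotyped candidates, 500 markers.
n_train, n_cand, p = 200, 50, 500

# Simulated marker scores (-1/0/1) and simulated marker effects, for illustration only.
X_all = rng.integers(-1, 2, size=(n_train + n_cand, p)).astype(float)
true_beta = rng.normal(0.0, 0.1, size=p)
genetic_value = X_all @ true_beta

X_train, X_cand = X_all[:n_train], X_all[n_train:]
y_train = genetic_value[:n_train] + rng.normal(0.0, 1.0, size=n_train)

# Model training: all markers are fitted jointly, with no significance testing.
lam = 100.0  # assumed shrinkage parameter
mu = y_train.mean()
marker_means = X_train.mean(axis=0)
Xc = X_train - marker_means
beta_hat = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y_train - mu))

# Prediction: candidates have marker data but no phenotypes.
pred = mu + (X_cand - marker_means) @ beta_hat

# Selection: keep the 10 candidates with the highest predicted values.
selected = np.argsort(pred)[::-1][:10]

# Only possible in simulation: correlation of predictions with true genetic values.
accuracy = np.corrcoef(pred, genetic_value[n_train:])[0, 1]
```

In a real program the true genetic values are unknown, so accuracy would instead be estimated by cross-validation within the training population.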
A genome-wide approach typically provides better predictions. [Figures: accuracy (rA) of MAS vs. GS, from Lorenzana and Bernardo (2009) and from Lorenz (2013).] One very nice thing about taking a genome-wide approach is that it typically works better than a QTL mapping/MAS approach. This study from Rex Bernardo's lab looked at 36 population-trait combinations, and in nearly every case GS does substantially better than MAS. The figure on the right is from a simulation study of my own showing the advantage in prediction accuracy of GS compared to MAS.
Whittaker et al. (2000) When doing MAS, cannot include all the markers, so must select subset of markers to fit. No entirely satisfactory way of doing this exists. Objective is to evaluate ridge regression. Superior to subset selection when objective is to make predictions.
Whittaker et al. (2000). Find a subset of markers, Q. Interested in the regression on the markers in Q. Cannot include all markers in Q: doing so increases the variance of β, and if the number of markers is really large, there are not enough d.f.
Whittaker et al. (2000). Ridge regression: include all variables, but replace the normal least-squares estimators with $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. The normal estimates are shrunk toward 0. The degree of shrinkage is determined by $\lambda$. Choose $\lambda$ to minimize model error. Addition of the $\lambda I$ term reduces collinearity and prevents the matrix $X^T X$ from becoming singular.
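A minimal numerical sketch of the ridge estimator may help here. It shows that adding λI makes the system solvable even when XᵀX is singular, and that larger λ shrinks the estimates toward zero. The simulated data, the helper name `ridge_estimate`, and the grid of λ values are assumptions for illustration; the slide does not specify how λ was chosen beyond minimizing model error.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Return the ridge solution (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 60                     # more markers than observations
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0]                 # two perfectly collinear markers
y = rng.normal(size=n)

# X'X has rank at most n, so it is singular and ordinary least squares has no unique solution.
rank_xtx = np.linalg.matrix_rank(X.T @ X)

# The ridge system is still solvable; the norm of the estimates falls as lambda grows.
lams = (0.1, 1.0, 10.0, 100.0)
norms = {lam: float(np.linalg.norm(ridge_estimate(X, y, lam))) for lam in lams}

# Collinear markers split the effect evenly instead of producing unstable estimates.
b = ridge_estimate(X, y, 1.0)
```

One common way to choose λ, not shown here, is to evaluate prediction error by cross-validation over a grid of candidate values.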
Whittaker et al. (2000)
MHG 2001. Objective: “Compare statistical methods for their accuracy in predicting total breeding value of individuals in a situation where a limited number of recorded individuals are genotyped for many markers.” Computer simulation; 2000 individuals; need to estimate 50,000 haplotype effects. The whole story starts with a simulation study in 2001. The authors set out to see how accurately breeding values (BVs) could be predicted assuming very dense marker data were available, much denser than anything available at the time; they were looking forward. The authors noted the arbitrariness of setting a marker effect to its full value or to zero simply because it did or did not surpass some predetermined, and arbitrary, threshold.
MHG 2001. [Results: r(GEBV, true BV), the correlation between genomic estimated breeding values and true breeding values.] Genomic selection models
Genomic selection models. LARGE p, smaller n!!
Shrinkage models: RR-BLUP, G-BLUP
Dimension reduction methods: partial least squares, principal component regression
Variable selection models: BayesB, BayesCπ, BayesDπ
Kernel and machine learning methods: support vector machine regression
Training population:
Line     Yield   Mrk 1   Mrk 2   …   Mrk p
Line 1   76      1       …
Line 2   56      …
Line 3   45      …
Line 4   67      …
…
Line n   22      …
…and in the day of high-density markers, this means we probably have many more markers than observations, resulting in the well-known large p, small n problem. This means ordinary least squares cannot be used for estimation, but a variety of other, more sophisticated models can be used. The most popular is RR-BLUP, where marker effects are treated as random effects sampled from a common distribution. That's all I'll say about that.
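To make the large p, small n problem concrete, the sketch below shows that when there are more markers than lines, two very different sets of marker effects give exactly the same fitted values, so ordinary least squares cannot identify the effects. The five yields are the ones shown in the training population table; the 0/1 marker scores and the number of markers are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 20                                      # 5 lines, 20 markers
X = rng.integers(0, 2, size=(n, p)).astype(float) # simulated marker scores
y = np.array([76.0, 56.0, 45.0, 67.0, 22.0])      # yields from the table

# The marker matrix can have rank at most n, which is less than p.
rank = np.linalg.matrix_rank(X)

# One least-squares solution (the minimum-norm one).
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last right singular vector lies in the null space of X because rank < p.
_, _, vt = np.linalg.svd(X)
null_vec = vt[-1]

# Adding any multiple of a null-space vector changes the effects but not the fit.
beta_other = beta_min_norm + 10.0 * null_vec
same_fit = np.allclose(X @ beta_min_norm, X @ beta_other)
```

Shrinkage, dimension reduction, and variable selection are different ways of adding the constraints needed to pick one solution.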
Baseline model. More predictors than observations. Solution: fit predictors as random effects, which constrains the possible effects. What distribution is β being sampled from?
Priors and penalizations (examples)
Double exponential distribution vs. normal distribution: these represent two different assumptions about the underlying distribution of QTL effects.
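The difference between the two assumptions can be seen by comparing the densities at equal variance. In this sketch (the function names and the unit variance are illustrative choices), the double exponential puts more mass both very near zero and far out in the tails, which corresponds to assuming many negligible effects plus a few large ones. The negative log densities also show why the normal prior acts as a squared (ridge-type) penalty and the double exponential as an absolute-value (LASSO-type) penalty.

```python
import math

def normal_pdf(x, sd=1.0):
    """Density of N(0, sd^2)."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def double_exponential_pdf(x, sd=1.0):
    """Density of a double exponential (Laplace) distribution with variance sd^2."""
    b = sd / math.sqrt(2.0)  # the Laplace variance is 2*b^2
    return math.exp(-abs(x) / b) / (2.0 * b)

# Compare the two priors for an effect at zero, at one SD, and at four SDs.
for x in (0.0, 1.0, 4.0):
    print(f"x={x}: normal={normal_pdf(x):.5f}, double exp={double_exponential_pdf(x):.5f}")

# Penalty implied by each prior, up to an additive constant:
#   normal:             x^2 / (2 sd^2)        (quadratic)
#   double exponential: sqrt(2) |x| / sd      (linear in |x|)
```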
de los Campos et al. (2013) Priors
Marker effect estimates. [Figure panels: large-effect QTL simulated; many small-effect QTL simulated; models: RR-BLUP and BayesCπ.] I didn't think that example was illustrative enough, so I simulated some data. Here, a large-effect QTL is present. You can see that RR-BLUP shrinks its effect way down, whereas BayesCπ, the variable selection method, allows it an effect that is probably closer to reality.
Comparing marker effects between models
G-BLUP. Similar to traditional BLUP with pedigrees. Calculate the genomic relationship matrix. Use the genomic relationships in a linear mixed model to predict the breeding values of relatives.
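As an illustration of the "calculate genomic relationship matrix" step, here is a sketch of one widely used formulation, often attributed to VanRaden: centre the allele counts by twice the allele frequency and scale by 2Σp(1−p). The slides do not say which formulation was used, so this choice, the 0/1/2 coding, the function name, and the simulated genotypes are all assumptions.

```python
import numpy as np

def genomic_relationship(M):
    """Genomic relationship matrix from an (n lines x p markers) array of allele counts 0/1/2."""
    M = np.asarray(M, dtype=float)
    freq = M.mean(axis=0) / 2.0                 # allele frequency estimated from the sample
    keep = (freq > 0.0) & (freq < 1.0)          # drop monomorphic markers
    Z = M[:, keep] - 2.0 * freq[keep]           # centre each marker
    scale = 2.0 * np.sum(freq[keep] * (1.0 - freq[keep]))
    return Z @ Z.T / scale

rng = np.random.default_rng(3)
M = rng.binomial(2, 0.3, size=(6, 1000))        # 6 simulated lines, 1000 simulated markers
G = genomic_relationship(M)
```

G then takes the place of the pedigree-based relationship matrix in the mixed model. Because the markers are centred with sample frequencies, each row of G sums to zero.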
[Figure labels: training pop.; selection candidates.] Relationships between the TP and selection candidates are leveraged for prediction.
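Equivalency between RR-BLUP and G-BLUP. From MVN distribution properties: if $\beta \sim N(0, I\sigma^2_\beta)$, then $X\beta \sim N(0, XX^T\sigma^2_\beta)$, so the marker-effect model implies a covariance among lines proportional to $XX^T$. Only valid with the normal prior!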
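The equivalence can be checked numerically. The sketch below predicts the same candidates two ways: by estimating all p marker effects with ridge regression and summing them, and by working only with the n × n matrix XXᵀ among lines. It rests on the matrix identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹. The simulated data, the sizes, and the value of `lam` are illustrative assumptions, and XXᵀ is left unscaled; dividing it by a constant only rescales λ.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_cand, p = 40, 10, 300
lam = 25.0  # assumed ratio of residual variance to marker-effect variance

X = rng.integers(-1, 2, size=(n_train + n_cand, p)).astype(float)
X_t, X_c = X[:n_train], X[n_train:]
y = rng.normal(size=n_train)
y = y - y.mean()                  # centred phenotypes, illustrative only

# RR-BLUP route: solve a p x p system for marker effects, then sum them per candidate.
beta_hat = np.linalg.solve(X_t.T @ X_t + lam * np.eye(p), X_t.T @ y)
pred_rr = X_c @ beta_hat

# G-BLUP route: solve an n x n system using relationships among lines only.
K_tt = X_t @ X_t.T                # training x training
K_ct = X_c @ X_t.T                # candidates x training
pred_gblup = K_ct @ np.linalg.solve(K_tt + lam * np.eye(n_train), y)

max_abs_diff = float(np.max(np.abs(pred_rr - pred_gblup)))
```

When p is much larger than n, the second route is cheaper because the system solved is n × n rather than p × p.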
Predicting prediction accuracy. Daetwyler et al. (2008); Lian et al. (2014). N = training population size; h² = trait heritability; Me = effective number of loci; r² = LD between marker and QTL (see Lian ref).
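As a worked example, the sketch below evaluates the commonly quoted form of the Daetwyler et al. (2008) expectation for a quantitative trait, r = sqrt(N h² / (N h² + Me)). This form assumes markers are in complete LD with QTL, so the r² adjustment from Lian et al. (2014) is deliberately left out. The function name and the input values are hypothetical.

```python
import math

def expected_accuracy(n, h2, me):
    """Expected accuracy sqrt(N*h2 / (N*h2 + Me)), assuming complete marker-QTL LD."""
    return math.sqrt(n * h2 / (n * h2 + me))

# Hypothetical scenario: 1000 training lines, heritability 0.5, 500 effective loci.
example = expected_accuracy(1000, 0.5, 500)

# Accuracy rises with training population size and falls with the effective number of loci.
by_n = [expected_accuracy(n, 0.5, 500) for n in (100, 500, 1000, 5000)]
by_me = [expected_accuracy(1000, 0.5, me) for me in (50, 500, 5000)]
```

In the hypothetical scenario N h² equals Me, so the expected accuracy is sqrt(0.5), about 0.71.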
Factors affecting prediction accuracy: training population size; trait heritability (influence of G x E, precision of measurements); marker density; effective population size of the breeding population (i.e., genetic diversity of the breeding population); genetic relationship between training population and selection candidates; statistical model.
Effect of relationships: Predicting across populations. 1180 polymorphic markers. [Figure: PCA plot, PC 1 vs. PC 2; breeding programs: BuschAg, University of MN, NDSU 6-row; Subpop 1 and Subpop 2, with training sets and validation sets marked.] Here is a typical example: a PCA plot from marker data of barley lines from three different breeding programs.
Effect of relationships: Presence of relatives in TP. [Figure: prediction accuracy vs. mean relationship of top ten relatives; Clark et al. (2012).]
Models typically similar in accuracy. Models also equivalent in: Bernardo and Yu (2007) [maize]; Lorenzana and Bernardo (2009) [several plant species]; VanRaden et al. (2009) [Holstein]; Hayes (2009) [Holstein]. [Figure: accuracy of RR-BLUP, BayesCπ, and Bayesian LASSO.] Despite the different assumptions about genetic architecture made by the different models, and the fact that QTL effects are not of equal size and traits do differ in genetic architecture, including epistasis, the simplest model, RR-BLUP, which assumes all effects have the same variance, often does just as well as the more “realistic” models, especially in empirical studies. The reason is probably that LD within domesticated species is extensive, so several markers absorb the effect of each large-effect QTL, making it seem that many markers control a trait, as RR-BLUP assumes.
Why? Extensive LD in plant and animal breeding programs: a perfect situation for G-BLUP. Long stretches of genome that are identical by descent mean that relationships calculated with markers are good indicators of relationships at causal polymorphisms. Extensive LD also means it is hard for variable selection models to zero in on markers in tight LD with causal polymorphisms. Expect variable selection models to be superior when: individuals are unrelated; the TP is very large (millions?); marker density is very high, so that markers are in LD with causal polymorphisms.
Resources and packages
rrBLUP package: cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
Endelman, J.B., and J.-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3 2:1045.
BLR (Bayesian Linear Regression) package: http://bglr.r-forge.r-project.org/
Perez et al. 2010. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3:106-116.
References
Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48:1649-1664.
Clark, S.A., J.M. Hickey, H.D. Daetwyler and J.H.J. van der Werf. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44:.
Daetwyler, H.D., B. Villanueva and J.A. Woolliams. 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3:.
de los Campos, G., J.M. Hickey, R. Pong-Wong, H.D. Daetwyler and M.P.L. Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327-+.
Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci.
Lorenz, A.J. 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: A simulation experiment. G3-Genes Genomes Genetics 3:481-491.
Lorenzana, R.E. and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120:151-161.
Meuwissen, T.H., B.J. Hayes and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829.
Whittaker, J.C., R. Thompson and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75:249-252.