1 QTL mapping in mice, completed. Lecture 12, Statistics 246 March 2, 2004.



2 REVIEW
n backcross mice; M markers in all, with at most a handful expected to be near QTL
x_ij = genotype (0/1) of mouse i at marker j
y_i = phenotype (trait value) of mouse i
Y_i = μ + ∑_{j=1}^{M} β_j x_ij + ε_i
Which β_j ≠ 0? → Variable selection in linear models (regression)

3 Variable selection in linear models: a few generalities
There is a huge literature on the selection of variables in linear models. Why? First, if we leave out variables which should be in our model, we risk biasing the parameter estimates of those we include, and our "error" has systematic components, reducing our ability to identify those which should be there. Second, when we include variables in a model which do not need to be there, we reduce the precision of the estimates of the ones which are there. We need to include all the relevant (perhaps important is a better word) variables, and no irrelevant variables, in our model. This is an instance of the classic statistical trade-off of bias and variance: too few variables, higher bias; too many, higher variance. The literature contains a host of methods for addressing this problem, known as model or variable selection methods. Some are general, i.e. not just for linear models, e.g. the Akaike information criterion (AIC), the Bayes information criterion (BIC), or cross-validation (CV). Others are specifically designed for linear models, e.g. ridge regression, Mallows' Cp, and the lasso. We don't have the time to offer a thorough review of this topic, but we'll comment on and compare a few approaches.

4 Special features of our problem
We have a known joint distribution of the covariates (marker genotypes), and these are (approximately) Markovian.
We may have lots of missing covariate information, but it can be dealt with; that is, we can effectively use all our data.
Our aim is to identify the key players (markers), not to minimize prediction error or find the "true model".
Conclusion: Our problem is not a generic linear model selection problem, and this offers the hope that we can do better than we might in general.
END REVIEW

5 Finding QTL as model selection (the loci are part of the model)
Select a class of models: additive models; additive plus pairwise interactions; regression trees.
Compare models (γ): BIC_δ(γ) = log RSS(γ) + |γ| (δ log n / n); sequential permutation tests.
Search model space: forward selection (FS); backward elimination (BE); FS followed by BE; MCMC.
Assess performance: maximize the number of QTL found; control the false positive rate.

6 Model class: additive (linear) models
n backcross mice; M markers
x_ij = genotype (0/1) of mouse i at marker j
y_i = phenotype (trait value) of mouse i
Y_i = μ + ∑_{j=1}^{M} β_j x_ij + ε_i
The model is characterized by its variable subset γ = {j : β_j ≠ 0}. We could also include some pairwise interactions of variables (loci) in the model.
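To make the additive model concrete, here is a minimal simulation and least-squares fit. The sample size, number of markers, QTL positions, and effect sizes are all hypothetical, and for simplicity the markers are generated independently, ignoring the linkage between adjacent markers that a real backcross would show:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 250, 20                            # mice and markers (illustrative sizes)
x = rng.integers(0, 2, size=(n, M))       # 0/1 backcross genotypes, linkage ignored
beta = np.zeros(M)
beta[[3, 11]] = 0.8                       # two hypothetical QTL with equal effects
y = 1.0 + x @ beta + rng.normal(size=n)   # Y_i = mu + sum_j beta_j x_ij + eps_i

# Least-squares fit of the full additive model (intercept plus all M markers)
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With n much larger than M, the coefficient estimates for the two true QTL markers land near 0.8 and the rest near zero; the variable selection problem is deciding which of the small estimates are really zero.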

7 Another model class: regression trees
Tree model: here γ would include the splits on genotypes at specified loci (q_1, etc.) and the average QT values at the tips. Regression trees are good for capturing nonlinearities.

8 Model comparison criteria
Within a model class, we want to compare models. One obvious criterion is the residual sum of squares, RSS(γ). Exercise. Prove that for a normal linear model, with unknown regression coefficients and error variance, the maximized log-likelihood l_max is essentially -(n/2) log RSS(γ). However, we know that including more variables reduces the RSS, without necessarily improving the quality of our model: we need a penalty for model size or model complexity. There are many such, and we choose to minimize criteria of the form φ(γ) = log RSS(γ) + |γ| n⁻¹ D(n) for suitable D(n). The case D(n) = 2 is called AIC, while D(n) = log n is BIC, and D(n) proportional to log log n is due to Hannan and Quinn. Under certain assumptions, the last two are consistent estimators of a true model, while AIC tends to overfit under the same assumptions; it has different optimality properties. For finding QTL, we prefer BIC, but use D(n) = δ log n, which we call BIC_δ, with δ > 1.
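As a numerical sketch of these criteria, here are the penalized comparisons for two nested models. The RSS values and sample size are made up; the three D(n) choices are the ones named above (with δ = 2.5 for BIC_δ):

```python
import numpy as np

def criterion(rss, size, n, D):
    """phi(gamma) = log RSS(gamma) + |gamma| * D(n) / n."""
    return np.log(rss) + size * D / n

n = 250
rss = {0: 260.0, 1: 255.0}   # hypothetical RSS for a 0-QTL vs a 1-QTL model
for name, D in [("AIC", 2.0), ("BIC", np.log(n)), ("BIC_2.5", 2.5 * np.log(n))]:
    adds_locus = criterion(rss[1], 1, n, D) < criterion(rss[0], 0, n, D)
    print(name, "adds the locus:", adds_locus)
```

For this (deliberately borderline) RSS reduction, AIC accepts the extra locus while BIC and BIC_δ reject it, illustrating how the heavier penalties guard against extraneous loci.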

9 Model comparison: a Bayesian approach
We could place prior probabilities on each of the possible models, as well as on the model parameters, and use Bayes' theorem to calculate the posterior probability of the models given the data. It would make good sense (for a non-Bayesian!) to choose the model with the maximum posterior probability. Assume y | γ, β_γ, σ² ~ N(X_γ β_γ, σ²I), and use the priors β_γ ~ N(0, cσ²(X_γ′X_γ)⁻¹) and p(σ² | γ) ∝ 1/σ², while for the model γ itself, assume p(γ) ∝ (c/d)^{|γ|/2}. Now let c → ∞ (a diffuse improper prior), and integrate out β_γ and σ². The resulting posterior distribution for γ satisfies -(2/n) log p(γ | y) = log RSS(γ) + |γ| n⁻¹ log d. This is just our general criterion with D(n) = log d, which provides further support for it. However, the only real justification for it is its performance. Exercise. Derive the expression involving p(γ | y) just given.

10 Sequential permutation tests
Let's begin by seeing how permutation tests are applied in this context, an idea first put forward by Churchill & Doerge. Mouse i has phenotype y_i and marker data m_i. To test the null hypothesis of no association between phenotype and marker data, we can simply de-couple the two: create many sets of pseudo-data by choosing a random permutation π of the labels {1,…,n} and forming the pairs (y_i, m_π(i)), i = 1,…,n. To assign p-values to LOD scores, say, we would compare the value for the observed data to the set of values obtained for (say) 1,000 sets of pseudo-data obtained in this way. As always, the question of significance levels arises, as does the issue of multiple testing. At least for the genome-wide calculation of LOD scores, permutation testing gives a neat solution: simply repeat on the pseudo-data whatever multiple testing one does with the actual data (e.g. taking the genome-wide maximum LOD). For model selection, permutation tests are carried out to assess the significance associated with adding or dropping a variable. Fixed significance levels are typically used, e.g. 0.05, with no corrections for multiple testing.
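A minimal permutation test at a single marker might look like this. The effect size is hypothetical, and the test statistic here is just the difference in genotype-group means; any association statistic, e.g. a LOD score, could be substituted:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
g = rng.integers(0, 2, size=n)        # backcross genotypes (0/1) at one marker
y = 1.0 * g + rng.normal(size=n)      # phenotype with a hypothetical marker effect

def stat(y, g):
    """Absolute difference in genotype-group means."""
    return abs(y[g == 1].mean() - y[g == 0].mean())

obs = stat(y, g)
# De-couple phenotype and genotype by permuting the phenotype labels
null = np.array([stat(rng.permutation(y), g) for _ in range(1000)])
pval = (1 + np.sum(null >= obs)) / (1 + len(null))
```

For a genome-wide version, `stat` would be replaced by the maximum LOD over all markers, computed afresh for each permutation, which is exactly the "repeat the multiple testing on the pseudo-data" recipe above.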

11 Some methods for model search
Forward selection (FS): select the variable not in the model which provides the greatest reduction in the penalty function (with a rule for making no change); testing at a fixed significance level is usually adopted. Called "greedy" by computer scientists.
Backward elimination (BE): eliminate the variable whose removal makes the smallest increase in the penalty function (with a rule for making no change). Again, the use of a fixed significance level is standard.
FS followed by BE: deliberately overfit, then prune back. We need suitable parameters to implement this approach.
Markov chain Monte Carlo (MCMC): provides a way of moving around the class of all models, selecting new or eliminating existing variables according to a stochastic rule. Eventually, we can choose as the preferred model the one which minimizes the criterion function among all those models visited. A Bayesian would compute empirical posterior probabilities using the series of models visited.
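Forward selection driven by the BIC_δ criterion can be sketched as follows. The data are a toy example with two hypothetical QTL (unlinked markers, made-up effect sizes), and the stopping rule is simply "no remaining candidate lowers the criterion":

```python
import numpy as np

def bic_delta(y, X, subset, delta):
    """BIC_delta(gamma) = log RSS(gamma) + delta * |gamma| * log(n) / n."""
    n = len(y)
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    fit = Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
    return np.log(np.sum((y - fit) ** 2)) + delta * len(subset) * np.log(n) / n

def forward_select(y, X, delta=2.5):
    chosen = []
    best = bic_delta(y, X, chosen, delta)
    while len(chosen) < X.shape[1]:
        score, j = min((bic_delta(y, X, chosen + [j], delta), j)
                       for j in range(X.shape[1]) if j not in chosen)
        if score >= best:      # no candidate improves the criterion: stop
            break
        chosen.append(j)
        best = score
    return chosen

# Toy example: markers 2 and 7 carry hypothetical QTL effects
rng = np.random.default_rng(0)
n, M = 200, 10
X = rng.integers(0, 2, size=(n, M)).astype(float)
y = 1.5 * X[:, 2] + 1.5 * X[:, 7] + rng.normal(size=n)
print(forward_select(y, X))
```

Backward elimination is the mirror image (start from the full model, drop the variable whose removal most decreases the criterion), and FS-then-BE just chains the two with a deliberately generous FS stage.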

12 We like BIC_δ: why?
For a fixed number of markers, letting n → ∞, BIC_δ is consistent.
There exists a prior (on models + coefficients) for which BIC_δ is the -log posterior.
BIC_δ is essentially equivalent to the use of a threshold on the conditional LOD score.
It performs well.

13 Choice of δ
Smaller δ: include more loci, higher false positive rate. Larger δ: include fewer loci, lower false positive rate.
Let L = the 95% genome-wide LOD threshold (comparing single-QTL models to the null model). Choose δ = 2L / log₁₀ n. With this choice of δ, in the absence of any QTL, we'll include at least one extraneous locus 5% of the time.
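The calibration δ = 2L / log₁₀ n comes from equating the per-locus BIC_δ penalty with a LOD threshold: under this criterion a locus enters exactly when its conditional LOD exceeds δ log₁₀(n)/2. A quick computation, using a hypothetical genome-wide threshold L = 2.5:

```python
import math

def delta_from_lod(L, n):
    """delta = 2 * L / log10(n): a locus enters iff its conditional LOD exceeds L."""
    return 2.0 * L / math.log10(n)

# e.g. a (hypothetical) 95% genome-wide LOD threshold of 2.5 with n = 250 mice
print(round(delta_from_lod(2.5, 250), 3))   # prints 2.085
```

So for sample sizes in the low hundreds, this rule gives δ a little above 2, comfortably in the δ > 1 range required on slide 8.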

14 Simulations comparing strategies
Backcross with n = 250; no crossover interference; 9 chromosomes, each 100 cM; markers at 10 cM spacing; complete genotype data.
7 QTL with equal effects: one pair in coupling (same sign), one pair in repulsion (opposite sign), three unlinked QTL.
Heritability 50% (environmental variance 1.0). 2,000 simulated replicates.

15 Methods compared
ANOVA at marker loci
Composite interval mapping (CIM)
Forward selection with permutation tests
Forward selection with BIC_δ
Backward elimination with BIC_δ
FS followed by BE with BIC_δ
MCMC with BIC_δ

16 Methods, some detail
ANOVA at marker loci: essentially two-sample t-tests, with groups defined by marker genotype; used a simulated genome-wide threshold.
Composite interval mapping (CIM): described on the next slide.
Forward selection with permutation tests (1,000 permutations, α = 0.05).
Forward selection to 25 markers, with BIC_δ.
Backward elimination, starting from the full model, with BIC_δ.
FS to 25 markers, followed by BE, with BIC_δ.
MCMC for 1,000 moves, with BIC_δ.
More details and results for other sample sizes can be found in Broman & Speed, JRSSB, 2002.

17 Composite interval mapping
Here an initial subset S of markers is chosen to control for background variation. Then, as in conventional interval mapping, a genome-wide scan is carried out, at each locus testing the null hypothesis that a QTL at that locus need not be added to the existing model against the alternative that it should be added. Markers in S in close proximity to the current locus (e.g. within 10 cM) are excluded. As in standard interval mapping, LOD scores are computed and the results plotted. Regions where the LOD exceeds a suitable genome-wide threshold C are thought to contain QTL; here we simulated to get C. A key question is: how many markers should there be in the initial subset S, and how should they be chosen? There is no sharp answer to this question. We used subsets of size 3, 5, 7, 9, and 11.
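A toy version of this scan, evaluated only at the markers themselves, might look like the following. The genotypes are simulated as unlinked and the marker positions, covariate subset S, and QTL effect are all hypothetical, so this is only a sketch of the idea, not a real CIM implementation (which scans between markers as well):

```python
import numpy as np

def cim_scan(y, X, pos, S, window=10.0):
    """For each marker j, the LOD for adding marker j to the covariate set S,
    after dropping covariates within `window` cM of j (and j itself)."""
    n, M = X.shape
    lods = np.zeros(M)
    for j in range(M):
        keep = [s for s in S if s != j and abs(pos[s] - pos[j]) > window]
        X0 = np.column_stack([np.ones(n)] + [X[:, s] for s in keep])
        X1 = np.column_stack([X0, X[:, j]])
        rss0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
        rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
        lods[j] = (n / 2.0) * np.log10(rss0 / rss1)
    return lods

rng = np.random.default_rng(2)
n, M = 200, 10
pos = np.arange(M) * 10.0                 # markers every 10 cM (one chromosome)
X = rng.integers(0, 2, size=(n, M)).astype(float)
y = 1.0 * X[:, 4] + rng.normal(size=n)    # hypothetical QTL at marker 4
print(cim_scan(y, X, pos, S=[4, 8]).argmax())
```

Note how the proximity window matters: when the scan reaches the true QTL, that marker is removed from the covariate set, so its effect is free to show up in the LOD rather than being absorbed by the background model.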

Slides 18-23: simulation results (figures not reproduced in the transcript).

24 Conclusions from the simulation
The BIC_δ methods performed pretty well, best with MCMC, though only slightly better than FS. FS with permutation tests gave fewer extraneous loci than FS with BIC_δ. The CIM variants didn't do so well, except when the number of markers in the covariate subset was 7, though their rate of extraneous loci was quite low. ANOVA was quite poor, in relative terms.

25 Selective genotyping
The idea here is to genotype only the animals in the high and low extremes of the phenotypic distribution, but to include the phenotypes (without genotypes) of the remainder in the analysis. This involves some loss of power, though not much if the percentage not genotyped isn't large. Two issues arise: why would we include phenotypes of animals not genotyped, and how would we do that? To make the point, we compare:
Strategy A: analyse all genotypes and all phenotypes.
Strategy B: analyse all phenotypes, but include genotypes only for individuals in the extremes of the phenotypic distribution.
Strategy C: analyse genotypes and phenotypes only for individuals in the extremes of the phenotypic distribution.
Exercise. Write down the likelihoods in all three cases.
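As a sketch of the exercise, consider a single marker in a backcross, with a normal model having genotype means μ₀, μ₁ and common standard deviation σ; write E for the set of phenotypically extreme (genotyped) mice, φ for the standard normal density, and use the backcross genotype probabilities ½/½ for the ungenotyped mice. Then, up to constants, the three likelihoods are:

```latex
% Strategy A: all genotypes g_i and all phenotypes y_i
L_A(\mu_0,\mu_1,\sigma)
  = \prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu_{g_i}}{\sigma}\right)

% Strategy B: ungenotyped mice contribute a genotype mixture
L_B(\mu_0,\mu_1,\sigma)
  = \prod_{i \in E} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu_{g_i}}{\sigma}\right)
    \prod_{i \notin E}
      \left[\tfrac{1}{2}\frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu_0}{\sigma}\right)
          + \tfrac{1}{2}\frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu_1}{\sigma}\right)\right]

% Strategy C: extreme mice only, ignoring how they were selected
L_C(\mu_0,\mu_1,\sigma)
  = \prod_{i \in E} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i-\mu_{g_i}}{\sigma}\right)
```

Strategy C uses the same terms as A but only over E; because it ignores the selection of E by phenotype, its null distribution is distorted, which is why its test statistics need recalibration.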

26 Selective genotyping in a BC

27 Some simulation results
The slides that follow are for a hypothetical F2 intercross with 150 mice and a set of 150 markers quite similar to those of our NOD B6 cross, comparing 3 analysis strategies:
A: analyse both genotypes and phenotypes for all mice;
B: analyse phenotypes for all mice, but genotypes only for the phenotypically extreme mice, correctly incorporating the other genotypes as missing;
C: analyse genotypes and phenotypes for the phenotypically extreme mice only, making no adjustment for the selection.

28 Type I error for given genome-wide LOD thresholds. Conclusion: strategy C needs a different genome-wide threshold.

29 Power for a given QTL effect. Solid: strategy A; dashed: strategy B, with 50% genotyped.

30 Power for a given proportion not genotyped. Solid: strategy A; dashed: strategy B. The x-axis is the proportion not genotyped.

31 Summary
QTL mapping is a model selection problem.
Key issue: the comparison of models.
Large-scale simulations are important.
More refined procedures do not necessarily give improved results.
BIC_δ with forward selection followed by backward elimination works quite well.
Real savings can be made by genotyping only the phenotypic extremes.

32 Acknowledgements
Karl Broman, Johns Hopkins; Nusrat Rabbee, UCB.