Model choice. Gil McVean, Department of Statistics. Tuesday 17th February 2007.

Similar presentations
Pattern Recognition and Machine Learning

Computational Statistics. Basic ideas: predict values that are hard to measure directly, by using co-variables (other properties from the same measurement).
© Department of Statistics 2012 STATS 330 Lecture 32: Slide 1 Stats 330: Lecture 32.
Objectives 10.1 Simple linear regression
Log-linear and logistic models Generalised linear model ANOVA revisited Log-linear model: Poisson distribution logistic model: Binomial distribution Deviances.
Uncertainty and confidence intervals Statistical estimation methods, Finse Friday , 12.45–14.05 Andreas Lindén.
Chapter 8 – Logistic Regression
6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.
Model generalization Test error Bias, variance and complexity
Logistic Regression Example: Horseshoe Crab Data
Model Assessment and Selection
Model Assessment, Selection and Averaging
Model assessment and cross-validation - overview
Models with Discrete Dependent Variables
Bayesian inference Gil McVean, Department of Statistics Monday 17 th November 2008.
Maximum likelihood estimates What are they and why do we care? Relationship to AIC and other model selection criteria.
Linear Methods for Regression Dept. Computer Science & Engineering, Shanghai Jiao Tong University.
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
Log-linear and logistic models
Nemours Biomedical Research Statistics April 23, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
Chapter 11 Multiple Regression.
Lecture 11 Multivariate Regression A Case Study. Other topics: Multicollinearity  Assuming that all the regression assumptions hold how good are our.
Linear and generalised linear models
Notes on Logistic Regression STAT 4330/8330. Introduction Previously, you learned about odds ratios (OR’s). We now transition and begin discussion of.
Genetic Association and Generalised Linear Models Gil McVean, WTCHG Weds 2 nd November 2011.
Chapter 15: Model Building
Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation.
Classification and Prediction: Regression Analysis
Logistic Regression with “Grouped” Data Lobster Survival by Size in a Tethering Experiment Source: E.B. Wilkinson, J.H. Grabowski, G.D. Sherwood, P.O.
Logistic regression for binary response variables.
Review of Lecture Two Linear Regression Normal Equation
Logistic Regression and Generalized Linear Models:
Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.
Chapter 13: Inference in Regression
Simple Linear Regression
PATTERN RECOGNITION AND MACHINE LEARNING
Lecture 6 Generalized Linear Models Olivier MISSA, Advanced Research Skills.
Lecture 12 Model Building BMTRY 701 Biostatistical Methods II.
© Department of Statistics 2012 STATS 330 Lecture 26: Slide 1 Stats 330: Lecture 26.
Model Inference and Averaging
Statistical modelling Gil McVean, Department of Statistics Tuesday 24 th Jan 2012.
Lecture 3: Inference in Simple Linear Regression BMTRY 701 Biostatistical Methods II.
Correlation and Regression SCATTER DIAGRAM The simplest method to assess relationship between two quantitative variables is to draw a scatter diagram.
Repeated Measures  The term repeated measures refers to data sets with multiple measurements of a response variable on the same experimental unit or subject.
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.
Logistic regression. Analysis of proportion data We know how many times an event occurred, and how many times did not occur. We want to know if these.
November 5, 2008 Logistic and Poisson Regression: Modeling Binary and Count Data LISA Short Course Series Mark Seiss, Dept. of Statistics.
© Department of Statistics 2012 STATS 330 Lecture 20: Slide 1 Stats 330: Lecture 20.
© Department of Statistics 2012 STATS 330 Lecture 22: Slide 1 Stats 330: Lecture 22.
Université d’Ottawa, Biostatistiques appliquées (Applied Biostatistics), © Antoine Morin and Scott Findlay: Logistic regression.
ALISON BOWLING MAXIMUM LIKELIHOOD. GENERAL LINEAR MODEL.
Tutorial I: Missing Value Analysis
Review of statistical modeling and probability theory Alan Moses ML4bio.
Statistical Methods. 2 Concepts and Notations Sample unit – the basic landscape unit at which we wish to establish the presence/absence of the species.
Logistic Regression and Odds Ratios Psych DeShon.
R Programming/ Binomial Models Shinichiro Suna. Binomial Models In binomial model, we have one outcome which is binary and a set of explanatory variables.
2011 Data Mining Industrial & Information Systems Engineering Pilsung Kang Industrial & Information Systems Engineering Seoul National University of Science.
Model Comparison.
LOGISTIC REGRESSION. Purpose  Logistical regression is regularly used when there are only two categories of the dependent variable and there is a mixture.
Stats Methods at IC Lecture 3: Regression.
Transforming the data Modified from:
BINARY LOGISTIC REGRESSION
Logistic regression.
Notes on Logistic Regression
Statistics in MSmcDESPOT
Linear Model Selection and regularization
Parametric Methods Berlin Chen, 2005 References:
Logistic Regression with “Grouped” Data
Presentation transcript:

1 Model choice Gil McVean, Department of Statistics Tuesday 17th February 2007

2 Questions to ask…
– What is a generalised linear model?
– What does it mean to say that a variable is a ‘significant’ predictor?
– Is the best model just the one with all the ‘significant’ predictors?
– How do I choose between competing models as explanations for observed data?
– How can I effectively explore model space when there are many possible explanatory variables?

3 What is a linear model? In a linear model, the expectation of the response variable is defined as a linear combination of explanatory variables, e.g.

Y = β0 + β1x1 + β2x2 + β12x1x2 + ε,   ε ~ N(0, σ²)

where Y is the response variable, β0 the intercept, β1 and β2 the linear relationships with the explanatory variables, β12 an interaction term, and ε the Gaussian error. Explanatory variables can include any function of the original data. But the link between E(Y) and X (or some function of X) is ALWAYS linear and the error is ALWAYS Gaussian.

4 What is a GLM? There are many settings where the error is non-Gaussian and/or the link between E(Y) and X is not necessarily linear:
– Discrete data (e.g. counts in multinomial or Poisson experiments)
– Categorical data (e.g. disease status)
– Highly-skewed data (e.g. income, ratios)
Generalised linear models keep the notion of linearity, but enable the use of non-Gaussian error models:

g(E(Y)) = Xβ

g is called the link function; in linear models, the link function is the identity. The response variable can be drawn from any distribution of interest (the distribution function); in linear models this is Gaussian.

5 Example: logistic regression When only two types of outcome are possible (e.g. disease/not-disease) we can model counts by the binomial. If we want to perform inference about the factors that influence the probability of ‘success’ it is usual to use the logistic model

log[p/(1 − p)] = β0 + β1x,   i.e.   p = exp(β0 + β1x) / [1 + exp(β0 + β1x)]

The link function here is the logit.
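As a quick illustration (a Python sketch, not part of the slides; the coefficient values are purely illustrative), the logit link and its inverse:

```python
import math

def logit(p):
    """Link function: map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# With illustrative coefficients beta0 = -4 and beta1 = 2, the success
# probabilities for x = 0, 1, 2:
probs = [inv_logit(-4.0 + 2.0 * x) for x in (0, 1, 2)]
```

Note that a linear predictor of 0 maps back to a probability of exactly 0.5.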

6 Fitting the model to data In linear modelling we can use the beautiful compactness of linear algebra to find MLEs and estimates of the variance for parameters. Consider an n by k+1 data matrix, X, where n is the number of observations and k is the number of explanatory variables; the first column is ‘1’ for the intercept term. The MLEs for the coefficients (β) can be estimated using

β̂ = (XᵀX)⁻¹XᵀY

In GLMs, there is usually no such compact analytical expression for the MLEs, so numerical methods are used to maximise the likelihood.
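For the special case of an intercept plus a single explanatory variable, the matrix expression reduces to a familiar closed form; a minimal pure-Python sketch (illustrative, with hypothetical data):

```python
def ols_fit(x, y):
    """Solve the normal equations (X'X)^{-1} X'y for a model with an
    intercept and one explanatory variable."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Exact linear data y = 1 + 2x recovers the coefficients exactly:
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])  # -> (1.0, 2.0)
```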

7 Example: testing for genotype association In a cohort study, we observe the number of individuals in a population that get a particular disease. We want to ask whether a particular genotype is associated with increased risk. The simplest test is one in which we consider a single coefficient for the genotypic value.

Genotype                  AA        Aa         aa
Genotypic value           0         1          2
Frequency in population   (1 − f)²  2f(1 − f)  f²
Probability of disease    p0        p1         p2

Here β0 = −4 and β1 = 2.

8 Cont. Suppose in a given study we observe the following counts (by genotypic value):

Genotype                 0    1    2
Counts with disease
Counts without disease

We can fit a GLM using the logit link function and binomial probabilities. We have genotype data stored in the vector gt and disease status in the vector status.

9 Interpreting results

Call:
glm(formula = status ~ gt, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               <2e-16 ***
gt                                        <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:       on 1999 degrees of freedom
Residual deviance:       on 1998 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 6

Reading the output: Estimate gives the MLE for the coefficients, Std. Error the standard error for the estimates, z value is Estimate/std. error, and Pr(>|z|) the P-value from the normal distribution. The deviance residuals measure the contribution of individual observations to the overall goodness of fit (for the MLE model). The null deviance measures the goodness of fit of the null model (compared to the saturated model), and the residual deviance that of the fitted model. AIC is a penalised likelihood used in model choice, and the final line gives the number of Fisher scoring iterations used to find the MLE.

10 Extending the analysis In this simple example I had a single genetic locus about which I was asking questions. In general, you may have data on 1000s (or millions) of loci, as well as additional covariates such as age, sex, weight, height, smoking habit, … We cannot jointly estimate the parameters for all factors (and even if we could we wouldn’t be able to interpret them). We would like to build a relatively simple model that captures the key features and relationships.

11 Example Suppose we wish to be able to predict whether someone is likely to develop diabetes in later life. We have clinical records on 200 individuals listing whether they became diabetic. We also have information on the absence/presence of specific mutations at 20 genes that have been hypothesised as being associated with late-onset diabetes. How should we build a model?

12 What do we want the model to do? A good model will
– Allow us to identify key features of the underlying process
– Enable estimation of key parameters
– Be sufficiently simple that it remains coherent
– Provide predictive power for future observations
A bad model will
– Be so complex as to be impossible to understand
– Be fitted too closely to the observed data
– Have poor predictive power

13 Variable selection We wish to identify, from among the many variables measured, those that are relevant to the outcome. We wish our end point to be a model of sufficient complexity to capture the essential details, but no more (Occam’s razor). We wish to be able to work in situations where there may be more variables than data points.

14 The problem of over-fitting Models fitted too closely to past events will be poor at predicting future events Adding additional parameters to a model will only increase the likelihood (it can never decrease it) The most complex model is the one that maximises the probability of observing the data on which the model is fitted

15 Example In a simulated data set the first 10 genes have ‘real’ effects of varying magnitude and the second 10 genes have no effect.

Coefficients (estimates and standard errors omitted; significance codes shown):
(Intercept)
data$gene1
data$gene2   **
data$gene3   *
data$gene4
data$gene5
data$gene6
data$gene7
data$gene8
data$gene9
data$gene10  **
data$gene11 through data$gene20 (none significant)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only three of the ten genes with real effects reach significance.

16 Model performance Which explanatory variables are explanatory? [Figure: individuals ranked by predicted value, distinguishing those with disease from those without]

17 Why can’t we just use likelihood ratios? A natural approach would be to perform parameter estimation under the complete model and then remove non-significant variables. More generally, we could use the asymptotic theory of likelihood ratio tests to ask whether we can reject simple models in favour of more complex ones. There are two problems:
– Likelihood ratio tests only deal with nested models
– It turns out that this approach tends to give too much weight to the more complex model
[Figure: likelihood under the more complex model M1, evaluated at its MLE]

18 Model scoring We need to find a way of scoring models One approach is to include a penalty for the number of parameters fitted. This gives penalised likelihood schemes Another is to estimate the prediction potential through cross-validation

19 Penalised likelihood A simple approach to guard against over-fitting is to add a penalty to the log likelihood whenever an additional parameter is added. All penalties combine the log likelihood with a term relating to the sample size, n, and the number of fitted parameters, p. The two most popular penalties are

AIC = −2 log L + 2p   (Akaike information criterion)
BIC = −2 log L + p log n   (Bayesian (Schwarz) information criterion)
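These criteria are easy to compute from a fitted model's log likelihood; a small Python sketch (the function names are my own):

```python
import math

def aic(loglik, p):
    """Akaike information criterion: -2 log L + 2p (lower is better)."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Bayesian (Schwarz) information criterion: -2 log L + p log n."""
    return -2.0 * loglik + p * math.log(n)

# With n = 2000 observations, BIC charges log(2000) ~ 7.6 per extra
# parameter, much more harshly than AIC's 2:
delta_aic = aic(-100.0, 3) - aic(-100.0, 2)          # -> 2.0
delta_bic = bic(-100.0, 3, 2000) - bic(-100.0, 2, 2000)  # -> log(2000)
```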

20 Example Which is the best model: gene1, gene2 or gene1+gene2?

Model                  Log likelihood   #Parameters   AIC   BIC
Constant                                1
Constant+gene1                          2
Constant+gene2                          2
Constant+gene1+gene2                    3

21 A note on likelihoods Note that the log likelihood in a linear model (i.e. with a normal distribution) with the coefficients estimated by maximum likelihood can be written as a function of the sample size and the sum of the squares of the residuals (RSS):

log L = −(n/2) log(RSS/n) + constant

In generalised linear models the log likelihood at the fitted coefficients is of the form

log L = Σi log f(yi | xi, β̂)

22 Cross-validation The idea is to use a subset of the data to train the model and assess its prediction performance on the remaining data. Predictive power is assessed through some function of the residuals (the difference between observed and predicted values). The approach throws away information, but does have useful robustness properties. Approaches include
– Leave-one-out cross-validation (leave out each observation in turn and predict its value from the remaining)
– k-fold cross-validation (divide the data into k sets and, for each subset, use the k−1 other subsets for parameter estimation).
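The k-fold bookkeeping can be sketched as follows (a minimal Python illustration; in practice the observations would usually be shuffled before splitting):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(n, k):
    """Yield (train, test) index pairs: each fold serves once as the
    test set while the other k-1 folds are used for fitting."""
    folds = kfold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(cross_validate(10, 5))
```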

23 Choosing the score We need a way of scoring the predictive value of the model One possibility is to compute some function of the residuals (the difference between predicted and observed values) –Mean absolute error –Mean square error (Brier’s score) –Root mean square error An alternative is to compute the (log) likelihood for the novel data. –Poor prediction will mean that future observations are very unlikely

24 Model search Having defined ways of scoring models, we need to efficiently search through model space. In the diabetes example there are 2²⁰ (over a million) possible models excluding any interaction terms. There are no generic methods of model searching (beyond exhaustive enumeration) that guarantee finding the optimum. However, stepwise methods generally perform well.

25 Forward selection
– Start with the simplest model – just an intercept
– Find the single ‘best’ variable by searching over all of the possible variables
– Add in the next best
– Repeat the procedure until the score begins to decrease
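The procedure can be sketched generically; here `score` is any model score in which larger is better (e.g. minus AIC), and the toy gain values are hypothetical:

```python
def forward_select(variables, score):
    """Greedy forward selection: start from the empty model and add the
    single best variable until no addition improves the score."""
    selected = []
    best = score(selected)
    while True:
        candidates = [v for v in variables if v not in selected]
        if not candidates:
            break
        trial = max(candidates, key=lambda v: score(selected + [v]))
        if score(selected + [trial]) <= best:
            break  # the score would begin to decrease: stop
        selected.append(trial)
        best = score(selected)
    return selected

# Toy score: 'a' and 'b' improve the model, 'c' and 'd' hurt it.
gains = {'a': 3.0, 'b': 1.0, 'c': -2.0, 'd': -1.0}
model = forward_select(list(gains), lambda vs: sum(gains[v] for v in vs))
# -> ['a', 'b']
```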

26 Backward elimination
– Start with the full model
– Find the single ‘worst’ variable (i.e. the one whose removal leads to the greatest increase in model score)
– Remove the next worst
– Repeat the elimination procedure until the score begins to decrease

27 Issues Often it is useful to combine the procedures: starting with the addition step, then attempting to remove each of the current variables in turn to see if any have become redundant. Note that both algorithms are deterministic. In general, hill-climbing methods that include some stochasticity (e.g. simulated annealing) will find better solutions because they allow a broader range of search paths.

28 Example Coefficients retained by the stepwise search (estimates and standard errors omitted; significance codes shown):
(Intercept)
data$gene
data$gene  ***
data$gene  *
data$gene
data$gene
data$gene  **
data$gene  *
data$gene
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
One of the retained genes is not a real effect; another is not a ‘significant’ effect.

29 Example: 10-fold cross validation For the complete model, with all 20 genes included, the mean absolute error is … (the root mean square error is 0.473). By comparison the null model (intercept alone) has a MAE of … (RMSE of …). The model selected by the stepwise method using AIC has a MAE of … (RMSE of 0.456).

30 Ridge regression and lasso In penalised likelihood, there is a fixed penalty for the number of parameters. However, we might consider alternative penalties, for example some function of the magnitude of the regression coefficients:

ridge regression: penalty = λ Σj βj²
lasso: penalty = λ Σj |βj|

The parameter λ determines the strength of the penalty. Note that explanatory variables must be scaled to have zero mean and unit variance.
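For a single centred predictor, the ridge estimate has a simple closed form, Sxy / (Sxx + λ), which shrinks the least-squares slope toward zero; an illustrative pure-Python sketch with hypothetical data:

```python
def ridge_slope(x, y, lam):
    """Ridge estimate for one centred predictor: Sxy / (Sxx + lambda).
    lam = 0 recovers ordinary least squares."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / (sxx + lam)

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b_ols = ridge_slope(x, y, 0.0)    # -> 2.0 (no shrinkage)
b_ridge = ridge_slope(x, y, 5.0)  # shrunk toward zero
```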

31 Example [Figure: coefficient paths for ridge regression and the lasso as the penalty is varied] Ridge regression: note that coefficients can change sign. Lasso: note that there are sharp transitions from zero to non-zero parameter estimates.

32 A look forward to Bayesian methods In Bayesian statistics, model choice is typically driven by the use of Bayes factors. Whereas likelihood ratios compare the probability of observing the data at specified parameter values, Bayes factors compare the relative posterior probabilities of two models to their prior ratios. Unlike likelihood ratios, Bayes factors can be used to compare any models – i.e. they need not be nested. The log Bayes factor is often referred to as the weight of evidence. It is worth noting that model selection by BIC is an approximation to model selection by Bayes factors.
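In symbols (a standard identity, not written out on the slide): for data D and models M1 and M2,

```latex
\mathrm{BF}_{12}
  = \frac{P(M_1 \mid D)\,/\,P(M_2 \mid D)}{P(M_1)\,/\,P(M_2)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)},
\qquad
2\log \mathrm{BF}_{12} \;\approx\; \mathrm{BIC}_2 - \mathrm{BIC}_1 .
```

So choosing the model with the smaller BIC approximates choosing the model favoured by the Bayes factor.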