Linear Modelling III
Richard Mott, Wellcome Trust Centre for Human Genetics

Synopsis
– Comparing non-nested models
– Building models automatically
– Mixed models

Comparing Non-nested Models
There is no equivalent of the partial F test for comparing non-nested models. The main issue is how to compensate for the fact that a model with more parameters will tend to fit the data better (in the sense of explaining more variance) than a model with fewer parameters. But a model with many parameters also tends to be a poorer predictor of new observations, because of over-fitting. Several criteria have therefore been proposed for model comparison.
AIC (Akaike Information Criterion)
AIC = 2p − 2 log L
where L is the maximized value of the likelihood function for the estimated model and p is the number of parameters. For linear models this is equivalent, up to an additive constant, to
AIC = n log(RSS/n) + 2p
where n is the number of observations and RSS is the residual sum of squares. Among models fitted to the same n observations, choose the model which minimises the AIC.

BIC (Bayesian Information Criterion)
BIC = p log(n) − 2 log L
The BIC penalises p more strongly than the AIC, i.e. it prefers models with fewer parameters. In R:
f <- lm(formula, data)
aic <- AIC(f)
bic <- AIC(f, k = log(nrow(data)))
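As a concrete illustration, a minimal sketch of comparing two non-nested linear models by these criteria; the data frame d and the variables y, x1 and x2 are hypothetical names, not from the slides:

# Comparing non-nested models by AIC/BIC; 'd', 'y', 'x1', 'x2' are hypothetical.
m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x2, data = d)   # m2 is not nested within m1
AIC(m1, m2)                  # smaller AIC preferred
BIC(m1, m2)                  # BIC applies the log(n) penalty directly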

Building Models Automatically
Example: we wish to find a set of SNPs that jointly explain variation in a quantitative trait.
– There will (usually) be far more SNPs s than data points n
– There is a vast number, 2^s, of possible models
– No model can contain more than n−1 parameters (model saturation)
– Forward selection: start with a small model and augment it step by step, so long as the improvement in fit satisfies a criterion (e.g. AIC, BIC or a partial F test). At each step, add the variable which maximises the improvement in fit.
– Backward elimination: start with a very large model and remove terms step by step. At each step, delete the variable that minimises the decrease in fit.
– Forward-backward: at each step, either add or delete a term, depending on which optimises the criterion. In R: stepAIC() in the MASS package, or step() in base R (see the sketch below).
– Model averaging: rather than trying to find a single best model, integrate over many plausible models, e.g. by bootstrapping or Bayesian model averaging.
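A minimal sketch of forward-backward selection using step() from base R (stepAIC() in MASS behaves similarly); the data frame d and its columns are hypothetical names:

# Stepwise selection by AIC; 'd' and its columns are hypothetical.
null <- lm(y ~ 1, data = d)                    # smallest model considered
full <- lm(y ~ ., data = d)                    # largest model considered
best <- step(null, scope = formula(full), direction = "both")
# setting k = log(nrow(d)) in step() would use the BIC penalty instead
summary(best)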

Random Effects and Mixed Models
So far our models have had fixed effects:
– each parameter can take any value independently of the others
– each parameter estimate uses up a degree of freedom
– models with large numbers of parameters are prone to saturation and over-fitting, have poor predictive properties, and can be numerically unstable and difficult to fit
In some cases we can instead treat parameters as being sampled from a distribution (random effects):
– we only estimate the parameters needed to specify that distribution
– this can result in a more robust model

Example of a Mixed Model
Testing genetic association across a large number of families:
– y_i = trait value of the i'th individual
– g_i = genotype of the individual at the SNP of interest
– f_i = effect of the individual's family (a factor)
– e_i = error, with variance σ²
– y = μ + βg + f + e
If we treat the family effects f as fixed, then we must estimate a large number of parameters. It is better to think of these effects as having a distribution N(0, σ_f²), where the variance σ_f² must be estimated:
– individuals from the same family have the same f and are correlated: cov = σ_f²
– individuals from different families are uncorrelated
– the genotype parameter β is still treated as a fixed effect
A model with both fixed and random effects is a mixed model.
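A small simulation sketch of this model, fitted with lme4; all parameter values and names here are illustrative assumptions:

# Simulate family-structured trait data and fit the mixed model (a sketch).
library(lme4)
set.seed(1)
n.fam <- 50; n.per <- 5
family <- factor(rep(1:n.fam, each = n.per))
g <- rbinom(n.fam * n.per, 2, 0.3)             # SNP genotypes coded 0/1/2
f <- rnorm(n.fam, sd = 1)[family]              # family effects ~ N(0, sigma.f^2)
y <- 2 + 0.5 * g + f + rnorm(n.fam * n.per)    # mu + beta*g + f + e
summary(lmer(y ~ g + (1 | family)))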

Mixed Models
y = μ + βg + f + e
Fixed effects model:
– E(y) = μ + βg + f
– Var(y) = Iσ², where I is the identity matrix
Mixed model:
– E(y) = μ + βg
– Var(y) = Iσ² + Fσ_f², where F is a matrix with F_ij = 1 if individuals i and j are in the same family, and 0 otherwise
We need to estimate both σ² and σ_f².
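A sketch of how the F matrix can be built from family labels in R; the labels and variance values are hypothetical:

# Building F from family labels; labels and variances are hypothetical.
fam <- c(1, 1, 2, 2, 2)
F.mat <- outer(fam, fam, "==") * 1       # F[i,j] = 1 if same family, else 0
sigma2 <- 1; sigma.f2 <- 0.5             # assumed variance components
V <- diag(length(fam)) * sigma2 + F.mat * sigma.f2   # Var(y) under the mixed model
V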

Generalised Least Squares
y = Xb + e, with Var(y) = V (a symmetric covariance matrix)
– V = Iσ² (uncorrelated errors)
– V = Iσ² + Fσ_f² (a single grouping random effect)
– V = Iσ² + Gσ_g² (G = genotype identity by state)
The GLS solution, if V is known, is
b̂ = (Xᵀ V⁻¹ X)⁻¹ Xᵀ V⁻¹ y
It is unbiased, efficient (minimum variance), consistent (it tends to the true value in large samples) and asymptotically normally distributed.
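A direct R sketch of the GLS estimator for a known V; the inputs X, y and V are hypothetical:

# The GLS estimator (X'V^-1 X)^-1 X'V^-1 y; inputs are hypothetical.
gls <- function(X, y, V) {
  Vi <- solve(V)
  solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)
}

In practice a Cholesky factorisation of V is numerically preferable to solve(V), but the direct form above mirrors the formula.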

Multivariate Normal Distribution
The joint distribution of a vector of correlated observations; another way of thinking about the data. Estimation of the parameters in a mixed model is a special case of the likelihood analysis of an MVN:
y ~ MVN(μ, V)
– μ is the vector of expected values
– V is the variance-covariance matrix
– if V = Iσ² then the elements of y are uncorrelated, and are equivalent to n independent Normal variables
– the probability density/likelihood is
L(μ, V; y) = (2π)^(−n/2) |V|^(−1/2) exp(−½ (y − μ)ᵀ V⁻¹ (y − μ))
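A sketch of the MVN log-likelihood computed directly from this density; y, mu and V are hypothetical inputs:

# MVN log-likelihood; inputs y, mu, V are hypothetical.
mvn.loglik <- function(y, mu, V) {
  n  <- length(y)
  r  <- y - mu
  ld <- as.numeric(determinant(V, logarithm = TRUE)$modulus)  # log|V|
  -0.5 * (n * log(2 * pi) + ld + drop(t(r) %*% solve(V) %*% r))
}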

Henderson's Mixed Model Equations
The general linear mixed model is
y = Xβ + Zu + e
where β are fixed effects, u are random effects, and X, Z are design matrices. Writing Var(e) = R and Var(u) = G, the estimates β̂ and predictions û solve the mixed model equations:
[ XᵀR⁻¹X    XᵀR⁻¹Z       ] [ β̂ ]   [ XᵀR⁻¹y ]
[ ZᵀR⁻¹X    ZᵀR⁻¹Z + G⁻¹ ] [ û ] = [ ZᵀR⁻¹y ]
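A sketch that builds and solves these equations directly in R; X, Z, y, R and G are hypothetical inputs, with R and G assumed known:

# Solving Henderson's mixed model equations; inputs are hypothetical.
henderson <- function(X, Z, y, R, G) {
  Ri  <- solve(R)
  lhs <- rbind(cbind(t(X) %*% Ri %*% X, t(X) %*% Ri %*% Z),
               cbind(t(Z) %*% Ri %*% X, t(Z) %*% Ri %*% Z + solve(G)))
  rhs <- rbind(t(X) %*% Ri %*% y, t(Z) %*% Ri %*% y)
  sol <- solve(lhs, rhs)
  list(beta = sol[1:ncol(X), ], u = sol[-(1:ncol(X)), ])  # u-hat are the BLUPs
}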

Mixed Models
– Fixed effects models are special cases of mixed models
– Mixed models are sometimes more powerful (as they have fewer parameters)
– But check that the assumption that the random effects are sampled from a Normal distribution is reasonable
– Estimated parameters and their statistical significance can differ between fixed and mixed models
– Shrinkage: random effect estimates from a mixed model often have smaller magnitudes than the corresponding estimates from a fixed effects model

Mixed Models in R
Several R packages are available, e.g.
– lme4: a general-purpose mixed models package
– emma: for genetic association in structured populations
Formulas in lme4, e.g. a test for association between genotype and trait:
H0: E(y) = μ vs H1: E(y) = μ + βg, with Var(y) = Iσ² + Fσ_f² in both cases
h0 <- lmer(y ~ 1 + (1|family), data=data)
h1 <- lmer(y ~ genotype + (1|family), data=data)
anova(h0, h1)

Formulas in lmer()
y ~ fixed.effects + (1|Group1) + (1|Group2), etc.
– random intercept models
y ~ fixed.effects + (1 + x|Group1)
– random slope (and intercept) model
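For instance, a random-slope fit using the sleepstudy data set that ships with lme4:

# A random-slope model on lme4's built-in sleepstudy data.
library(lme4)
fm <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
summary(fm)   # per-subject random intercepts and slopes for Days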

Example: HDL Data
> library(lme4)
> b <- read.delim("Biochemistry.txt")
> cc <- complete.cases(b$Biochem.HDL)
> f0 <- lm(Biochem.HDL ~ Family, data=b, subset=cc)
> f1 <- lmer(Biochem.HDL ~ (1|Family), data=b, subset=cc)
> anova(f0)
Analysis of Variance Table
Response: Biochem.HDL
            Df Sum Sq Mean Sq F value    Pr(>F)
Family                                 < 2.2e-16 ***
Residuals
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
   Data: b
 Subset: cc
  AIC   BIC   logLik   deviance   REMLdev
Random effects:
 Groups   Name        Variance Std.Dev.
 Family   (Intercept)
 Residual
Number of obs: 1857, groups: Family, 176
Fixed effects:
            Estimate Std. Error t value
(Intercept)
> plot(resid(f0), resid(f1))

Some Essential R tips

Comparing Models Containing Missing Data
– R will silently omit rows of data containing missing elements, and adjust the df accordingly
– R will only compare models using anova() if the models have been fitted to identical observations
– Sporadic missing values in the explanatory variables can cause problems, because models may have different numbers of complete cases
– The solution is to use the R function complete.cases() to identify the rows with complete data in the most general model, and to specify these rows when fitting each model (see the sketch after this slide):
cc <- complete.cases(data.frame)
f <- lm(formula, data, subset=cc)
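A worked sketch of this pattern; the data frame d and the variables y, x1 and x2 are hypothetical names:

# Fitting nested models on identical observations; names are hypothetical.
cc <- complete.cases(d[, c("y", "x1", "x2")])  # complete for the largest model
f0 <- lm(y ~ x1,      data = d, subset = cc)
f1 <- lm(y ~ x1 + x2, data = d, subset = cc)
anova(f0, f1)   # valid comparison: both models use exactly the same rows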

Joining Two Data Frames on a Common Column
– Frequently the data to be analysed are spread across several data frames, with a column of unique subject ids used to match the rows.
– Unfortunately, the subjects often differ slightly between data frames; e.g. frame1 might contain phenotype data and frame2 genotypes. Usually the two sets don't agree perfectly, because there will be subjects with genotypes but no phenotypes, and vice versa.
– Solution 1: make a combined data frame containing the intersection of the subjects:
intersected <- sort(intersect(frame2$id, frame1$id))
combined <- cbind(frame1[match(intersected, frame1$id), cols1],
                  frame2[match(intersected, frame2$id), cols2])
– Solution 2: make a combined data frame containing the union of the subjects:
ids <- unique(sort(c(as.character(frame1$id), as.character(frame2$id))))
combined <- cbind(frame1[match(ids, frame1$id), cols1],
                  frame2[match(ids, frame2$id), cols2])
– cols1 is a vector of the column names to include from frame1, and cols2 the column names to include from frame2.
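The same joins can also be expressed with base R's merge(); a sketch, assuming frame1 and frame2 share an id column as above:

# Equivalent joins using base R's merge().
combined.inner <- merge(frame1, frame2, by = "id")              # intersection
combined.outer <- merge(frame1, frame2, by = "id", all = TRUE)  # union; NAs fill gaps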