Linear Modelling I Richard Mott Wellcome Trust Centre for Human Genetics

Synopsis
Linear Regression
Correlation
Analysis of Variance
Principle of Least Squares

Correlation

Correlation and linear regression
Is there a relationship?
How do we summarise it?
Can we predict new observations?
What about outliers?

Correlation Coefficient r
-1 ≤ r ≤ 1
r = 0: no linear relationship
r = 0.6: moderate positive relationship (example scatter plot on the slide)
r = 1: perfect positive linear relationship
r = -1: perfect negative linear relationship

Examples of Correlation (taken from Wikipedia)

Calculation of r
For data pairs (x_i, y_i), i = 1, ..., n:
r = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² · Σ_i (y_i - ȳ)² )
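As a quick check (not from the original slides, using made-up x and y vectors), the formula above can be computed directly in R and compared with the built-in cor():
# compute Pearson's r directly from the formula and compare with cor()
x <- c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7)   # illustrative data, not from the lecture
y <- c(2.1, 2.9, 3.9, 5.2, 4.8, 7.1)
num <- sum( (x - mean(x)) * (y - mean(y)) )
den <- sqrt( sum( (x - mean(x))^2 ) * sum( (y - mean(y))^2 ) )
num / den
cor(x, y)                              # should give the same value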

Correlation in R
> cor(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
[1]
> cor.test(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
Pearson's product-moment correlation
data: bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = , df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor
> pt( , df=1746, lower.tail=FALSE)   # T distribution on 1746 degrees of freedom
[1] e-28

Linear Regression
Fit a straight line to the data: y_i = a + b x_i + e_i
a: intercept
b: slope
e_i: error
 – Normally distributed
 – E(e_i) = 0
 – Var(e_i) = σ²

Example: simulated data
R code
> # simulate 30 data points
> x <- rnorm(30)
> e <- rnorm(30)
> # (the draws above are replaced below by the values actually used)
> x <- 1:30
> e <- rnorm(30, 0, 5)
> y <- 3 + 1.5*x + e   # the intercept and slope used on the slide were not preserved; 3 and 1.5 are illustrative
> # fit the linear model
> f <- lm(y ~ x)
> # plot the data and the predicted line
> plot(x, y)
> abline(reg=f)
> print(f)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x

Least Squares
Estimate a, b by least squares
Minimise the sum of squared residuals between y and the prediction a + bx:
SS(a, b) = Σ_i (y_i - a - b x_i)²

Why least squares?
LS gives simple formulae for the estimates of a, b
If the errors are Normally distributed then the LS estimates are "optimal":
 – in large samples the estimates converge to the true values
 – no other estimates have smaller expected errors
 – LS = maximum likelihood
Even if the errors are not Normal, LS estimates are often useful
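As a sketch of the simple formulae (standard closed-form results, not shown explicitly in the transcript): b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b x̄. In R, re-simulating x and y as on the previous slide:
# closed-form least-squares estimates, checked against lm()
x <- 1:30
y <- 3 + 1.5*x + rnorm(30, 0, 5)     # illustrative re-simulation, as above
b.hat <- sum( (x - mean(x)) * (y - mean(y)) ) / sum( (x - mean(x))^2 )
a.hat <- mean(y) - b.hat * mean(x)
c(a.hat, b.hat)
coef( lm(y ~ x) )                    # lm() gives the same estimates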

Analysis of Variance (ANOVA)
LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components
total SS = fitting SS + residual SS
Only the LS estimates do this
Component     Degrees of freedom   Mean Square (ratio of SS to df)   F-ratio (ratio of FMS/RMS)
Fitting SS    1                    FMS = fitting SS / 1              F = FMS/RMS
Residual SS   n-2                  RMS = residual SS / (n-2)
Total SS      n-1

ANOVA in R
Component     SS   Degrees of freedom   Mean Square   F-ratio
Fitting SS
Residual SS
Total SS
> anova(f)
Analysis of Variance Table
Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x                                    < 2.2e-16 ***
Residuals
> pf(965, 1, 28, lower.tail=FALSE)   # F = 965 on 1 and 28 df
[1] e-23

Hypothesis testing
H0: no relationship between y and x (b = 0)
Assume the errors e_i are independent and normally distributed N(0, σ²)
If H0 is true then the expected values of the sums of squares in the ANOVA are E[fitting SS] = σ² (1 degree of freedom) and E[residual SS] = (n-2)σ² (n-2 degrees of freedom), so both mean squares have expectation σ²
F ratio = (fitting MS)/(residual MS) ~ 1 under H0
F >> 1 implies we reject H0
Under H0, F is distributed as F(1, n-2)

Degrees of Freedom
Suppose e_1, ..., e_n are iid N(0,1)
Then Σ_i e_i² ~ χ²(n), i.e. it behaves like n independent variables
What about Σ_i (e_i - ē)², the sum of squared deviations from the mean?
These deviations are constrained to sum to 0: Σ_i (e_i - ē) = 0
Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1)
In particular, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n-p independent variables
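A quick simulation (not from the slides) illustrates that the sum of squared deviations from the mean has expectation n - 1:
# the average of sum((e - mean(e))^2) over many replicates should be close to n - 1
n <- 10
ss <- replicate(10000, { e <- rnorm(n); sum( (e - mean(e))^2 ) })
mean(ss)   # expect roughly n - 1 = 9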

The F distribution
If e_1, ..., e_n are independent and identically distributed (iid) random variables with distribution N(0, σ²), then:
e_1²/σ², ..., e_n²/σ² are each iid chi-squared random variables on 1 degree of freedom, χ²(1)
The sum S_n = Σ_i e_i²/σ² is distributed as chi-squared, χ²(n)
If T_m is a similar sum distributed as χ²(m), but independent of S_n, then (S_n/n)/(T_m/m) is distributed as an F random variable, F(n, m)
Special cases:
 – F(1, m) is the same as the square of a T-distribution on m df
 – for large m, F(n, m) tends to χ²(n)/n
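A small numerical check (illustrative, not from the slides) of the special case F(1, m) = t(m)²:
# compare the upper 5% point of F(1, m) with the square of the upper 2.5% point of t(m)
m <- 10
qf(0.95, df1=1, df2=m)   # F(1, m) quantile
qt(0.975, df=m)^2        # squared t(m) quantile: same value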

ANOVA – HDL example
> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff
Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
Coefficients:
(Intercept)  bioch$Biochem.Tot.Cholesterol
> anova(ff)
Analysis of Variance Table
Response: bioch$Biochem.HDL
                              Df Sum Sq Mean Sq F value Pr(>F)
bioch$Biochem.Tot.Cholesterol
Residuals
> pf(1044, 1, 28, lower.tail=FALSE)
[1] e-23
HDL = a + b*Cholesterol, with a and b the fitted coefficients above

Correlation and ANOVA
r² = FSS/TSS = fraction of variance explained by the model
r² = F/(F + n - 2)
 – correlation and ANOVA are equivalent
 – the test of r=0 is equivalent to the test of b=0
 – the T statistic in R's cor.test is the square root of the ANOVA F statistic
 – r does not tell us anything about the magnitudes of the estimates of a, b
 – r is dimensionless
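The identity r² = F/(F + n - 2) can be verified numerically; a sketch with simulated data (illustrative values, not from the lecture):
# check r^2 = F/(F + n - 2) on a simulated regression
x <- 1:30
y <- 3 + 1.5*x + rnorm(30, 0, 5)
n <- length(y)
r2 <- cor(x, y)^2
Fstat <- anova(lm(y ~ x))[1, "F value"]
r2
Fstat / (Fstat + n - 2)   # identical to r2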

Effect of sample size on significance
Total Cholesterol vs HDL data
Example R session to sample subsets of data and compute correlations
seqq <- seq(10, 300, 5)
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for (i in seqq) {
  res <- rep(0, 100)
  for (j in 1:100) {
    s <- sample(idx, i)      # idx: row indices of bioch to sample from (defined earlier in the session)
    data <- bioch[s, ]
    co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL, na="pair")
    res[j] <- co$p.value
  }
  m <- exp(mean(log(res)))   # geometric mean P-value across the 100 subsamples
  cat(i, m, "\n")
  corr[n, ] <- c(i, m)
  n <- n + 1
}

Calculating the right sample size n
The R library "pwr" contains functions to compute the sample size for many problems, including correlation (pwr.r.test()) and ANOVA (pwr.anova.test()).
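For example (illustrative effect sizes, not taken from the lecture):
library(pwr)
# sample size needed to detect a correlation of r = 0.3 with 80% power at the 5% level
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)
# per-group sample size for a one-way ANOVA with k = 4 groups and effect size f = 0.25
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.8)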

Problems with non-linearity
All plots have r = 0.8 (taken from Wikipedia)

Multiple Correlation
The R cor function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix. This is useful, for example, when comparing expression of genes across subjects. Gene coexpression networks are often based on the correlation matrix.
In R: mat <- cor(df, use="pair")
 – computes the correlation between every pair of columns in df, removing missing values in a pairwise manner
 – the output is a square matrix of correlation coefficients
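A minimal sketch with a made-up data frame df (three columns, one missing value):
# pairwise correlation matrix for a small illustrative data frame
df <- data.frame(g1 = rnorm(20), g2 = rnorm(20), g3 = rnorm(20))
df$g2[3] <- NA                             # a missing value, handled pairwise
mat <- cor(df, use = "pairwise.complete.obs")
mat                                        # 3 x 3 symmetric matrix of correlations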

One-Way ANOVA
Model y as a function of a categorical variable taking p values
 – e.g. subjects are classified into p families
 – we want to estimate the effect due to each family and test if these effects are different
 – we want to estimate the fraction of variance explained by differences between families (an estimate of heritability)

One-Way ANOVA
LS estimators: the least-squares estimate of the mean of group i is the average of the observations in group i

One-Way ANOVA
The variance is partitioned into fitting and residual SS:
total SS (n-1 degrees of freedom) = fitting SS between groups (p-1 df) + residual SS within groups (n-p df)

One-Way ANOVA
Component     Degrees of freedom   Mean Square (ratio of SS to df)   F-ratio (ratio of FMS/RMS)
Fitting SS    p-1                  FMS = fitting SS / (p-1)          F = FMS/RMS
Residual SS   n-p                  RMS = residual SS / (n-p)
Total SS      n-1
Under H0 (no differences between groups), F ~ F(p-1, n-p)
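A sketch (made-up grouping, not the lecture data) confirming that the fitting and residual sums of squares add up to the total SS:
# one-way ANOVA on a simulated grouping, checking the SS partition
g <- factor(rep(1:4, each = 10))        # p = 4 groups of 10 subjects
yy <- rnorm(40) + as.numeric(g) * 0.5   # group means differ
a <- anova(lm(yy ~ g))
a
sum(a[, "Sum Sq"])                      # fitting SS + residual SS
sum( (yy - mean(yy))^2 )                # equals the total SS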

One-Way ANOVA in R
fam <- lm( bioch$Biochem.HDL ~ bioch$Family )
> anova(fam)
Analysis of Variance Table
Response: bioch$Biochem.HDL
             Df Sum Sq Mean Sq F value    Pr(>F)
bioch$Family                            < 2.2e-16 ***
Residuals
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Component     SS   Degrees of freedom   Mean Square (SS/df)   F-ratio (FMS/RMS)
Fitting SS
Residual SS
Total SS

Factors in R
Grouping variables in R are called factors
When a data frame is read with read.table():
 – a column is treated as numeric if all non-missing entries are numbers
 – a column is boolean if all non-missing entries are T or F (or TRUE or FALSE)
 – a column is treated as a factor otherwise
 – the levels of the factor are the set of distinct values
 – a column can be forced to be treated as a factor using the function as.factor(), or as a numeric vector using as.numeric()
 – BEWARE: if a numeric column contains non-numeric values (e.g. "N" being used instead of "NA" for a missing value), then the column is interpreted as a factor
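A small sketch of the pitfall described above (illustrative values):
# a numeric column with a stray "N" becomes a factor
x <- factor(c("1.2", "3.4", "N", "5.6"))   # how such a column may be read in
as.numeric(x)                              # WRONG: returns the internal level codes
as.numeric(as.character(x))                # recovers the numbers; "N" becomes NA with a warning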

Linear Modelling in R
The R function lm() fits linear models
It has two principal arguments (and some optional ones):
f <- lm( formula, data )
 – formula is an R formula
 – data is the name of the data frame containing the data (it can be omitted if the variables in the formula already include the data frame name)

formulae in R
Biochem.HDL ~ Biochem$Tot.Cholesterol
 – linear regression of HDL on Cholesterol
 – 1 df
Biochem.HDL ~ Family
 – one-way analysis of variance of HDL on Family
 – 173 df
The degrees of freedom are the number of independent parameters to be estimated

ANOVA in R
f <- lm(Biochem.HDL ~ Tot.Cholesterol, data=biochem)
  [or equivalently: f <- lm(biochem$Biochem.HDL ~ biochem$Tot.Cholesterol)]
a <- anova(f)
f <- lm(Biochem.HDL ~ Family, data=biochem)
a <- anova(f)

Non-parametric Methods
So far we have assumed the errors in the data are Normally distributed
P-values and estimates can be inaccurate if this is not the case
Non-parametric methods are a (partial) way round this problem
They make fewer assumptions about the distribution of the data, requiring only that the observations are:
 – independent
 – identically distributed

Non-Parametric Correlation
Spearman Rank Correlation Coefficient
Replace the observations by their ranks, e.g. x = (5, 1, 4, 7) -> rank(x) = (3, 1, 2, 4)
Compute the correlation from the sum of squared differences between ranks
In R:
 – cor( x, y, method="spearman" )
 – cor.test( x, y, method="spearman" )
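A sketch (made-up data, no ties) showing that Spearman's rho is just Pearson's r computed on the ranks:
# Spearman's rho equals Pearson's r on the ranks
x <- c(5, 1, 4, 7, 2, 9)
y <- c(10, 3, 8, 20, 1, 40)
cor(x, y, method = "spearman")
cor(rank(x), rank(y))            # same value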

Spearman Correlation
> cor.test(xx, y, method="pearson")
Pearson's product-moment correlation
data: xx and y
t = , df = 28, p-value =
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor
> cor.test(xx, y, method="spearman")
Spearman's rank correlation rho
data: xx and y
S = , p-value =
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho

Non-Parametric One-Way ANOVA
Kruskal-Wallis Test
Useful if the data are highly non-Normal
 – replace the data by ranks
 – compute the average rank within each group
 – compare the averages
 – in R: kruskal.test( formula, data )
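A sketch reusing the bioch data frame from the earlier slides:
# Kruskal-Wallis analogue of the one-way ANOVA of HDL on family shown earlier
kruskal.test(bioch$Biochem.HDL ~ bioch$Family)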

Permutation Tests as non-parametric tests
Example: One-way ANOVA
 – permute the group identities between subjects
 – count the fraction of permutations in which the ANOVA p-value is smaller than the true p-value
a <- anova(lm( bioch$Biochem.HDL ~ bioch$Family ))
p <- a[1, 5]                                   # p-value for the observed grouping
pv <- rep(0, 1000)
for (i in 1:1000) {
  perm <- sample(bioch$Family, replace=FALSE)  # shuffle family labels between subjects
  a <- anova(lm( bioch$Biochem.HDL ~ perm ))
  pv[i] <- a[1, 5]
}
pval <- mean(pv < p)                           # permutation p-value