Linear Modelling I
Richard Mott, Wellcome Trust Centre for Human Genetics

Synopsis
Linear Regression
Correlation
Analysis of Variance
Principle of Least Squares

Correlation

Correlation and linear regression
Is there a relationship? How do we summarise it? Can we predict new observations? What about outliers?

Correlation Coefficient r
r = 0: no linear relationship
r = 0.6: intermediate positive linear relationship
r = 1: perfect positive linear relationship
r = -1: perfect negative linear relationship

Examples of Correlation (taken from Wikipedia)

Calculation of r
For data (x1,y1), ..., (xn,yn), the sample correlation coefficient is
r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]
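As a sketch (not part of the original slides), r can be computed directly from this formula and checked against R's cor(); the data vectors here are hypothetical:

x <- c(1.2, 2.3, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.9, 5.1, 4.8)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(r_manual, cor(x, y))   # the two values agree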

Correlation in R
> cor(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
[1] 0.2577617
> cor.test(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")

        Pearson's product-moment correlation

data:  bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2134566 0.3010088
sample estimates:
      cor
0.2577617

> pt(11.1473, df=1746, lower.tail=FALSE)   # T distribution on 1746 degrees of freedom
[1] 3.154319e-28

Linear Regression
Fit a straight line to data:
yi = a + b·xi + ei
a: intercept; b: slope; ei: error, Normally distributed with E(ei) = 0 and Var(ei) = σ²

Example: simulated data
R code:
> # simulate 30 data points with intercept 1, slope 3 and noise sd 5
> x <- 1:30
> e <- rnorm(30, 0, 5)
> y <- 1 + 3*x + e
> # fit the linear model
> f <- lm(y ~ x)
> # plot the data and the fitted line
> plot(x, y)
> abline(reg=f)
> print(f)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -0.08634      3.04747

Least Squares
Estimate a, b by least squares: minimise the sum of squared residuals between y and the prediction a + bx, i.e. minimise
SS(a,b) = Σi (yi − a − b·xi)²

Why least squares?
LS gives simple formulae for the estimates of a, b
If the errors are Normally distributed then the LS estimates are "optimal":
– in large samples the estimates converge to the true values
– no other estimates have smaller expected errors
– LS = maximum likelihood
Even if the errors are not Normal, LS estimates are often useful
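As a sketch of the "simple formulae" referred to above (the simulated x and y are illustrative, set up as in the earlier example):

x <- 1:30
y <- 1 + 3*x + rnorm(30, 0, 5)
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # LS slope
a_hat <- mean(y) - b_hat * mean(x)                                   # LS intercept
c(a_hat, b_hat)
coef(lm(y ~ x))   # lm() returns the same estimates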

Analysis of Variance (ANOVA)
LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components:
total SS = fitting SS + residual SS
Only the LS estimates do this.

Component     Degrees of freedom   Mean Square (SS/df)   F-ratio (FMS/RMS)
Fitting SS    1
Residual SS   n-2
Total SS      n-1
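A quick numerical check of the partition (a sketch assuming the simulated x and y from the example above):

f <- lm(y ~ x)
total_ss    <- sum((y - mean(y))^2)
residual_ss <- sum(residuals(f)^2)
fitting_ss  <- sum((fitted(f) - mean(y))^2)
c(total_ss, fitting_ss + residual_ss)   # the two numbers are equal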

ANOVA in R
> anova(f)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7     965 < 2.2e-16 ***
Residuals 28   605.6    21.6
> pf(965, 1, 28, lower.tail=FALSE)
[1] 3.042279e-23

Component     SS        Degrees of freedom   Mean Square   F-ratio
Fitting SS    20872.7   1                    20872.7       965
Residual SS   605.6     28                   21.6
Total SS      21478.3   29

Hypothesis testing
H0: no relationship between y and x (b = 0)
Assume the errors ei are independent and Normally distributed, N(0, σ²)
If H0 is true, the expected values of the sums of squares in the ANOVA are:
  Fitting SS: 1 degree of freedom, expectation σ²
  Residual SS: n-2 degrees of freedom, expectation (n-2)σ²
F ratio = (fitting MS)/(residual MS) ~ 1 under H0
F >> 1 implies we reject H0
Under H0, F is distributed as F(1, n-2)

Degrees of Freedom
Suppose z1, ..., zn are iid N(0,1). Then Σi zi² ~ χ²n, i.e. the sum behaves like n independent variables.
What about Σi (zi − z̄)²? These values are constrained to sum to 0: Σi (zi − z̄) = 0.
Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1).
In particular, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n-p independent variables.
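A small simulation sketch of this argument: with n iid N(0,1) variables, the average of Σ(zi − z̄)² over many replicates is close to n − 1 rather than n.

n <- 10
sims <- replicate(10000, { z <- rnorm(n); sum((z - mean(z))^2) })
mean(sims)   # approximately n - 1 = 9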

The F distribution
If e1, ..., en are independent and identically distributed (iid) random variables with distribution N(0, σ²), then:
e1²/σ², ..., en²/σ² are each iid chi-squared random variables on 1 degree of freedom, χ²1
The sum Sn = Σi ei²/σ² is distributed as chi-squared, χ²n
If Tm is a similar sum distributed as χ²m, but independent of Sn, then (Sn/n)/(Tm/m) is distributed as an F random variable, F(n,m)
Special cases:
F(1,m) is the square of a t-distribution on m df
for large m, n·F(n,m) tends to χ²n
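A quick numerical check of the first special case (the choice of m = 28 here is arbitrary): the squared two-sided 5% critical value of t on m df equals the 5% critical value of F(1, m).

m <- 28
qt(0.975, m)^2    # squared upper 2.5% point of t on m df
qf(0.95, 1, m)    # upper 5% point of F(1, m): the same number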

ANOVA – HDL example
HDL = 0.2308 + 0.4456*Cholesterol
> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff

Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)  bioch$Biochem.Tot.Cholesterol
                       0.2308                         0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044
Residuals                     1849 265.057   0.143
> pf(1044, 1, 28, lower.tail=FALSE)   # note: the residual df for this model is 1849, not 28
[1] 1.040709e-23

Correlation and ANOVA
r² = FSS/TSS = fraction of variance explained by the model
r² = F/(F + n - 2)
Correlation and ANOVA are equivalent: a test of r = 0 is equivalent to a test of b = 0
The t statistic in R's cor.test is the square root of the ANOVA F statistic
r does not tell you anything about the magnitudes of the estimates of a, b
r is dimensionless
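A sketch of this equivalence on simulated data (rather than the biochemistry data): the squared t statistic from cor.test equals the F statistic from the regression ANOVA.

x <- 1:30
y <- 1 + 3*x + rnorm(30, 0, 5)
cor.test(x, y)$statistic^2         # t squared
anova(lm(y ~ x))[1, "F value"]     # the same value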

Effect of sample size on significance
Total Cholesterol vs HDL data. Example R session to sample subsets of the data and compute correlations:

# idx is assumed to hold the row indices of bioch to sample from, e.g. idx <- 1:nrow(bioch)
seqq <- seq(10, 300, 5)
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for(i in seqq) {
  res <- rep(0, 100)
  for(j in 1:100) {
    s <- sample(idx, i)                 # draw a random subset of i subjects
    data <- bioch[s,]
    co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL)
    res[j] <- co$p.value
  }
  m <- exp(mean(log(res)))              # geometric mean P-value over the 100 subsets
  cat(i, m, "\n")
  corr[n,] <- c(i, m)
  n <- n + 1
}

Calculating the right sample size n
The R library "pwr" contains functions to compute the sample size for many problems, including correlation (pwr.r.test()) and ANOVA (pwr.anova.test()).
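A minimal sketch of how these functions are called (the effect sizes and power chosen here are illustrative, and the pwr package must be installed):

library(pwr)
pwr.r.test(r = 0.25, sig.level = 0.05, power = 0.8)             # n needed to detect a correlation of 0.25
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.8)  # n per group for a balanced 4-group ANOVA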

Problems with non-linearity
Correlation can be misleading for non-linear relationships: in one set of example plots all datasets have r = 0, and in another all have r = 0.8, despite very different patterns (figures taken from Wikipedia)

Multiple Correlation
The R cor() function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix.
This is useful, for example, when comparing expression of genes across subjects. Gene coexpression networks are often based on the correlation matrix.
In R, mat <- cor(df, use="pairwise.complete.obs") computes the correlation between every pair of columns in df, removing missing values in a pairwise manner.
The output is a square matrix of correlation coefficients.
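A minimal sketch with a small hypothetical data frame (three simulated "genes" with one missing value):

df <- data.frame(g1 = rnorm(20), g2 = rnorm(20), g3 = rnorm(20))
df$g1[3] <- NA                                 # introduce a missing value
mat <- cor(df, use = "pairwise.complete.obs")  # pairwise removal of missing values
mat                                            # 3 x 3 symmetric matrix of correlations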

One-Way ANOVA
Model y as a function of a categorical variable taking p values, e.g. subjects classified into p families
We want to estimate the effect due to each family and test whether these effects differ
We also want to estimate the fraction of variance explained by differences between families (an estimate of heritability)

One-Way ANOVA
Model: yij = μi + eij for observation j in group i
The LS estimator of μi is the average over group i: μ̂i = ȳi = (1/ni) Σj yij
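A sketch of this property on hypothetical data: the fitted values from a one-way ANOVA are just the group means.

g <- factor(rep(1:3, each = 10))            # three groups of ten subjects
y <- rnorm(30, mean = c(1, 2, 3)[g])        # true group means 1, 2 and 3
tapply(y, g, mean)                          # the group averages
fitted(lm(y ~ g))[!duplicated(g)]           # the same three values from the LS fit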

One-Way ANOVA
Variance is partitioned into fitting and residual SS:
total SS (n-1 degrees of freedom) = fitting SS between groups (p-1 df) + residual SS within groups (n-p df)

One-Way ANOVA

Component     Degrees of freedom   Mean Square (SS/df)   F-ratio (FMS/RMS)
Fitting SS    p-1
Residual SS   n-p
Total SS      n-1

Under H0 (no differences between groups), F ~ F(p-1, n-p)

One-Way ANOVA in R
> fam <- lm(bioch$Biochem.HDL ~ bioch$Family)
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Family   173  6.3870  0.0369  3.4375 < 2.2e-16 ***
Residuals     1727 18.5478  0.0107
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Component     SS        Degrees of freedom   Mean Square   F-ratio
Fitting SS    6.3870    173                  0.0369        3.4375
Residual SS   18.5478   1727                 0.0107
Total SS      24.9348   1900

Factors in R
Grouping variables in R are called factors.
When a data frame is read with read.table():
– a column is treated as numeric if all non-missing entries are numbers
– a column is boolean if all non-missing entries are T or F (or TRUE or FALSE)
– a column is treated as a factor otherwise; the levels of the factor are the set of distinct values
A column can be forced to be treated as a factor using the function as.factor(), or as a numeric vector using as.numeric().
BEWARE: if a numeric column contains non-numeric values (e.g. "N" being used instead of "NA" for a missing value), then the whole column is interpreted as a factor.
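A minimal sketch of the pitfall described above, using a hypothetical column containing an "N":

x <- factor(c("1.2", "3.4", "N", "5.6"))  # the "N" entry forces the whole column to be a factor
as.numeric(x)                   # wrong: returns the internal level codes, not the numbers
as.numeric(as.character(x))     # correct: numbers recovered, "N" becomes NA (with a warning)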

Linear Modelling in R
The R function lm() fits linear models. It has two principal arguments (and some optional ones):
f <- lm( formula, data )
formula is an R formula; data is the name of the data frame containing the variables. The data argument can be omitted if the variables in the formula are already qualified with the data frame name (e.g. bioch$Biochem.HDL).

Formulae in R
Biochem.HDL ~ Biochem.Tot.Cholesterol : linear regression of HDL on Cholesterol (1 df)
Biochem.HDL ~ Family : one-way analysis of variance of HDL on Family (173 df)
The degrees of freedom are the number of independent parameters to be estimated.

ANOVA in R
f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
[or equivalently f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]
a <- anova(f)
f <- lm(Biochem.HDL ~ Family, data=bioch)

Non-parametric Methods
So far we have assumed the errors in the data are Normally distributed.
P-values and estimates can be inaccurate if this is not the case.
Non-parametric methods are a (partial) way round this problem: they make fewer assumptions about the distribution of the data, typically only that the observations are independent and identically distributed.

Non-Parametric Correlation
Spearman Rank Correlation Coefficient
Replace the observations by their ranks, e.g. x = (5, 1, 4, 7) -> rank(x) = (3, 1, 2, 4)
Compute the sum of squared differences between ranks
In R: cor(x, y, method="spearman") and cor.test(x, y, method="spearman")
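A sketch (with a hypothetical y) showing that Spearman's coefficient is just the Pearson correlation of the ranks when there are no ties:

x <- c(5, 1, 4, 7)
y <- c(2, 9, 6, 3)
cor(x, y, method = "spearman")
cor(rank(x), rank(y))            # the same value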

Spearman Correlation
> cor.test(xx, y, method="pearson")

        Pearson's product-moment correlation

data:  xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3566213  0.3639062
sample estimates:
        cor
0.004185729

> cor.test(xx, y, method="spearman")

        Spearman's rank correlation rho

S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4496607

Non-Parametric One-Way ANOVA
Kruskal-Wallis Test – useful if the data are highly non-Normal
Replace the data by their ranks
Compute the average rank within each group
Compare the averages
In R: kruskal.test( formula, data )
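A sketch mirroring the earlier one-way ANOVA example (assuming the bioch data frame used throughout these slides):

kruskal.test(Biochem.HDL ~ Family, data = bioch)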

Permutation Tests as non-parametric tests
Example: One-way ANOVA
Permute the group identity between subjects and count the fraction of permutations in which the ANOVA p-value is smaller than the true p-value:

# observed p-value for the real grouping
a <- anova(lm(bioch$Biochem.HDL ~ bioch$Family))
p <- a[1,5]
# p-values for 1000 random permutations of the family labels
pv <- rep(0, 1000)
for(i in 1:1000) {
  perm <- sample(bioch$Family, replace=FALSE)
  a <- anova(lm(bioch$Biochem.HDL ~ perm))
  pv[i] <- a[1,5]
}
# permutation p-value: fraction of permutations with a smaller ANOVA p-value than observed
pval <- mean(pv < p)