Correlation and Regression


Correlation and Regression
Dharshan Kumaran and Hanneke den Ouden

Aims
- Is there a relationship between x and y?
- What is the strength of this relationship? (Pearson's r)
- Can we describe this relationship and use it to predict y from x? (y = ax + b)
- Is the relationship we have described statistically significant? (t test)
- Relevance to SPM (the GLM)

Relation between x and y
Correlation: is there a relationship between two variables?
Regression: how well does a given independent variable predict the dependent variable?
CORRELATION ≠ CAUSATION. In order to infer causality, manipulate the independent variable and observe the effect on the dependent variable.
Regression: clearly, for a given subject, if we wanted to predict Y with no knowledge of X, we would predict $\bar y$ (the mean). So regression aims to analyse the data for a relationship between x and y such that, for a given x, we can make a more accurate prediction of y than $\bar y$. Example: Y = age at death, X = number of cigarettes smoked.

Observation 'clouds'
[Figure: three scatter plots of y against x, showing positive correlation, negative correlation, and no correlation.]

Variance vs covariance
Do two variables change together?
Variance: $s_x^2 = \frac{\sum_i (x_i - \bar x)^2}{N-1}$, roughly $\sum dx \cdot dx$. Variance is just a definition: the spread around the mean. The reason we square dx is so the result is positive whether dx is negative or positive, so that we can sum the terms and positives and negatives will not cancel out.
Covariance: $\mathrm{cov}(x,y) = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{N-1}$, roughly $\sum dx \cdot dy$. Covariance is a measure of how much x and y change together. It is very similar to variance: multiply two variables rather than squaring one.

Covariance
When X and Y increase together: cov(x, y) is positive.
When X increases as Y decreases: cov(x, y) is negative.
When there is no consistent relationship: cov(x, y) = 0.

Example: covariance
[Worked table garbled in transcription: a handful of (x, y) pairs, their deviations from the means, and the products $(x_i - \bar x)(y_i - \bar y)$, which sum to 7.]
What does this number tell us?
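Since the slide's table is lost, here is a minimal sketch of the same calculation in Python. The (x, y) values are a hypothetical stand-in, chosen so the summed products come to 7 as on the slide:

```python
import numpy as np

# Hypothetical stand-in for the slide's garbled table; values chosen so the
# products of deviations sum to 7, matching the slide's total.
x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])

dx = x - x.mean()                       # deviations of x from its mean
dy = y - y.mean()                       # deviations of y from its mean
print((dx * dy).sum())                  # 7.0, the summed products

cov = (dx * dy).sum() / (len(x) - 1)    # sample covariance
print(cov, np.cov(x, y)[0, 1])          # 1.75 both ways: matches NumPy
```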

Pearson's r
On its own, covariance does not really tell us anything: its size depends on the units of x and y, so we can only compare covariances between different pairs of variables to see which is greater. Solution: standardise this measure. Pearson's r standardises by bringing the standard deviations into the equation:
$r = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$

Pearson's r
Why is r always between −1 and 1? Standardising: $z = (x - \bar x)/s_x$. By subtracting $\bar x$ and $\bar y$ you centre the cloud around (0, 0), and then by dividing by the standard deviations you get distributions with sd = 1. r is then just the mean product of the z-scores, and such a mean product can never exceed 1 in absolute value (by the Cauchy-Schwarz inequality).

Limitations of r
When r = 1 or r = −1, we can predict y from x with certainty: all data points lie on a straight line y = ax + b.
Our r is actually an estimate, $\hat r$: $\rho$ = the true r of the whole population; $\hat r$ = the estimate of $\rho$ based on the data.
r is very sensitive to extreme values. Example: X = 1, 2, 3, 4, 5 and Y = 1, 2, 3, 4, 0 (so $\bar x$ = 3, $\bar y$ = 2). Had the final value been y = 5, every point would sit on the line y = x and r would be 1; the single extreme value y = 0 brings r all the way down to 0.
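This is easy to check numerically. A minimal sketch in plain NumPy (the helper function is ours, not from the slides), using the slide's own numbers:

```python
import numpy as np

def pearson_r(x, y):
    # r as the mean product of z-scores
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return (zx * zy).mean()

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_line    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # every point on y = x
y_outlier = np.array([1.0, 2.0, 3.0, 4.0, 0.0])  # slide's data: last y is extreme

print(pearson_r(x, y_line))     # 1.0
print(pearson_r(x, y_outlier))  # 0.0 -- one extreme point wipes out the correlation
```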

In the real world…
r is never exactly 1 or −1, so we find the best fit of a line through a cloud of observations using the principle of least squares.
$\epsilon_i$ = residual error = $y_i - \hat y_i$, i.e. true value minus predicted value, where $\hat y = ax + b$.
Notice that x has no error term here, because it is the independent variable: there is no variation in it; the question is how well we predict the dependent y from the independent x. Again, squaring the residuals is a sort of trick to prevent positive and negative values from cancelling out (we could achieve that in other ways, such as taking absolute values). We want to find the line for which the sum of squared residuals is as small as possible; thus we want to get at a and b. As you may remember from maths classes, in order to minimise a formula you take its derivative and set it equal to 0.

The relationship between x and y (1): finding a and b
Let's take a step back before coming back to the principle of least squares. For the population we can say $y_i = \alpha x_i + \beta + \epsilon_i$, where $\epsilon$ is the residual. For our linear regression model we have the equation of a straight line, $\hat y = ax + b$, and we want to calculate a and b. Solving the least-squares minimisation gives $a = \mathrm{cov}(x,y)/\mathrm{var}(x)$, which can be rewritten as $a = r\, s_y / s_x$. From our model, $b = \hat y - ax$; and since the fitted line passes through $(\bar x, \bar y)$, it follows that $b = \bar y - a \bar x$.
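A minimal sketch of this solution (hypothetical data), checked against NumPy's own line fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical observations

a = np.cov(x, y)[0, 1] / x.var(ddof=1)    # slope: a = cov(x, y) / var(x)
b = y.mean() - a * x.mean()               # intercept: line passes through (mean x, mean y)
print(a, b)                               # 1.96, 0.14

print(np.polyfit(x, y, 1))                # NumPy's least-squares fit gives the same [a, b]
```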

The relationship between x and y (2)
Just rewrite the equation: $\hat y = r \dfrac{s_y}{s_x}(x - \bar x) + \bar y$.
See why, if there is no correlation (r = 0), the first term goes to 0 for any x, and we simply predict $\bar y$ for $\hat y$.
Where are we up to? Up to now, we have calculated the correlation coefficient r and also defined the relationship of y to x by a regression line; the first question was how to plot that line. What can we do with the line? Prediction, e.g. from the number of turnips eaten, predict age at death. But before this, we need to show that our model is statistically significant…

What can the model explain?
Before moving on to the question of significance, it would be nice to know how much of the variability in age at death our model can explain. [Diagram: for each data point, the deviation $y_i - \bar y$ splits into a predicted part $\hat y_i - \bar y$ and an error part $y_i - \hat y_i$.] So we can see diagrammatically that
total variance = predicted variance + error variance:
$s_y^2 = s_{\hat y}^2 + s_{(y_i - \hat y_i)}^2$
We are interested in predicted variance / total variance.

Predicted variance
Explained variance = $\dfrac{\text{predicted variance}}{\text{total variance}} = \dfrac{s_{\hat y}^2}{s_y^2} = \hat r^2$
Hence predicted variance $s_{\hat y}^2 = \hat r^2 s_y^2$.
E.g. smoking: if r is 0.5, then $r^2 = 0.25$, meaning 25% of the variance in age at death can be predicted from the number of cigarettes smoked.

Error variance
Now let's look at the error (or residual) variance, since it will be important when we come to the question of significance. Substituting $s_{\hat y}^2 = \hat r^2 s_y^2$ into the partition above:
$s_{(y - \hat y)}^2 = s_y^2 - s_{\hat y}^2 = (1 - \hat r^2)\, s_y^2$
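Both identities are easy to verify numerically; a sketch using the same hypothetical data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b                         # model predictions

var_total = y.var(ddof=1)                 # total variance s_y^2
var_pred  = y_hat.var(ddof=1)             # predicted variance s_yhat^2
var_err   = (y - y_hat).var(ddof=1)       # error (residual) variance

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(var_total, var_pred + var_err))  # True: the partition holds
print(np.isclose(var_pred, r**2 * var_total))     # True: s_yhat^2 = r^2 * s_y^2
```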

Is the model significant?
We've determined the form of the relationship (y = ax + b) and its strength (r). Does a prediction based on this model do a better job than just predicting the mean? So this is where we are:

Analogy with ANOVA
In order to address this question, we draw an analogy with ANOVA, a test of the significance of differences between several means against a null hypothesis. In a one-way ANOVA we have
$SS_{total} = SS_{between} + SS_{within}$
In our regression model we have total variance = predicted variance + error variance:
$s_y^2 = \hat r^2 s_y^2 + (1 - \hat r^2)\, s_y^2$
We know that SS divided by its degrees of freedom is a variance. Hence we want to perform an ANOVA on our model, with the null hypothesis being that r = 0, i.e. no correlation.

F statistic (for our model)
Again from ANOVA, we calculate an effect statistic by dividing the mean square of the effect by the mean square of the error:
$F_{(df_{model},\, df_{error})} = \dfrac{MS_{effect}}{MS_{error}}$, where $MS_{effect} = SS_{between}/df$ and $MS_{error} = SS_{within}/df$.
Applying this to our model, the MS effect term has 1 degree of freedom and the MS error term has N − 2:
$F_{(1,\, N-2)} = \dfrac{\hat r^2 s_y^2 / 1}{(1 - \hat r^2)\, s_y^2 / (N - 2)}$

F and t statistic
$F_{(1,\, N-2)} = \dfrac{\hat r^2 (N - 2)}{1 - \hat r^2}$
Alternatively (as F is the square of t), this can also be expressed as a t value:
$t_{(N-2)} = \dfrac{\hat r \sqrt{N - 2}}{\sqrt{1 - \hat r^2}}$
So all we need to know is N and r!
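A quick numerical check of the test (hypothetical data again; SciPy's pearsonr is used only for comparison):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t with N-2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed p-value

print(t, p)
print(t**2)                                  # equals F = r^2 (N-2) / (1 - r^2)
print(stats.pearsonr(x, y))                  # SciPy reports the same r and p
```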

Basic assumptions
In order to model our data in this way, we have to make several basic assumptions:
1) There is a linear relationship between x and y (examine the data plot).
2) X and Y are normally distributed.
3) Homogeneity of variance: the variance of Y for each X is constant in the population.
4) The residuals (observed Y minus predicted Y) form a normal distribution: $e \sim N(0, \sigma^2)$.
5) There are no errors in the measurement of X (i.e. it is the independent variable).
6) Samples (i.e. data points) are taken independently of one another.
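Some of these assumptions can be checked on the residuals. A brief diagnostic sketch: these particular checks are our choice, not from the slides, and the Shapiro-Wilk test is just one common normality check:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)

# Assumption 4: residuals ~ N(0, s^2). Note a sample this small gives the
# test very little power; it is shown only to illustrate the mechanics.
w, p = stats.shapiro(residuals)
print(p)                                        # large p: no evidence against normality

# Assumption 3 (homogeneity): residual spread should not trend with x.
print(np.corrcoef(x, np.abs(residuals))[0, 1])  # near 0 is reassuring
```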

SPM: the GLM
Regression model: $y = x\beta + \epsilon$
Multiple regression model:
$y_1 = x_{11}\beta_1 + x_{12}\beta_2 + \dots + x_{1n}\beta_n + \epsilon_1$
$y_2 = x_{21}\beta_1 + x_{22}\beta_2 + \dots + x_{2n}\beta_n + \epsilon_2$
$\vdots$
$y_m = x_{m1}\beta_1 + x_{m2}\beta_2 + \dots + x_{mn}\beta_n + \epsilon_m$
So how is this relevant to SPM? We have talked about simple linear regression, e.g. smoking and age at death. Say you wanted to consider other factors, such as the amount of fat eaten and exercise done: that is multiple regression, where you have one dependent variable and several independent predictor variables $x_1 \dots x_n$, each with its own coefficient $\beta_1 \dots \beta_n$. Each equation is related to one sampling point, so we can write the whole system in matrix notation:
$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$

SPM!!!!
$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$: observed data = design matrix × parameters + residuals.
So this is effectively the GLM: linear regression and multiple regression are specific examples of it. In SPM, the column y refers to a single voxel across time points 1 to m. X is your design matrix, containing the predictors that you think may explain the observed data. The betas are the parameters that define the contribution of each component of the design matrix to the model. The epsilons are the residuals relating to each time point. Summary: linear regression is the simplest example of the GLM, which is crucial to SPM.
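As a rough illustration of the matrix form (a simulated design, not SPM's actual interface), here is how the least-squares estimate of β recovers known parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                         # m time points (rows of X)

# Hypothetical design matrix: two regressors of interest plus a constant column.
X = np.column_stack([rng.normal(size=m), rng.normal(size=m), np.ones(m)])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=m)   # simulated voxel time course

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of beta
residuals = y - X @ beta_hat                        # one residual per time point
print(beta_hat)                                # close to [2.0, -1.0, 0.5]
```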

The End Any questions?* *See Will, Dan and Lucy