Correlation and Regression


1 Correlation and Regression
Dharshan Kumaran, Hanneke den Ouden

2 Aims
Is there a relationship between x and y?
What is the strength of this relationship? (Pearson’s r)
Can we describe this relationship and use it to predict y from x? (y = ax + b)
Is the relationship we have described statistically significant? (t test)
Relevance to SPM: the GLM

3 Relation between x and y
Correlation: is there a relationship between two variables?
Regression: how well does a certain independent variable predict the dependent variable?
CORRELATION ≠ CAUSATION. In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.
Regression: clearly, for a given subject, if we wanted to predict y with no knowledge of x, we would predict ȳ (the mean). So regression aims to analyse the data for a relationship between x and y such that, for a given x, we can make a more accurate prediction of y than ȳ. Example: y = age at death, x = number of cigarettes smoked.

4 Observation ‘clouds’
[Three scatterplots of y against x: positive correlation, negative correlation, no correlation.]

5 Variance vs Covariance
Do two variables change together?
Variance ~ Δx · Δx. Variance is just a definition: it measures spread around the mean. The reason we square the deviation is to get a positive value whether Δx is negative or positive, so that we can sum the terms and positives and negatives will not cancel out:
s_x² = Σ(xᵢ − x̄)² / (N − 1)
Covariance ~ Δx · Δy. Covariance measures how much x and y change together; it is very similar to variance: multiply the deviations of two variables rather than squaring the deviation of one:
cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (N − 1)
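A minimal sketch of these two definitions in Python (the data and NumPy usage are illustrative additions, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical example data
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()  # deviations from the mean
dy = y - y.mean()

n = len(x)
var_x = np.sum(dx * dx) / (n - 1)   # variance: sum of squared deviations
cov_xy = np.sum(dx * dy) / (n - 1)  # covariance: sum of products of deviations

# Cross-check against NumPy's built-in (its default divisor is also n - 1)
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])
print(var_x, cov_xy)
```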

6 Covariance
When x increases and y increases: cov(x, y) is positive.
When x increases and y decreases: cov(x, y) is negative.
When there is no constant relationship: cov(x, y) = 0.

7 Example: covariance
[Worked-example table: columns for xᵢ, yᵢ, (xᵢ − x̄) and (yᵢ − ȳ); the products (xᵢ − x̄)(yᵢ − ȳ) sum to 7.]
What does this number tell us?

8 Pearson’s r
Covariance on its own does not really tell us much: its size depends on the scale of the variables, so we can only compare covariances between different variables to see which is greater.
Solution: standardise this measure. Pearson’s r standardises covariance by bringing the standard deviations into the equation:
r = cov(x, y) / (s_x · s_y)
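A short sketch of this standardisation, continuing the same hypothetical data (NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # r = cov(x, y) / (s_x * s_y)

assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # matches NumPy's correlation
print(r)  # about 0.775 here: a fairly strong positive correlation
```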

9 Pearson’s r
Why is r always between −1 and 1? Standardising: z = (x − x̄) / s_x. By subtracting x̄ and ȳ you centre the data around (0, 0), and then by dividing by the standard deviation you get a distribution with sd = 1.

10 Limitations of r
When r = 1 or r = −1, we can predict y from x with certainty: all data points lie on a straight line y = ax + b.
r is actually r̂: an estimate, based on the data, of the true r of the whole population.
r is very sensitive to extreme values. Example: x = 1, 2, 3, 4, 5 and y = 1, 2, 3, 4, 0, giving x̄ = 3 and ȳ = 2; the final point would be y = 5 if the pattern continued, and that single extreme value changes r drastically (see the sketch below).
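A quick sketch using the slide’s own numbers, showing how the single extreme value destroys r (NumPy assumed):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_line = np.array([1, 2, 3, 4, 5])     # pattern continued: all points on a line
y_extreme = np.array([1, 2, 3, 4, 0])  # same data with one extreme final value

print(np.corrcoef(x, y_line)[0, 1])     # 1.0: perfect correlation
print(np.corrcoef(x, y_extreme)[0, 1])  # 0.0: one extreme value destroys r
```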

11 In the real world…
r is never 1 or −1, so we find the best fit of a line through a cloud of observations: the principle of least squares.
ε = residual error = yᵢ − ŷᵢ, where yᵢ is the true value and ŷᵢ = axᵢ + b is the predicted value.
Notice that x has no error term, because it is the independent variable: there is no variation in it; the question is how well we predict the dependent y from the independent x. Squaring the residuals is again a trick to prevent positive and negative values from cancelling out (this could be done in other ways, such as taking the absolute value). We want to find the line for which the sum of squared residuals is as small as possible, so we want to solve for a and b. As you may remember from maths classes, to minimise a formula you take its derivative and set it equal to 0.

12 The relationship between x and y (1): finding a and b
Taking a step back before coming back to the principle of least squares: for the population we can say y = ax + b + ε, where ε is the residual. For our linear regression model we have the equation of a straight line, ŷ = ax + b.
We want to calculate a and b. Solving the least squares minimisation gives
a = cov(x, y) / s_x², which can be rewritten as a = r · s_y / s_x
From our model, b = y − ax; and since the fitted line goes through (x̄, ȳ), it follows that
b = ȳ − a · x̄
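A minimal sketch of these two formulas on hypothetical data; the np.polyfit comparison is only a sanity check (NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

a = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # a = cov(x, y) / s_x^2
b = y.mean() - a * x.mean()                 # b = y_bar - a * x_bar

# Sanity check against NumPy's own least squares line fit
a_np, b_np = np.polyfit(x, y, deg=1)
assert np.isclose(a, a_np) and np.isclose(b, b_np)
print(a, b)  # 0.6 and 2.2 for this data
```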

13 The relationship between x and y (2)
Just rewrite the equation, substituting a = r · s_y / s_x and b = ȳ − a · x̄:
ŷ = r · (s_y / s_x) · (x − x̄) + ȳ
See why, if there is no correlation (r = 0), the first term goes to 0 for any x and we simply predict ȳ for ŷ.
Where are we up to? We have calculated the correlation coefficient r and also defined the relationship of y to x by a regression line; the first question was how to plot that line. What can we do with the line? Prediction, e.g. having eaten x turnips, predict age at death. But before this, we need to show that our model is statistically significant.

14 What can the model explain?
Before moving on to the question of significance, it would be nice to know how much of the variability in age at death our model can explain.
[Diagram: for each data point, the deviation yᵢ − ȳ splits into a predicted part ŷᵢ − ȳ and an error part yᵢ − ŷᵢ.]
So we can see diagrammatically that total variance = predicted variance + error variance:
s_y² = s_ŷ² + s_(yᵢ − ŷᵢ)²
We are interested in predicted variance / total variance.

15 Predicted variance
Hence the predicted (explained) variance is
s_ŷ² = r² · s_y²
i.e. r² = explained variance / total variance.
E.g. smoking: if r is 0.5, then r² = 0.25, which means that 25% of the variance in age at death can be predicted from the number of cigarettes smoked.

16 Error variance
Now let’s look at the error (or residual) variance, since it will be important when we come to the question of significance:
s_(yᵢ − ŷᵢ)² = (1 − r²) · s_y²
Substituting this into the equation above:
s_y² = r² · s_y² + (1 − r²) · s_y²
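A numerical sketch of this decomposition on hypothetical data (NumPy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b                  # predicted values from the fitted line

total = np.var(y, ddof=1)          # s_y^2
predicted = np.var(y_hat, ddof=1)  # s_yhat^2
error = np.var(y - y_hat, ddof=1)  # s_(y - yhat)^2

assert np.isclose(predicted, r**2 * total)    # explained = r^2 * total
assert np.isclose(error, (1 - r**2) * total)  # error = (1 - r^2) * total
assert np.isclose(total, predicted + error)   # the full decomposition
```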

17 Is the model significant?
We’ve determined the form of the relationship (y = ax + b) and its strength (r). Does a prediction based on this model do a better job than just predicting the mean? So this is where we are.

18 Analogy with ANOVA
In order to address this question, we need to draw an analogy with ANOVA. ANOVA is a test of the significance of differences between several means against a null hypothesis. In a one-way ANOVA we have
SS_total = SS_between + SS_within
In our regression model we have
s_y² = r² · s_y² + (1 − r²) · s_y²
i.e. total variance = predicted variance + error variance. We know that SS/n = variance, so we can perform an ANOVA on our model with the null hypothesis that r = 0, i.e. no correlation.

19 F statistic (for our model)
Again from ANOVA, we calculate an effect statistic by dividing the mean square of the effect by the mean square error, where MS_effect = SS_between / df and MS_error = SS_within / df:
F(df_model, df_error) = MS_effect / MS_error
Applying this to our model, the MS effect term has 1 degree of freedom and MS error has N − 2:
F(1, N − 2) = (r̂² · s_y² / 1) / ((1 − r̂²) · s_y² / (N − 2))

20 F and t statistic
The s_y² terms cancel, so we have
F(df_model, df_error) = r̂² · (N − 2) / (1 − r̂²)
Alternatively (as F is the square of t), this can also be expressed as a t value:
t(N − 2) = r̂ · √(N − 2) / √(1 − r̂²)
So all we need to know is N and r!
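A sketch of this “all we need is N and r” computation on hypothetical data; the cross-check assumes SciPy is available, whose pearsonr reports an equivalent two-tailed p-value:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0, 7.0, 8.0, 9.0])

n = len(x)
r = np.corrcoef(x, y)[0, 1]

t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # t = r*sqrt(N - 2)/sqrt(1 - r^2)
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-tailed p on N - 2 df

r_scipy, p_scipy = stats.pearsonr(x, y)
assert np.isclose(p, p_scipy)               # agrees with SciPy's built-in test
print(r, t, p)
```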

21 Basic assumptions
Linear relationship
Homogeneity of variance of y
ε ~ N(0, σ²)
No errors in measurement of x
Independent sampling
In order to model our data in this way we have to make several basic assumptions:
1) x and y can be modelled by a linear relationship: examine the data plot.
2) x and y are normally distributed.
3) Variance is homogeneous, i.e. the variance of y for each x is constant in the population. This has a bearing on the fourth assumption:
4) The residuals (observed y − predicted ŷ) form a normal distribution: ε ~ N(0, σ²).
5) There are no errors in the measurement of x (i.e. it is the independent variable).
6) Samples (i.e. data points) are taken independently from one another.

22 SPM: the GLM
Regression model: y = ax + b + ε
Multiple regression model:
y1 = x11·b1 + x12·b2 + … + x1n·bn + e1
y2 = x21·b1 + x22·b2 + … + x2n·bn + e2
⋮
ym = xm1·b1 + xm2·b2 + … + xmn·bn + em
So how is this relevant to SPM? We have talked about simple linear regression, e.g. smoking and age at death. Say you wanted to consider other factors, such as amount of fat eaten and exercise done: that is multiple regression, where you have one dependent variable and several independent predictor variables x1–xn, each with its own coefficient b1–bn. Each equation relates to one sampling point, so we can write the whole system in matrix notation:
y = Xb + e
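A compact sketch of this matrix form with a hypothetical design matrix of two predictors plus a constant column (the data and names are illustrative; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                              # number of sampling points (rows)

X = np.column_stack([
    rng.normal(size=m),              # predictor 1, e.g. cigarettes smoked
    rng.normal(size=m),              # predictor 2, e.g. fat eaten
    np.ones(m),                      # constant column for the intercept
])
b_true = np.array([2.0, -1.0, 5.0])  # the betas the data are built from
y = X @ b_true + rng.normal(scale=0.5, size=m)  # y = Xb + e

# Least squares estimate of the betas, just as in simple regression
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)  # close to b_true
```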

23 SPM!!!!
y = Xb + e
So this is effectively the GLM: linear regression and multiple regression are specific examples of it. In SPM, the column vector y refers to a single voxel at time points 1 to m. X is your design matrix, containing the predictors that you think may explain the observed data. The betas are the parameters that define the contribution of each component of the design matrix to the model. The epsilons are the residuals relating to each time point.
Summary: linear regression is the simplest example of the GLM, which is crucial to SPM.
Observed data = design matrix × parameters + residuals

24 The End
Any questions?*
*See Will, Dan and Lucy

