REGRESSION AND CORRELATION

An Introduction to REGRESSION AND CORRELATION

How do we measure the association of 2 continuous, numeric scale variables?
Example: Observations are available on a sample of 30 individuals:
• systolic blood pressure (SBP)
• age
We are interested in the relationship between SBP and age for these patients (descriptive) and for the population which they represent (inferential).

Data on 30 individuals, listed as individual (i), SBP (y), AGE (x):
1 144 39 16 130 48 2 220 47 17 135 45 3 138 18 114 4 145 19 116 20 5 162 65 124 6 142 46 21 136 36 7 170 67 22 50 8 42 23 120 9 158 24 10 154 56 25 160 44 11 64 26 53 12 150 27 63 13 140 59 28 29 14 110 34 125 15 128 30 175 69

Note: We have 30 pairs of observations, which we can denote as:
$(x_1, y_1) = (39, 144)$
$(x_2, y_2) = (47, 220)$
…
$(x_{30}, y_{30}) = (69, 175)$
where $x_i$ refers to age for the i-th subject and $y_i$ to SBP for the i-th subject.

• These data pairs may be considered as points in two-dimensional space, so that we may plot them on a graph.
[Figure: Scatter diagram of age and systolic blood pressure; x-axis: AGE in years, y-axis: SBP (mm Hg)]
Note: age and SBP seem to be related: younger subjects tend to have lower SBP, and older subjects higher SBP.

How can this relationship be measured?
[Figure: three scatter plots of y against x]
• No relationship between x and y: spread is even in all directions.
• Linear relationship: a line indicates the main direction of the spread of points.
• Non-linear relationship between x and y: a curve best describes the relationship.

Math Review: Equation for a Line
$y = \beta_0 + \beta_1 x$
$\beta_0$ = y-intercept = value of y when x = 0
$\beta_1$ = slope = $\Delta y / \Delta x$ = (change in y) / (change in x)

$\beta_1$ = "slope" = $\Delta y / \Delta x$ = (change in y) / (change in x)
• Slope > 0: positive slope (as x increases, y increases)
• Slope = 0: horizontal line (y does not change as x increases)
• Slope < 0: negative slope (as x increases, y decreases)

Now, given a set of data, how can we get the line that best fits or best represents the data?
When it is appropriate to predict one variable (y) from another variable (x), that is, when there is some directionality in the relationship, we commonly use a technique known as Least Squares Regression to estimate:
• the intercept: $\beta_0$
• the slope: $\beta_1$
We denote the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively (referred to as beta-nought-hat and beta-one-hat).

We are looking for the line which minimizes the vertical distances to the data points.
For each observed value $x_i$, we have an observed $y_i$ and the "predicted" value $\hat{y}_i$ on the line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
The vertical distances are: $d_i = (y_i - \hat{y}_i)$.

That is, we have:
$x_i$ = observed x for the i-th subject
$y_i$ = observed y for the i-th subject
$\hat{y}_i$ = predicted y for the i-th subject
[Figure: an observed point $(x_i, y_i)$ and the corresponding point $(x_i, \hat{y}_i)$ on the fitted line]

The squared distances are: $d_i^2 = (y_i - \hat{y}_i)^2$
and the sum of squared deviations from the line (sound familiar?) is
$\sum d_i^2 = \sum (y_i - \hat{y}_i)^2$
We want the line such that this sum is minimized.
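The minimization can be written out as a short calculus sketch. The symbol $Q$ for the sum of squares is introduced here for illustration only and does not appear in the slides.

```latex
\begin{align*}
Q(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \, (y_i - \beta_0 - \beta_1 x_i) = 0
\end{align*}
```

The first condition says that the residuals sum to zero. Solving the two conditions together for $\beta_0$ and $\beta_1$ gives the least squares estimates.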

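The unbiased estimates of $\beta_0$ and $\beta_1$, which are the least squares estimates and the minimum variance estimates, are:
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
These are obtained by using calculus to minimize the sum of squares in the previous equations.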
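As a minimal sketch, the least squares estimates can be computed directly from the sums of squares and cross-products. The function name `least_squares_fit` and the ten data pairs below are made up for illustration; they are not the 30-patient data from the slides.

```python
def least_squares_fit(x, y):
    """Return (b0_hat, b1_hat) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of x, both about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


# Hypothetical illustration data (NOT the data set used in the slides).
age = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

b0_hat, b1_hat = least_squares_fit(age, sbp)
print(f"fitted line: SBP = {b0_hat:.2f} + {b1_hat:.2f} * AGE")
```

The closed-form expressions are used here rather than a library call so that each term can be matched to the formulas for the slope and intercept.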

Example: Using the data on 30 individuals, where we measured AGE (x) and SBP (y): n = 30, $\bar{y}$ = , $\bar{x}$ = 45.13. We get:

Thus, the equation for this straight line is given by
[Figure: fitted regression line drawn over the scatter diagram; x-axis: AGE]

Now, if $\hat{y}_i = y_i$ for all i, then SSE = $\sum (y_i - \hat{y}_i)^2$ = 0, which is a perfect fit to the line.
As the fit gets worse, SSE gets larger.
SSE serves as a measure of fit to the line.

One of the assumptions for regression analysis is that of homoscedasticity:
the variance of y is the same for any x; that is, the spread of values for y at each level of x remains approximately constant.
[Figure: spread of y|x compared with the spread of y ignoring x]

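An estimate of $\sigma^2$ is given by:
$s_{y|x}^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}$
We lose 2 df for estimating $\beta_0$ and $\beta_1$.
The standard error, $s_{y|x}$, is a measure of the spread of y around its predicted value $\hat{y}$ for each value of x.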
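A small sketch of how SSE and the standard error $s_{y|x}$ follow from a fitted line. The helper names `fit_line` and `sse_and_standard_error` and the data are hypothetical, chosen only for illustration.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def sse_and_standard_error(x, y):
    """Return (SSE, s_yx): residual sum of squares and sqrt(SSE / (n - 2))."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two degrees of freedom are lost for estimating the intercept and slope.
    s_yx = math.sqrt(sse / (len(x) - 2))
    return sse, s_yx


# Hypothetical illustration data (NOT the data set used in the slides).
age = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

sse, s_yx = sse_and_standard_error(age, sbp)
print(f"SSE = {sse:.2f}, s_y|x = {s_yx:.2f} mm Hg")
```

For points that lie exactly on a line, the same function returns an SSE of zero, which is the perfect-fit case.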

In our example:
And the estimated standard error is:
That is, for any given age x, the standard error of SBP is estimated as mm Hg.

To address the question of association of x and y
We want to know if the slope is zero:
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
[Figure: scatter diagram of SBP against AGE]

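Now, if we assume that for any fixed value of x, y is normally distributed, then we can show that:
$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$
In practice, since $\sigma^2$ is unknown:
• Use $s_{y|x}^2$ in place of $\sigma^2$
• Use the t-distribution, with n-2 df, for hypothesis testing and CI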

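With these assumptions, to test
$H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
Test statistic:
$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$, where $SE(\hat{\beta}_1) = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}$, with n-2 df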
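The test of zero slope can be sketched in a few lines. The function name `slope_t_statistic` and the data are hypothetical. The critical value 2.306 is the two-sided 5% point of the t-distribution with 8 df from a standard t table, which is only appropriate because this made-up sample has n = 10.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, df) for testing H0: slope = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))   # estimated standard error of y given x
    se_b1 = s_yx / math.sqrt(s_xx)    # standard error of the slope estimate
    return b1 / se_b1, n - 2


# Hypothetical illustration data (NOT the data set used in the slides).
age = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

t_stat, df = slope_t_statistic(age, sbp)
T_CRIT_DF8 = 2.306  # assumption: two-sided 5% critical value for 8 df
reject_h0 = abs(t_stat) > T_CRIT_DF8
print(f"t = {t_stat:.2f} on {df} df; reject H0 at the 5% level: {reject_h0}")
```

Comparing against a tabulated critical value keeps the sketch free of external libraries. A statistics package would report the exact p-value instead.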

In our example:
The achieved significance is then:
With p < .05, reject $H_0$ and conclude that age (x) provides significant information for predicting SBP (y).

In Minitab, enter the data in 2 columns, for SBP and AGE, and select:
Stat > Regression > Regression
Response is the Y variable
Predictor is the X variable

Regression Analysis: spb versus age
The regression equation is
spb = age
Predictor   Coef   SE Coef   T   P
Constant
age
S =    R-Sq = 43.2%   R-Sq(adj) = 41.2%
Analysis of Variance
Source       DF   SS   MS   F   P
Regression
Error
Total

You'll note that a significance test is also provided for $\beta_0$:
$H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$
We are rarely interested in tests of $\beta_0$. The value x = 0 is often outside of the range of the data (e.g., here the youngest age is ~20). In this case $\beta_0$ can be interpreted as the predicted SBP at age = 0, which is not meaningful.
It is inappropriate to interpret regression relationships outside the range of the observed data.

A better model might exist (e.g., one with a curvilinear term), but there is a linear component.
[Figure: two scatter plots. In one, the linear model fits better than $\hat{y} = \bar{y}$; in the other, a curve would provide a better fit.]

Note: if $H_0: \beta_1 = 0$ is not rejected, it means either
• x provides little or no help in predicting y, or
• the true relationship between x and y is not linear.
[Figure: two scatter plots illustrating these two cases]
Note: even when $H_0: \beta_1 = 0$ is rejected, some other non-linear model may be better.

Part 2: The Correlation Coefficient
• Provides a measure of how 2 random variables are associated, without assuming any direction to the association (i.e., no sense that x is predictive of y, just that they are related)
• Also a measure of the strength of the straight-line relationship between X and Y
It can also be shown that:

Characteristics of correlation coefficient r:
• r = -1 implies perfect negative correlation
• r = 0 implies no linear correlation
• r = 1 implies perfect positive correlation
• r is dimensionless: it is independent of the units of x or y
• r always has the same sign as the slope
• r is the sample estimator of the population correlation $\rho$
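These properties can be checked numerically. The sketch below assumes the usual Pearson formula, $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, with sums taken about the sample means. The function name `pearson_r`, the data, and the unit conversions are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustration data (NOT the data set used in the slides).
age_years = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp_mmhg = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

r = pearson_r(age_years, sbp_mmhg)

# Dimensionless: changing units (years -> months, mm Hg -> kPa) leaves r unchanged.
age_months = [12 * a for a in age_years]
sbp_kpa = [0.133322 * s for s in sbp_mmhg]  # 1 mm Hg is about 0.133322 kPa
r_rescaled = pearson_r(age_months, sbp_kpa)

# Flipping the direction of one variable flips the sign of r (and of the slope).
r_flipped = pearson_r(age_years, [-s for s in sbp_mmhg])

print(f"r = {r:.3f}, rescaled r = {r_rescaled:.3f}, flipped r = {r_flipped:.3f}")
```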

If we divide the data into 4 quadrants by lines at the means of x and y and, for each point, examine the direction of the deviation from these means:
for $(x_i, y_i)$, examine the sign (+/-) of $(x_i - \mu_x)$ and $(y_i - \mu_y)$ for each quadrant …
[Figure: the x-y plane divided at $\mu_x$ and $\mu_y$ into quadrants I (upper right), II (upper left), III (lower left) and IV (lower right), with the sign of $x_i - \mu_x$ and $y_i - \mu_y$ marked in each]

Covariance between x and y:
Sign of the deviations and of their product in each quadrant:
Quadrant   $x_i - \mu_x$   $y_i - \mu_y$   $(x_i - \mu_x)(y_i - \mu_y)$
I               +               +               +
II              -               +               -
III             -               -               +
IV              +               -               -
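The quadrant argument can be illustrated with a short sketch that classifies each point by the signs of its deviations and accumulates the cross-products. It assumes the usual sample covariance with divisor n - 1 and uses sample means in place of $\mu_x$ and $\mu_y$. The function names and data are hypothetical.

```python
from collections import Counter


def quadrant(dx, dy):
    """Quadrant label from the signs of the deviations (dx, dy)."""
    if dx > 0 and dy > 0:
        return "I"
    if dx < 0 and dy > 0:
        return "II"
    if dx < 0 and dy < 0:
        return "III"
    if dx > 0 and dy < 0:
        return "IV"
    return "on a mean line"


def quadrant_counts_and_covariance(x, y):
    """Count points per quadrant and return the sample covariance s_xy."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    counts = Counter()
    cross_product_sum = 0.0
    for xi, yi in zip(x, y):
        dx, dy = xi - x_bar, yi - y_bar
        counts[quadrant(dx, dy)] += 1
        # Points in I and III add a positive term; II and IV add a negative one.
        cross_product_sum += dx * dy
    return counts, cross_product_sum / (n - 1)


# Hypothetical illustration data (NOT the data set used in the slides).
age = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

counts, s_xy = quadrant_counts_and_covariance(age, sbp)
print(dict(counts), f"s_xy = {s_xy:.1f}")
```

When most points fall in quadrants I and III, the positive products dominate and the covariance is positive.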

Correlation between x and y:
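Now, if points look like: [Figure: three scatter plots divided into quadrants]
• Since most points are in QI and QIII: $s_{xy} > 0$, so $r > 0$ and $\hat{\beta}_1 > 0$
• Since most points are in QII and QIV: $s_{xy} < 0$, so $r < 0$ and $\hat{\beta}_1 < 0$
• Since points are in all 4 quadrants: $s_{xy} = 0$, so $r = 0$ and $\hat{\beta}_1 = 0$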

[Figure: two scatter plots, (a) and (b)]
The correlation r in (a) is greater than r in (b), since the points are closer to the line in (a). This is true even when the slopes are the same.

Testing Hypotheses on Correlation:
To test $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$
Use: $t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$, with n-2 df
It is identical to testing for $\beta_1 = 0$.
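The equivalence of the two tests can be checked numerically. The sketch below computes the t statistic once from the slope and its standard error and once from r alone; the function name and data are hypothetical.

```python
import math


def slope_and_correlation_t(x, y):
    """Return (t_slope, t_corr): the t statistic computed two ways."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Route 1: slope estimate divided by its standard error.
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)
    t_slope = b1 / se_b1

    # Route 2: correlation coefficient only.
    r = s_xy / math.sqrt(s_xx * s_yy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr


# Hypothetical illustration data (NOT the data set used in the slides).
age = [22, 28, 35, 41, 46, 50, 55, 60, 64, 69]
sbp = [118, 121, 130, 128, 140, 137, 150, 146, 160, 158]

t_slope, t_corr = slope_and_correlation_t(age, sbp)
print(f"t from slope = {t_slope:.4f}, t from r = {t_corr:.4f}")
```

The two values agree up to floating-point rounding, which is why the tests for zero slope and zero correlation share the same p-value.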

In Minitab: Stat > Basic Stats > Correlation
Correlations: sbp, age
Pearson correlation of sbp and age = 0.658
P-Value = 0.000

Note that the Regression Analysis results provide a value for $r^2$ (see slide 25): R-Sq = 43.2%
Use this to compute $r = \sqrt{.432} = .657$ (positive, since the slope is positive).
We also have the significance test for zero correlation, $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$, since it is identical to the test of zero slope.

Regression and Correlation Analysis are closely related
Correlation
• evaluates the strength of a linear association
• does not impose any directionality on the relationship
Regression
• evaluates the strength of a linear relationship (slope of line)
• imposes a direction (e.g., age → SBP rather than the reverse)
The significance test on the slope, $\hat{\beta}_1$, is equivalent to the significance test on the correlation r.
