Correlation
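Definition
The statistic
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
is called Pearson's correlation coefficient.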
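As a minimal sketch, Pearson's correlation coefficient can be computed directly from the sums of squares and cross-products, $r = S_{xy}/\sqrt{S_{xx}S_{yy}}$. The function name `pearson_r` and the data below are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy) (hypothetical helper)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when x or y is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)  # here S_xy = 6, S_xx = 10, S_yy = 6
print(round(r, 4))
```

The explicit check for a constant variable is a design choice: in that case $S_{xx}$ or $S_{yy}$ is zero and r is not defined.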
Properties
1. $-1 \le r \le 1$, i.e. $|r| \le 1$, $r^2 \le 1$.
2. $|r| = 1$ ($r = +1$ or $-1$) if the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ lie along a straight line (positive slope for +1, negative slope for -1).
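Proof
Uses the Cauchy-Schwarz inequality.

Cauchy-Schwarz Inequality
Let $u_1, \ldots, u_n$ and $v_1, \ldots, v_n$ be real numbers. Then
$$\left(\sum_{i=1}^{n} u_i v_i\right)^2 \le \left(\sum_{i=1}^{n} u_i^2\right)\left(\sum_{i=1}^{n} v_i^2\right)$$
and equality holds if $v_i = b u_i$ for some $b$ and $i = 1, 2, \ldots, n$.

Proof:
Let
$$g(b) = \sum_{i=1}^{n}(v_i - b u_i)^2 \ge 0,$$
then
$$g(b) = \sum v_i^2 - 2b\sum u_i v_i + b^2\sum u_i^2 .$$
This is a quadratic function of $b$ and has a minimum when
$$g'(b) = -2\sum u_i v_i + 2b\sum u_i^2 = 0$$
or
$$b = b_{\min} = \frac{\sum u_i v_i}{\sum u_i^2},$$
hence
$$g(b_{\min}) = \sum v_i^2 - \frac{\left(\sum u_i v_i\right)^2}{\sum u_i^2} \ge 0 .$$
Thus
$$\left(\sum u_i v_i\right)^2 \le \left(\sum u_i^2\right)\left(\sum v_i^2\right)$$
and equality holds if $g(b_{\min}) = 0$, i.e. $v_i = b_{\min} u_i$ for $i = 1, 2, \ldots, n$.

Finally, take $u_i = x_i - \bar{x}$ and $v_i = y_i - \bar{y}$, and write $S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y})$, $S_{xx} = \sum (x_i-\bar{x})^2$, $S_{yy} = \sum (y_i-\bar{y})^2$. Then
$$S_{xy}^2 \le S_{xx} S_{yy}$$
or
$$r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} \le 1,$$
i.e. $-1 \le r \le 1$.

Also $r^2 = 1$, i.e. $|r| = 1$, if and only if $y_i - \bar{y} = b_{\min}(x_i - \bar{x})$ for all $i$, or
$$y_i = a + b_{\min} x_i \quad \text{with } a = \bar{y} - b_{\min}\bar{x}.$$

Note: $b_{\min} = S_{xy}/S_{xx}$ and $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$ have the same sign, so $r = +1$ when the line has positive slope and $r = -1$ when it has negative slope.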
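The argument used in the proof can be checked numerically. The sketch below evaluates the quadratic $g(b) = \sum (v_i - b u_i)^2$ at its minimizer $b_{\min} = \sum u_i v_i / \sum u_i^2$ and confirms that the Cauchy-Schwarz inequality holds for one pair of sequences. The values of `u` and `v` are hypothetical.

```python
# Hypothetical sequences, for illustration only
u = [1.0, -2.0, 0.5, 3.0]
v = [2.0, 1.0, -1.0, 4.0]

sum_uv = sum(ui * vi for ui, vi in zip(u, v))
sum_uu = sum(ui * ui for ui in u)
sum_vv = sum(vi * vi for vi in v)

def g(b):
    """Quadratic in b from the proof: sum of (v_i - b*u_i)^2."""
    return sum((vi - b * ui) ** 2 for ui, vi in zip(u, v))

# Minimizer of g and the minimum value
b_min = sum_uv / sum_uu
g_min = g(b_min)

# g_min >= 0 is equivalent to the Cauchy-Schwarz inequality
print(sum_uv ** 2 <= sum_uu * sum_vv)
print(abs(g_min - (sum_vv - sum_uv ** 2 / sum_uu)) < 1e-9)
```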
Properties of Pearson's correlation coefficient r
1. The value of r is always between -1 and +1.
2. If the relationship between X and Y is positive, then r will be positive.
3. If the relationship between X and Y is negative, then r will be negative.
4. If there is no linear relationship between X and Y, then r will be zero (or close to zero in a sample).
5. The value of r will be +1 if the points $(x_i, y_i)$ lie on a straight line with positive slope.
6. The value of r will be -1 if the points $(x_i, y_i)$ lie on a straight line with negative slope.
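The extreme cases in this list can be illustrated with a short sketch: points on an exact line give r = +1 or r = -1 according to the sign of the slope. The third case is an added caution rather than something stated in the list: r measures linear association, so it can be zero even when Y is an exact (nonlinear) function of X. The helper `pearson_r` and all data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy) (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Points on a line with positive slope: r = +1
r_up = pearson_r(x, [3 + 2 * xi for xi in x])

# Points on a line with negative slope: r = -1
r_down = pearson_r(x, [3 - 2 * xi for xi in x])

# Exact but nonlinear relationship y = x^2, symmetric about 0: r = 0
r_curve = pearson_r(x, [xi ** 2 for xi in x])

print(r_up, r_down, r_curve)
```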
[Figures: scatter plots illustrating r = 1, 0.95, 0.7, 0.4, 0, -0.4, -0.7, -0.8, -0.95 and -1]
The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
The test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
The critical region: reject H0 if $|t| > t_{\alpha/2}$ (df = n - 2).
This is a two-tailed critical region; the critical region could also be one-tailed.
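A minimal sketch of the two-tailed test follows, using the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with df = n - 2. The function name `correlation_t` and the values of r and n are hypothetical. The critical value is typed in from a t-table, assuming $\alpha = 0.05$, rather than computed.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: zero correlation (hypothetical helper)."""
    if n < 3:
        raise ValueError("need at least 3 observations (df = n - 2)")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical values, for illustration only
r, n = 0.6, 15
t = correlation_t(r, n)

# Table value t_{0.025} for df = 13 (assumes alpha = 0.05, two-tailed)
t_crit = 2.160
reject = abs(t) > t_crit
print(round(t, 3), reject)
```

For a one-tailed test, the comparison would use $t_{\alpha}$ and the sign of t instead of $|t|$.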
Example
In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance between the building that raises the alarm and the closest fire hall, and
2. Y = the cost of the damage (in $1000s).
The data were collected on n = 15 fires.
The Data
Scatter Plot
Computations
Computations Continued
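The correlation coefficient
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
The test for independence (zero correlation)
The test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
We reject H0: independence, if $|t| > t_{\alpha/2}$ with df = n - 2 = 13.
Using n = 15 and the value $r^2 = 0.923$ reported for this example, $|t| = \sqrt{r^2(n-2)/(1-r^2)} \approx 12.5$, which exceeds the critical value at any conventional significance level (for example $t_{0.025} = 2.160$ for df = 13).
H0: independence, is rejected.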
Relationship between Regression and Correlation
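Recall that the least squares estimate of the slope is
$$b = \frac{S_{xy}}{S_{xx}}$$
and
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},$$
where $S_{xy} = \sum (x_i-\bar{x})(y_i-\bar{y})$, $S_{xx} = \sum (x_i-\bar{x})^2$ and $S_{yy} = \sum (y_i-\bar{y})^2$. Since $S_{xy} = r\sqrt{S_{xx} S_{yy}}$,
$$b = r\sqrt{\frac{S_{yy}}{S_{xx}}} = r\,\frac{s_y}{s_x},$$
where $s_x$ and $s_y$ are the sample standard deviations of X and Y.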
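The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
Uses the test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Note:
$$b = r\sqrt{\frac{S_{yy}}{S_{xx}}} \quad\text{and}\quad s^2 = \frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2} = \frac{S_{yy}(1-r^2)}{n-2},$$
so that
$$\frac{b}{s/\sqrt{S_{xx}}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$
which is the test statistic for zero slope in the regression of Y on X.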
The two tests
1. The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
2. The test for zero slope
H0: $\beta = 0$
HA: $\beta \ne 0$
are equivalent.
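The equivalence of the two tests can be checked numerically. The sketch below assumes the usual simple linear regression test for the slope, $t = b/(s/\sqrt{S_{xx}})$ with $s^2 = RSS/(n-2)$, and compares it with $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The function name and the data are hypothetical.

```python
import math

def slope_and_correlation_t(x, y):
    """Return (t for zero slope, t for zero correlation); hypothetical helper."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    # Regression side: slope estimate, residual variance, t for H0: beta = 0
    b = s_xy / s_xx
    rss = s_yy - s_xy ** 2 / s_xx
    s = math.sqrt(rss / (n - 2))
    t_slope = b / (s / math.sqrt(s_xx))

    # Correlation side: r and t for H0: zero correlation
    r = s_xy / math.sqrt(s_xx * s_yy)
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return t_slope, t_corr

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_slope, t_corr = slope_and_correlation_t(x, y)
print(round(t_slope, 4), round(t_corr, 4))
```

Both statistics have df = n - 2, so the two tests lead to the same decision.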
The Coefficient of Determination
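The Residual Sum of Squares in Regression
$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad \hat{y}_i = a + b x_i \text{ (the least squares line)}$$
Note:
$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = S_{yy}(1 - r^2)$$
Proof
With $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$,
$$RSS = \sum\left[(y_i - \bar{y}) - b(x_i - \bar{x})\right]^2 = S_{yy} - 2bS_{xy} + b^2 S_{xx} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = S_{yy}(1 - r^2).$$
Hence $S_{yy} = S_{yy}(1 - r^2) + r^2 S_{yy}$, i.e.
Total Variance in Y = Variance Unexplained + Variance Explained
Proportion of Variance Unexplained = $RSS/S_{yy} = 1 - r^2$
Proportion of Variance Explained = 1 - Proportion of Variance Unexplained = $r^2$
$r^2$ is called the Coefficient of Determination.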
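The decomposition can be verified with a short sketch that fits the least squares line, computes the residual sum of squares directly from the residuals, and compares the unexplained proportion with $1 - r^2$. The function name and the data are hypothetical.

```python
def variance_decomposition(x, y):
    """Return (S_yy, RSS, r^2) for the least squares line; hypothetical helper."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    # Least squares slope and intercept
    b = s_xy / s_xx
    a = y_bar - b * x_bar

    # Residual sum of squares computed from the fitted values
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = s_xy ** 2 / (s_xx * s_yy)
    return s_yy, rss, r2

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
s_yy, rss, r2 = variance_decomposition(x, y)

unexplained = rss / s_yy     # proportion of variance unexplained
explained = 1 - unexplained  # proportion of variance explained
print(round(unexplained, 3), round(explained, 3), round(r2, 3))
```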
Example: Fire Example
$r^2$ = the Coefficient of Determination = 0.923 (92.3%) = Proportion of Variance in Y (Cost of Damage) explained by X (distance to closest fire hall).
Proportion of Variance Unexplained = $1 - r^2$ = 0.077 (7.7%)