The Coefficient of Determination: r² (Section 3.3.2)
Starter
Write a description of what it means to say that there is a negative association between two variables. (Don’t tell me about the graph!)
If there is a strong negative association:
  – What value would you expect r to take on?
  – What would you expect the graph to look like?
Objective
Use the calculator LinReg command to find the equation of an LSRL
Interpret the meaning of r² in the context of the data
To find r and the LSRL on the calculator:
Enter the explanatory variable data (FEMUR) into L1
Enter the response variable data (HUMER) into L2
Tap ZOOM:9 to see the scatterplot as usual
In the STAT:CALC menu, choose 8:LinReg(a+bx)
  – Follow the command with L1,L2,Y1
  – Find Y1 under VARS:Y-VARS:Function…
Tap ENTER to run the command
  – The screen will show a, b, r and r²
  – If you don’t see r and r², run DiagnosticOn (from the CATALOG) and then run LinReg again
Tap GRAPH to see the LSRL on the scatterplot
To predict y when x = 47
  – Trace to x = 47 on the Y1 graph or enter Y1(47)
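For anyone who wants to check the calculator by hand, here is a minimal Python sketch of what LinReg(a+bx) reports. The function name `lin_reg` and the five data points are made up for illustration; they are not the archaeopteryx lists. The same arithmetic works for any two lists of equal length.

```python
from math import sqrt

def lin_reg(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products of deviations from the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the line passes through (x_bar, y_bar)
    r = sxy / sqrt(sxx * syy)  # correlation
    return a, b, r

# Hypothetical illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = lin_reg(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
# prints: a = 2.200, b = 0.600, r = 0.775, r^2 = 0.600

# Predicting y at a chosen x plays the role of Y1(x) on the calculator
y_pred = a + b * 6
print(f"predicted y at x = 6: {y_pred:.3f}")
# prints: predicted y at x = 6: 5.800
```

The output lines mirror the a, b, r and r² shown on the calculator screen after LinReg runs.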
The Sanchez Data Again
Put the Sanchez lists (GASDA and GASFT) into L1 and L2
Run LinReg again with these data
  – Sketch the scatterplot and regression line
  – Write the values of r and r²
  – Write the LSRL equation in context
Predict the amount of gas used in a month with 19 degree-days
Answers
You should find a ≈ 1.09 and b = 0.189
  – So: gas used = 1.09 + 0.189 × coldness
  – Note that the equation is in the context of the problem
You should find r = 0.995
  – This is a very strong positive linear association
You should also find r² = 0.991
  – Save this for later use
You should find that 19 degree-days predicts 468 cu ft of gas (4.68 hundreds)
Variability in Linear Associations
Consider the response variable “humerus length” in the archaeopteryx data
  – Were all specimens the same length?
  – They were not, so why?
There are two possibilities:
Larger or smaller animals should still have the same proportions, so the association described by the LSRL leads to larger or smaller y values
OR…
Random variation: in other words, chance!
So which is it? Actually it’s both
  – So how can we quantify the two causes?
Quantifying Random Variation
All the y values vary about the y-mean
  – Some are greater than the mean, some are less
  – The sum of all these deviations is zero
  – But the sum of their squares is not zero
See the example on page 146
So do they also vary about the LSRL, or do they lie exactly on the line?
  – If all the points lie on the line, then the y-variability must all have come from the linear association.
  – If points randomly miss the line, then there are non-zero deviations from the line, and the sum of their squares is not zero.
See the example on page 147
So the ratio of these two sums of squares (areas) can be a measure of random variation.
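The claim about deviations from the mean is easy to check numerically. This is a small Python sketch with made-up y values (they are not the textbook's example from page 146): the deviations cancel out, but their squares do not.

```python
# Hypothetical illustrative y values (not from the slides)
y = [2, 4, 5, 4, 5]

y_bar = sum(y) / len(y)                 # the y-mean
deviations = [v - y_bar for v in y]     # each value minus the mean

sum_dev = sum(deviations)               # positives and negatives cancel
sum_sq = sum(d ** 2 for d in deviations)  # squares are never negative

print("deviations:", deviations)
print("sum of deviations:", sum_dev)
print("sum of squared deviations:", sum_sq)
```

Squaring removes the signs, which is why the sum of squares can serve as a measure of total variation when the plain sum cannot.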
Finding the Ratio of Areas
The first area (squares of deviations about ȳ, the y-mean) is called SSM: Sum of Squares about the Mean
The second area (squares of deviations about y-hat) is called SSE: Sum of Squares for Error
Then the ratio SSE/SSM tells us how much of the y-variability is due to random chance
So 1 − SSE/SSM tells us how much of the y-variation is due to the association
  – The author expresses this as (SSM − SSE)/SSM on page 147
  – Note that if all points are on the LSRL, then SSE = 0, so 100% of the y-variability is due to the linear association
Here’s the punch line:
  – First find r, the correlation coefficient, by linear regression
  – Then it turns out that r² is equal to the fraction (SSM − SSE)/SSM
  – It is called the coefficient of determination
  – The author chooses to skip the proof; so do I!
  – Note that when you run LinReg to find r, you also get r² at the same time.
So r² expresses the proportion of y-variation that is due to the linear association, and 1 − r² is the proportion that is due to random chance.
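The proof is skipped here, but the identity can at least be checked on numbers. The following Python sketch uses made-up data (not the archaeopteryx or Sanchez lists) to compute SSM, SSE, and r, and then compares (SSM − SSE)/SSM with r². The variable names are chosen for illustration only.

```python
from math import sqrt

# Hypothetical illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Predicted values (y-hat) on the LSRL
y_hat = [a + b * xi for xi in x]

# SSM: squared deviations of y about the y-mean
ssm = sum((yi - y_bar) ** 2 for yi in y)
# SSE: squared deviations of y about the line (the residuals)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r = sxy / sqrt(sxx * ssm)        # correlation
fraction = (ssm - sse) / ssm     # share of y-variation tied to the line

print(f"SSM = {ssm:.3f}, SSE = {sse:.3f}")
print(f"(SSM - SSE)/SSM = {fraction:.3f}")
print(f"r^2             = {r * r:.3f}")
```

For these five points SSM = 6 and SSE = 2.4, so both the fraction and r² come out to 0.6. A single example is a check, not a proof.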
The Sanchez Data
We previously pasted the Sanchez lists (GASDA & GASFT) into L1 & L2 and ran LinReg to find r and r²
Write a sentence that answers this question: What proportion of the variability in the gas usage data is attributable to the linear association with coldness of weather (as measured in degree-days)?
Since r² = 0.991, we conclude that about 99% of the variability in gas usage can be accounted for by the least-squares regression line, and about 1% is due to random chance.
Objective
Use the calculator LinReg command to find the equation of an LSRL
Interpret the meaning of r² in the context of the data
Homework
Read pages 144–150
Do problems 36, 37, 38