Unit 2 Exploring Data: Comparisons and Relationships Topic 11 Least Squares Regression II (page 227)
Lists you will need in your calculator for this topic: DISTI ENROL FACSZ GEST LFEXP LONG PERTV REV
OVERVIEW This topic extends your study of least squares regression. You will examine the impact that a single observation can have on a regression analysis, learn how to use residual plots to indicate when the linear relationship is not appropriate, and discover transformation of variables as a way to use regression even when the relationship between the variables is not linear.
Do the Preliminaries (pages 227 & 228)
Essential Question What is the distinction between outliers and influential observations, and why does each matter in a regression analysis?
Activity 11-1 Gestation and Longevity (pages 228 to 232) [Copy lists named GEST and LONG.] Gestation = ________ + ________ * Longevity (y = a + b x) Answers: a = 21.71, b = 13.13
[Scatterplot: gestation vs. longevity]
Gestation = 21.71 + 13.13 * Longevity For every additional year of longevity, one expects the animal's gestation period to _________________ by ________ days. The proportion of the variability in animals’ gestation period that is explained by the regression line is ______________. Answers: increase, 13, r2 = 44%
(d) [Plot: residuals vs. longevity]
Is there any relationship between residuals and longevities? _____ [explain] Answer: NO. It seems as though the predictions are generally closer when the longevity is very small.
Gestation = 21.71 + 13.13 * Longevity (e) The __________________ is clearly an outlier both in longevity and in gestation period. Its residual value is _____________ days. This animal ________________ have the largest residual value (in absolute value). Answers: elephant, 98, does not. Elephant: gestation = 645 days & longevity = 40 years
[Figure: gestation vs. longevity, elephant labeled]
[Figure: plot against longevity, elephant labeled]
Outliers in regression are observations with _________ residuals (in absolute value), i.e., outliers fall far from the __________________, not following the pattern of the ________________ apparent in the others. Answers: large, regression line, relationship
Gestation = 21.71 + 13.13 * Longevity The animal with the largest (in absolute value) residual is the _____________. Its residual value is _____________ days. It has a _____________ gestation period for an animal with its longevity. Answers: giraffe, 272, longer. Giraffe: gestation = 425 days & longevity = 10 years
[Figure: gestation vs. longevity, giraffe labeled]
[Figure: plot against longevity, giraffe labeled]
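The two residuals above can be checked by hand or with a short script. The sketch below is in Python rather than on the calculator; the function names are made up for illustration, and only the fitted equation and the two animals' values come from the activity.

```python
# Fitted line from the activity: gestation = 21.71 + 13.13 * longevity
INTERCEPT = 21.71  # a, in days
SLOPE = 13.13      # b, in days per year of longevity


def predict_gestation(longevity_years):
    """Predicted gestation (days) from the fitted line."""
    return INTERCEPT + SLOPE * longevity_years


def residual(observed_gestation, longevity_years):
    """Residual = observed value minus predicted value."""
    return observed_gestation - predict_gestation(longevity_years)


# (gestation in days, longevity in years), as given in the activity
animals = {"elephant": (645, 40), "giraffe": (425, 10)}

residuals = {name: residual(g, lon) for name, (g, lon) in animals.items()}
for name, res in residuals.items():
    print(f"{name}: residual = {res:.0f} days")
```

The elephant is extreme in both variables, yet its residual (about 98 days) is much smaller than the giraffe's (about 272 days), which is why the giraffe, not the elephant, is the outlier in the regression sense.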
(g) Remove the giraffe’s information. Gestation = ________ + ________ * Longevity (y = a + b x) r2 = ____________ Answers: a = 9.03, b = 13.56, r2 = 50%
(h) Is the new regression line substantially different from the actual one? ______ Answer: NO [Scatterplot: gestation vs. longevity]
(i) Return the giraffe’s information and remove the elephant’s information. Gestation = ________ + ________ * Longevity (y = a + b x) r2 = _____________ Answers: a = 44.97, b = 11.06, r2 = 27%
(j) The removal of the __________________ affected the graph more. Answer: elephant [Scatterplot: gestation vs. longevity]
In a least squares regression, an ____________ observation is an observation whose removal would substantially ______ the regression line. Observations with __________ values of the predictor variable are ___________ influential. In this activity, the _______________ is an _____________ observation due to its _________________ long lifetime. Answers: influential, affect, extreme, potentially, elephant, influential, exceptionally
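The idea of influence can be tried out by refitting the line with one point left out. The sketch below is only an illustration in Python: the data are MADE UP (they are not the GEST and LONG lists), and the helper names are hypothetical. One made-up point has a large residual at an ordinary x value, and another has an extreme x value.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def slope_without(xs, ys, index):
    """Slope of the line refitted after dropping the point at `index`."""
    xs_kept = xs[:index] + xs[index + 1:]
    ys_kept = ys[:index] + ys[index + 1:]
    return least_squares(xs_kept, ys_kept)[1]


# MADE-UP toy data: five points on y = 2x, plus two unusual points.
xs = [1, 2, 3, 4, 5, 3, 20]
ys = [2, 4, 6, 8, 10, 16, 10]
MID_X_POINT = 5      # index of (3, 16): ordinary x, unusual y
EXTREME_X_POINT = 6  # index of (20, 10): extreme x

full_slope = least_squares(xs, ys)[1]
change_mid = slope_without(xs, ys, MID_X_POINT) - full_slope
change_extreme = slope_without(xs, ys, EXTREME_X_POINT) - full_slope

print(f"slope with all points: {full_slope:.2f}")
print(f"change after dropping the mid-x point: {change_mid:+.2f}")
print(f"change after dropping the extreme-x point: {change_extreme:+.2f}")
```

In this toy example, dropping the extreme-x point moves the slope far more than dropping the mid-x point, which mirrors what parts (g) through (j) show for the elephant and the giraffe.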
(k) Change the elephant’s information from 645 days to 45 days. Gestation = ________ + ________ * Longevity (y = a + b x) r2 = _____________ Answers: a = 110.2, b = 5.26, r2 = 9%
See the influence of changing the elephant’s gestation from 645 days to 45 days. [Scatterplot: gestation vs. longevity]
[Figure: gestation vs. longevity, with the elephant labeled as an influential observation and the giraffe labeled as an outlier]
[Figure: plot against longevity, with the giraffe labeled as an outlier and the elephant labeled as an influential observation]
Assignment Activity 11-12: College Enrollments (page 242) Assignment Activity 11-15: Turnpike Tolls (continued) (page 243)
Essential Question How can residual plots be used to indicate whether a linear model is satisfactory for describing the relationship between two variables?
Activity 11-2 Residual Plots (pages 232 to 235)
(a)
_______ 1. MPG rating vs. weight for sports cars
_______ 2. distance from sun vs. position number for planets
_______ 3. rent vs. price for Monopoly properties
_______ 4. airfare vs. distance for selected destinations
Answers: 1 is d, 2 is b, 3 is a, 4 is c
(b)
__________ The residuals are randomly scattered. Answer: d (1)
__________ The residuals are largely randomly scattered except for two very large negative residuals. Answer: c (4)
__________ The residuals show a distinct curved pattern. Answer: b (2)
__________ The residuals show a clear linear pattern with three severe outliers. Answer: a (3)
(c) Plots ______ and ______ summarize the relationship in the data about as well as possible. … because the points fall roughly ___________ about the least squares regression line. Plots ______ and ______ fail to capture important aspects of the relationship. … because they would best be described by some type of _____________. Answers: 1, 4, evenly; 2, 3, curve
(d) Do the scatterplots where the lines summarize the data about as well as possible correspond to the highest values of r2? ______ … because more points fall _________ to the line in the other two plots, although they are not best modeled by a linear fit. Answers: NO, closer
Residual plots can indicate when a linear model does ______ adequately describe the relationship in the data. When a straight line is a reasonable model, the residual plot should reveal a seemingly ____________ scattering of points. When a nonlinear model would fit the data better, the residual plot reveals a _____________ of some kind. The value of r2 alone is not sufficient for assessing the fit of the ___________ model. Answers: not, random, pattern, linear
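A small example shows why r2 alone is not enough. The sketch below is in Python with MADE-UP data that follow y = x squared exactly (they are not from any of the activity lists); the helper name is hypothetical. It fits a straight line, then lists the residuals and r2.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# MADE-UP curved data: y is exactly x squared.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# r2 = 1 - (sum of squared residuals) / (total sum of squares of y)
mean_y = sum(ys) / len(ys)
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_resid = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_resid / ss_total

print("residuals:", [round(e, 2) for e in residuals])
print(f"r2 = {r_squared:.2f}")
```

The residuals are positive at both ends and negative in the middle, a curved pattern, even though r2 for the straight line is above 0.9. The residual plot reveals what r2 hides.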
Essential Question How can variables be transformed to create a linear relationship between variables?
Activity 11-3 Televisions and Life Expectancy (continued) (pages 235 to 237) When a straight line is not the best mathematical description of a relationship, one can _______________ one or both variables to make the association more __________. Answers: transform, linear
(a) [Use the lists named LFEXP and PERTV.] life expectancy = ________ + _______ * Per TV (y = a + b x) r2 = _____________ The correlation coefficient between life expectancy and people per TV is ___________. Does the relationship appear to be linear? _______ Answers: a = 70.72, b = -0.12, r2 = 0.6461, r = -0.8038, NO
Create a new variable on your calculator named LPRTV for the logarithm of the number of people per television. Enter on your home screen … log(LPERTV) ➔ LPRTV Then press ENTER. Use the Set Up Editor to check your values: SetUpEditor PERTV, LPRTV Then go to Edit to check your values.
Enter on your home screen … LinReg(a+bx) LLPRTV, LLFEXP, Y1 Then press ENTER. life expectancy = ________ + __________ * Log(Per TV) (y = a + b x) r2 = _______________ The correlation coefficient between life expectancy and log of the number of people per TV is ______________. Answers: a = 80.59, b = -13.33, r2 = 0.8505, r = -0.9222
life expectancy = 80.59 - 13.33 * Log(Per TV) (y = a + b x) r2 = 0.8505 r = -0.9222 (d) The proportion of the variability in the countries’ life expectancies that is explained by the regression equation with the log of the number of people per television is ___________. Answer: 85% (usually written as a percent)
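The r2 values in this activity are just the squares of the correlation coefficients, which is easy to confirm. A quick check in Python (the variable names are made up; the two r values are the ones reported above):

```python
# Correlation coefficients reported in the activity
r_raw = -0.8038  # life expectancy vs. people per TV
r_log = -0.9222  # life expectancy vs. log(people per TV)

# r2 is the square of r, so the sign of r disappears
r2_raw = r_raw ** 2
r2_log = r_log ** 2

print(f"untransformed: r2 = {r2_raw:.4f}")
print(f"log-transformed: r2 = {r2_log:.4f}")
```

Rounded to four places these match the 0.6461 and 0.8505 from the calculator, so the transformed model explains about 85% of the variability compared with about 65% before the transformation.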
life expectancy = 80.59 - 13.33 * Log(Per TV) (y = a + b x) (e) For a country with 10 people / TV, the equation predicts a life expectancy of _______ years. Please note, the predictor is NOT PerTV, it is Log(PerTV). To get the correct number of years, you must enter … Y1(log(10)) Answer: 67
life expectancy = 80.59 - 13.33 * Log(Per TV) (y = a + b x) (f) For a country with 100 people / TV, the equation predicts a life expectancy of ______ years. To get the correct number of years, you must enter … Y1(log(100)) Part (f) 54 minus part (e) 67 equals ______ years, which is the ______________________________. Answers: 54, -13, slope coefficient
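The two predictions can also be reproduced off the calculator. The sketch below is in Python and assumes, as on the TI calculators, that log means the base-10 logarithm; the function name is made up for illustration.

```python
import math

# Fitted line from the activity:
# life expectancy = 80.59 - 13.33 * log10(people per TV)
INTERCEPT = 80.59
SLOPE = -13.33


def predict_life_expectancy(people_per_tv):
    """Predicted life expectancy (years); the predictor is log10 of people per TV."""
    return INTERCEPT + SLOPE * math.log10(people_per_tv)


at_10 = predict_life_expectancy(10)    # log10(10) = 1
at_100 = predict_life_expectancy(100)  # log10(100) = 2

print(f"10 people per TV:  {at_10:.0f} years")
print(f"100 people per TV: {at_100:.0f} years")
print(f"difference: {at_100 - at_10:.2f} years")
```

Going from 10 to 100 people per TV raises the log by exactly 1, so the prediction changes by exactly the slope, -13.33 years. With a log predictor, the slope describes the change for each tenfold increase in people per TV.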
(g) Does the residual plot reveal any clear pattern? ______ Answer: NO [Residual plot: residuals vs. Log(Per TV)]
The residual plot is randomly scattered! [Residual plot: residuals vs. Log(Per TV)]
(h) The linear regression model is a better fit with the _____________________ data. Answer: transformed. Original: life expectancy = 70.72 - 0.12 * Per TV. Transformed: life expectancy = 80.59 - 13.33 * Log(Per TV)
WRAP-UP This topic has extended your study of least squares regression. You have examined outliers and influential observations, noting the effects that they can have on regression analysis. You have also learned to use residual plots to judge whether a nonlinear model might better describe the relationship between two variables, and you have discovered how to transform variables when such a nonlinear model is called for.
WRAP-UP This unit and the previous one have addressed exploratory analyses of data. In the next unit you will begin to study background ideas related to the general issue of drawing inferences from data. You will find that drawing meaningful inferences depends on having collected data well in the first place, and you will study again the ideas of random sampling and randomization that you encountered earlier. Your study of randomness will lay the foundation for procedures of statistical inference.
Review from Topic 10 … r tells how well the data fit the line, hence the name “line of best fit.” r2 gives the percent of the variability of the “y’s” accounted for in the model. For example, if r2 = 60%, then 40% is not accounted for in the model. There could be other variations … what are they?
Review on Residuals from Topic 10 … Residuals: try to put the points in a rectangle, like the viewing screen on the calculator. Are they randomly scattered or are they in a pattern? The BEST two indicators of a good fitting line are r2 and residuals.
Review from Topic 10 … r works only for linear models. r2 works for all models. r2 increases (or at least does not decrease) as higher order terms are added to the model.
Review from Topic 11 … Residual plots can indicate when a linear model does not adequately describe the relationship in the data. When a straight line is a reasonable model, the residual plot should reveal a seemingly random scattering of points. When a nonlinear model would fit the data better, the residual plot reveals a pattern of some kind. The value of r2 alone is not sufficient for assessing the fit of the linear model.
Assignment Activity 11-5: Planetary Measurements (continued) (page 238)
Your topic is due! Quiz on Topic 11: Least Squares Regression II Prepare for a test on the second part of Unit 2 … topics 8 to 11.