Chapter 9 Regression Wisdom
* Subsets
* Extrapolation
* Outliers, Leverage, and Influence Points
* Lurking Variables
Subsets
* The data should be homogeneous (of the same or a similar kind or nature).
* If the data are made up of two or more groups that have been thrown together, it is usually best to fit a separate linear model to each group.
* Residual plots can help find subsets in the data.
Cereal – without subgroups
Cereal – with subgroups
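The subgroup idea can be sketched in Python. This is a minimal illustration on made-up numbers (not the cereal data shown on the slides), assuming NumPy is available; the group names and values are hypothetical. It fits one line to the pooled data and a separate line to each group.

```python
import numpy as np

# Hypothetical data: two groups that follow different linear trends.
x_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_a = 2.0 * x_a + 1.0      # group A rises with x
x_b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_b = -1.0 * x_b + 20.0    # group B falls with x

# Pooled data: both groups thrown together.
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

# np.polyfit with degree 1 returns (slope, intercept).
slope_all, intercept_all = np.polyfit(x_all, y_all, 1)
slope_a, intercept_a = np.polyfit(x_a, y_a, 1)
slope_b, intercept_b = np.polyfit(x_b, y_b, 1)

print(f"pooled slope:  {slope_all:.2f}")
print(f"group A slope: {slope_a:.2f}")
print(f"group B slope: {slope_b:.2f}")
```

The pooled slope lands between the two group slopes and describes neither group well, which is why separate models are usually the better choice.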
Extrapolation
* Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model.
* Such extrapolation may pretend to see into the future, but the predictions should not be trusted.
* Example: data on the number of women in elected positions in Massachusetts were collected from 1945 to 2000. We should NOT use the model to predict how many women will hold office in 2015.
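A small sketch of why extrapolation is risky, assuming NumPy and entirely made-up data (not the Massachusetts data): the true relationship here is assumed to be curved, but only a narrow range of x is observed and a straight line is fit to it. The function name `predict` is hypothetical.

```python
import numpy as np

# Hypothetical data: the true relationship is curved (y = x^2),
# but we only observe x from 0 to 4 and fit a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

slope, intercept = np.polyfit(x, y, 1)

def predict(x_new):
    """Prediction from the fitted straight line."""
    return intercept + slope * x_new

# Inside the observed range the line misses by a little...
error_near = abs(predict(2.0) - 2.0 ** 2)
# ...but far outside the range it misses by a lot.
error_far = abs(predict(10.0) - 10.0 ** 2)

print(f"error at x = 2:  {error_near:.1f}")
print(f"error at x = 10: {error_far:.1f}")
```

The line looks acceptable where the data live, yet nothing in the fit warns us about what happens beyond that range.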
Homework
a: From 1900 – 1940 there is a linear pattern; from 1940 – 1970 the data curve upward; from 1970 – 2000 there is a strong linear pattern.
b: Relatively strong from 1970 – 2000.
c: No, not on the whole graph. If we look only at 1970 – 2000, then yes, there would be a high correlation.
d: No, it's not straight enough.
Homework
a: Plot the data from 1955 – 1995. The scatterplot has a slight curve. Check the residual plot!! The residual plot has a pattern to it, so this is not a good place to use a linear model. If you did find an equation, the predicted value would be 25.3 years.
b: Not too much. The data are not straight enough to use a linear regression.
c: 50 years is too far from the data to make a prediction.
Homework
a: Knowing only the R² value is not enough to justify using a linear regression. We need to check a residual plot and the 3 conditions (straight enough, quantitative variables, and no outliers).
b: No, a linear model might not even fit.
Homework
a: For every degree the temperature rises, the cost goes down $2.13.
b: The cost when the temperature is 0°F.
c: Too high; the residual is negative, showing that the model overestimates the cost.
d: Cost = $111.70
e: Actual = $106.70
f: No, the residual plot has a curve to it. The data are probably not linear.
g: No, there would be no change. The relationship does not depend on the units.
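The link between answers c, d, and e is the definition of a residual. As a worked check, using only the two dollar amounts given in d and e:

```latex
\text{residual} = \text{actual} - \text{predicted} = 106.70 - 111.70 = -5.00
```

A negative residual means the actual cost falls below the prediction, which is why the model's estimate is too high.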
Outlier
* Any data point that stands away from the others
* In regression, outliers can be extraordinary in two ways:
  * having a large residual
  * having high leverage
Remember
* Linear models do not fit values with large residuals well.
* Large residuals always need a second look.
Leverage
* Data points whose x-values are far from the mean of x are said to exert leverage on a linear model.
* High-leverage points pull the line close to them:
  * they can have a large effect on the line
  * with high enough leverage, they can completely determine the slope and the y-intercept
  * their residuals can be deceptively small
Leverage Points
* A linear regression line goes through the point (x̄, ȳ).
* Think of this point as the fulcrum of a lever.
* The farther away a point is from the fulcrum, the more leverage it has.
* High leverage has the potential to change the regression line.
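A least-squares line always passes through the point of means (x̄, ȳ), which is the fulcrum in the lever picture. The sketch below checks this numerically on hypothetical data, assuming NumPy is available.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.0, 6.0, 5.0, 10.0, 11.0])

# np.polyfit with degree 1 returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

x_bar, y_bar = x.mean(), y.mean()
# Height of the fitted line at x = x_bar.
line_at_x_bar = intercept + slope * x_bar

print(f"mean point:    ({x_bar:.2f}, {y_bar:.2f})")
print(f"line at x-bar: {line_at_x_bar:.2f}")
```

The two printed y-values agree, so a point far from x̄ sits far from the pivot and has more room to tilt the line.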
High Leverage Points
* How to decide if the point will change the regression model:
  * Find the regression model with and without the leverage point.
  * The point is influential if there is a big change in the model.
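The with-and-without comparison can be sketched as follows. The data are hypothetical and the helper name `fit_line` is made up for this example; NumPy is assumed.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Hypothetical data: five points on the line y = x ...
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# ... plus one point far to the right that does not follow the pattern.
x_with = np.append(x, 20.0)
y_with = np.append(y, 5.0)

slope_without, intercept_without = fit_line(x, y)
slope_with, intercept_with = fit_line(x_with, y_with)

print(f"without point: slope = {slope_without:.2f}, intercept = {intercept_without:.2f}")
print(f"with point:    slope = {slope_with:.2f}, intercept = {intercept_with:.2f}")
```

Here the slope changes a great deal when the point is included, so this point would be called influential.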
Influence
* Depends on both leverage and residual
* High-leverage point whose y-value is on the line: the point is NOT influential
* Moderate-leverage point with a very high residual: the point is influential
* YOU HAVE TO CHECK THE MODELS!!
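Both cases can be checked on hypothetical numbers. In this sketch (NumPy assumed, helper name `slope_of` made up), case 1 adds a far-away point that sits exactly on the existing line, and case 2 adds a nearer point with a large residual.

```python
import numpy as np

def slope_of(x, y):
    """Slope of the least-squares line through (x, y)."""
    return np.polyfit(x, y, 1)[0]

# Hypothetical base data with a little scatter.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])
base_slope, base_intercept = np.polyfit(x, y, 1)

# Case 1: high leverage (x = 20), placed exactly on the base line.
y_on_line = base_intercept + base_slope * 20.0
slope_case1 = slope_of(np.append(x, 20.0), np.append(y, y_on_line))

# Case 2: moderate leverage (x = 6), but far above the base line.
slope_case2 = slope_of(np.append(x, 6.0), np.append(y, 30.0))

change_case1 = abs(slope_case1 - base_slope)
change_case2 = abs(slope_case2 - base_slope)

print(f"slope change, on-line high-leverage point: {change_case1:.4f}")
print(f"slope change, off-line moderate-leverage point: {change_case2:.4f}")
```

The first point leaves the slope essentially unchanged despite its leverage, while the second shifts it noticeably, so leverage alone does not settle whether a point is influential.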
Unusual Points
* Unusual points can sometimes tell us more about a model or data than any other point.
* A model based on 1 point is unlikely to help us understand the rest of the data.
* Looking at 1 point against the rest of the data is the best way to understand the point.
Warning!
* Do NOT throw away points!!!!
* Take out unusual points to look at the model without them.
* Throwing them away can give us a false sense of how accurate the model is.
* Look for the unusual points in the scatterplot; they can hide in the residual plots.
Checking In
Each of these scatterplots shows an unusual point. For each, tell whether the point is a high-leverage point, would have a large residual, or is influential.
Causation
* No matter how strong the association…
* No matter how large the R² value…
* No matter how straight the line is…
* you can NOT conclude from the regression alone that one variable CAUSES another
Lurking Variable
* Mainly a concern for observational data, as opposed to data from a designed experiment
* We cannot be sure that a lurking variable is not the cause of a strong or weak association
Life Expectancy
The relationship between life expectancy (years) and the availability of doctors (measured as √(doctors/person)) for the countries of the world
Life Expectancy
The relationship between life expectancy (years) and the availability of TVs (measured as √(TVs/person)) for the countries of the world
Means vary less than individual values
* Warning!! Summarized data can give a false sense of how good an association is.
* Weight (lb) against height (in) for a sample of men: R² = 41.5%
* Mean weight (lb) against height (in): R² = 80.1%
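The effect of summarizing can be reproduced on made-up numbers. The sketch below assumes NumPy; the heights, weights, and the helper name `r_squared` are hypothetical and are not the sample of men from the slide. It compares R² for individual values with R² for the mean at each height.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression: the squared correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

# Hypothetical data: 4 heights (in), 3 men measured at each height.
heights = np.array([66.0, 68.0, 70.0, 72.0])
# Individual departures (lb) from an assumed trend of 5 * height - 180.
offsets = np.array([
    [-25.0, 0.0, 31.0],
    [-30.0, -3.0, 27.0],
    [-28.0, 2.0, 29.0],
    [-31.0, 0.0, 28.0],
])
weights = (5.0 * heights - 180.0)[:, None] + offsets  # shape (4, 3)

# Individual values: one (height, weight) pair per man.
x_individual = np.repeat(heights, 3)
y_individual = weights.ravel()
r2_individual = r_squared(x_individual, y_individual)

# Summarized values: one mean weight per height.
mean_weights = weights.mean(axis=1)
r2_means = r_squared(heights, mean_weights)

print(f"R^2, individual values: {r2_individual:.1%}")
print(f"R^2, mean weights:      {r2_means:.1%}")
```

Averaging removes most of the person-to-person scatter, so the means hug the line much more closely than the individuals do.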
Homework # 10
* slope = -0.1: for every 1 mph increase in speed, mpg decreases by 0.1
* y-int = 32: the y-intercept would be your mpg at 0 mph
* the residuals are negative, so the model is overestimating mpg
* 27 mpg
* predicted = 27.5 mpg; 27.5 mpg + 1 (residual) = 28.5 mpg
* strong but not linear
* no, the residual plot shows the data are not linear
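The arithmetic behind these answers can be checked from the stated slope and y-intercept. This sketch assumes the model is mpg = 32 - 0.1 × speed, as the first two answers imply. The slide does not state the speeds used in the problem, so the code backs them out from the predicted values instead of assuming them; the function names are hypothetical.

```python
SLOPE = -0.1      # change in mpg per 1 mph (from the answer key)
INTERCEPT = 32.0  # predicted mpg at 0 mph (from the answer key)

def predicted_mpg(speed_mph):
    """Predicted mpg from the assumed linear model."""
    return INTERCEPT + SLOPE * speed_mph

def speed_for_prediction(mpg):
    """Invert the model: which speed gives this predicted mpg?"""
    return (mpg - INTERCEPT) / SLOPE

# Speeds implied by the two predicted values in the answers.
speed_a = speed_for_prediction(27.0)
speed_b = speed_for_prediction(27.5)

# actual = predicted + residual
actual_mpg = 27.5 + 1.0

print(f"27.0 mpg is predicted at about {speed_a:.0f} mph")
print(f"27.5 mpg is predicted at about {speed_b:.0f} mph")
print(f"actual mpg with a residual of +1: {actual_mpg}")
```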
Homework # 11 a
* high leverage, low residual
* no, not influential to the slope
* correlation would decrease
* the slope would stay about the same because the point is on the line
Homework # 11 b
1. high leverage, small residual (remember, the point is pulling the line toward it)
2. yes, influential
3. correlation would weaken and become less negative
4. the slope would increase toward 0
Homework # 11 c
* some leverage, high residual
* slightly influential
* correlation would increase because scatter would decrease
* slope would increase
Homework # 11 d
* low/no leverage, high residual
* not influential
* correlation would become stronger
* slope would increase
Homework # 15
* stronger; the point has high leverage and is influential, so it's pulling the line toward it
* slope and correlation would both increase
* you could take the humans out; now your data are for non-human mammals
* moderately strong
* for every additional year an animal is expected to live, it spends about 15.5 more days in its mother before being born
* 270.4 days
Homework # 16
* hippos would make the association stronger because it is farther from the pattern
* increase
* no, there must be a good reason to take out points
* yes, the slope changed from 15.5 to 11.6; that is a big difference
Homework # 19
* No! There is a high-leverage point.
* with point: without point:
* There is a large change in R² and the slope.
Homework # 20
* only 7% of the variation in time is accounted for by the regression on year
* we can't say with such a bad regression
* probably not; the point doesn't have much leverage
* 15.9% is better; it appears that swimmers are taking 14 minutes off their time each year
Homework # 22
2 subgroups:
* 1965 – 1985: linear and positive
* 1994 – 1998: linear and flat (horizontal)
Homework # 23
a) The graph is clearly nonlinear; however, from about 1972 on there appears to be a positive linear relationship.
b) In 2010, CPI = $218.60
Homework # 24
a) Not including Costa Rica, the data have a strong negative linear association.
b) Costa Rica has 25 babies/woman. It has to be a mistake, because it is impossible.
c) r = -.814 and R² = 66.4% without Costa Rica
d) Plots: with Costa Rica; without Costa Rica
e) The model with Costa Rica is not appropriate; the residual plot has some pattern. Without Costa Rica, the residual plot has an even amount of scatter with no pattern.
f) Slope: the life expectancy goes down 4.36 years for every baby a woman has. The y-intercept says a woman with no children should live to be 86.8 years old.
g) There could be a lurking variable also affecting life expectancy.