Published by Jayson Wilkins. Modified over 9 years ago.
1
Chapter 8: Linear Regression
How can a model be created that represents the linear relationship between two quantitative variables?
2
Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu: [scatterplot not reproduced in this transcript]
How many grams of fat would an item with 25 grams of protein have?
3
What Is Linear Regression?
Remember that correlation suggests there is a linear relationship between two variables, but it does not describe that relationship. We can say more about the linear relationship between two quantitative variables with a model. The linear relationship is modeled by a straight line through the data. The data points do not all fall on the line, but a straight line summarizes the overall direction of the data.
4
Regression and Residuals
Some points will be above the line and some will be below it. The estimate made from a model is the predicted value (denoted ŷ). The difference between the actual value and the predicted value (actual minus predicted) is known as the residual.
5
Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate). A positive residual means the predicted value is too small (an underestimate).
6
Line of Best Fit
Some residuals are positive (points above the line) and some are negative (points below the line). We cannot judge how well the line fits by simply adding up the residuals, because the negatives and the positives cancel each other out. Instead, we add the squared residuals. The line of best fit is the line for which the sum of the squared residuals is smallest. This regression line is also known as the least squares regression line (LSRL).
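In symbols, the least squares criterion can be written as below. The notation is an assumption, since the slide states the idea only in words: n data points (x_i, y_i), with intercept b_0 and slope b_1 as in the equation of the line.

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
\qquad \text{where} \qquad \hat{y}_i = b_0 + b_1 x_i
```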
7
Line of Best Fit
It is written as ŷ = a + bx, or equivalently ŷ = b₀ + b₁x.
8
Slope of the Regression Line
The slope is always in units of y per unit of x. It is the predicted change in y for each one-unit increase in x.
9
Y-Intercept
The intercept is always in units of y. It is the predicted value of y when x is 0.
10
Residuals Revisited
The linear model assumes the relationship is a perfect straight line. The residuals are the part of the data that has not been modeled.
Data = Model + Residual
Residual = Data - Model
In symbols: e = y - ŷ
11
Example
Given the regression line for the previous scatterplot:
ŷ = 6.413 + 0.9769x
Predicted fat = 6.413 + 0.9769 × protein
What does the slope represent? What does the y-intercept mean?
12
Example (continued)
Given the regression line for the previous scatterplot:
ŷ = 6.413 + 0.9769x
Predicted fat = 6.413 + 0.9769 × protein
How much fat would we expect an item with 12 grams of protein to have? How much protein would an item with 15 grams of fat have?
13
Example (continued)
Given the regression line for the previous scatterplot:
ŷ = 6.413 + 0.9769x
Predicted fat = 6.413 + 0.9769 × protein
A Double Whopper sandwich has 48 grams of protein and 58 grams of fat. What is the residual in fat for this sandwich?
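The prediction and residual questions from these example slides can be sketched in Python. This is a minimal illustration, not part of the original deck: the names `predicted_fat` and `residual` are hypothetical, and the coefficients are the ones given on the slide.

```python
# Fitted line from the slide: predicted fat = 6.413 + 0.9769 * protein
INTERCEPT = 6.413   # predicted grams of fat at 0 g of protein
SLOPE = 0.9769      # predicted grams of fat per gram of protein


def predicted_fat(protein_g):
    """Predicted fat (g) for an item with the given protein (g)."""
    return INTERCEPT + SLOPE * protein_g


def residual(actual_fat_g, protein_g):
    """Residual = data - model = actual fat - predicted fat."""
    return actual_fat_g - predicted_fat(protein_g)


# An item with 12 g of protein: about 18.1 g of fat predicted.
fat_at_12 = predicted_fat(12)

# Double Whopper (48 g protein, 58 g fat): residual of about +4.7 g,
# so the line underestimates its fat content.
double_whopper_resid = residual(58, 48)

print(round(fat_at_12, 2), round(double_whopper_resid, 2))
```

Note that this line predicts fat from protein. Solving it backwards for protein is not the same as fitting a regression of protein on fat, so the "15 grams of fat" question is best treated with caution.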
14
Example: Burger King
The following are select items from the Burger King menu, with grams of fat and total calories.
Item | Calories | Grams of fat
Whopper | 650 | 37
Whopper with cheese | 730 | 44
Big King | 530 | 31
Hamburger | 230 | 9
Cheeseburger | 270 | 12
Tendergrill chicken sandwich | 460 | 21
Original chicken sandwich | 660 | 40
Big fish sandwich | 520 | 28
BK Veggie Burger | 390 | 16
15
Example (continued)
What is the regression line for the data?
What is the slope in the context of the problem?
What is the y-intercept in the context of the problem?
A sandwich with 15 grams of fat would be expected to have how many calories?
A sandwich with 450 calories would be expected to have how many grams of fat?
A Bacon Cheeseburger has 13 grams of fat and 290 total calories. What is the residual in calories for this sandwich?
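One way to work these questions is to compute the least squares line directly from the nine sandwiches on the previous slide. The sketch below uses plain Python and assumes fat is the explanatory variable and calories the response. The helper name `least_squares` is hypothetical, and using a second fit with the roles swapped for the 450-calorie question is one reasonable reading, not something the deck specifies.

```python
# (grams of fat, calories) for the nine sandwiches in the Burger King table
data = [
    (37, 650), (44, 730), (31, 530), (9, 230), (12, 270),
    (21, 460), (40, 660), (28, 520), (16, 390),
]


def least_squares(points):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Calories predicted from fat: roughly calories = 134.25 + 13.58 * fat
intercept, slope = least_squares(data)

# Predicted calories for a sandwich with 15 g of fat
calories_at_15 = intercept + slope * 15

# Residual for the Bacon Cheeseburger (13 g fat, 290 calories): actual - predicted
bacon_residual = 290 - (intercept + slope * 13)

# Predicting fat from calories uses a separate fit with the roles swapped
fat_intercept, fat_slope = least_squares([(cal, fat) for fat, cal in data])
fat_at_450 = fat_intercept + fat_slope * 450

print(round(intercept, 2), round(slope, 2))
print(round(calories_at_15, 1), round(bacon_residual, 1), round(fat_at_450, 1))
```

In context, the slope says each additional gram of fat goes with about 13.6 more calories on average, and the negative residual means the line overestimates the Bacon Cheeseburger's calories.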
16
Conditions Required
1. Quantitative Variable condition
2. Straight Enough condition
3. Outlier condition
17
R-Squared
R² gives the fraction of the data’s variation accounted for by the model, and 1 - R² is the fraction of the original variation left in the residuals.
Example: for the Burger King sandwich data, r = 0.9881 and r² = 0.9763. So 97.63% of the variation in calorie content among these sandwiches is accounted for by the linear model on fat content. The remaining 2.37% is left in the residuals.
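The R² figure can be checked from its definition as one minus the ratio of residual variation to total variation. The sketch below is a plain-Python illustration using the same nine sandwiches (fat as x, calories as y). The variable names are assumptions, not the deck's.

```python
# (grams of fat, calories) for the nine Burger King sandwiches
data = [
    (37, 650), (44, 730), (31, 530), (9, 230), (12, 270),
    (21, 460), (40, 660), (28, 520), (16, 390),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Least squares slope and intercept
sxx = sum((x - mean_x) ** 2 for x, _ in data)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in calories, and the variation left in the residuals
sst = sum((y - mean_y) ** 2 for _, y in data)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in data)

r_squared = 1 - sse / sst        # fraction of variation accounted for by the model
unexplained = sse / sst          # fraction left in the residuals

print(round(r_squared, 4), round(unexplained, 4))
```

For a simple linear regression this value equals the square of the correlation r, which is why the slide can quote r and r² side by side.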
18
Residual Plot
A plot of the residuals from the regression line. A noticeable pattern in the residual plot may indicate that the regression line is not a good model. The residual plot of a well-fitting model shows random scatter with no obvious pattern.
19
What Not to Do
Don’t fit a straight line to a nonlinear relationship.
Beware of extraordinary points.
Don’t extrapolate beyond the data.
Don’t infer that x causes y just because there is a good linear model for their relationship.
Don’t choose a model based on R² alone.
20
Breakfast Cereals, Sugar and Calories
The following data from 77 breakfast cereals describe the relationship between the sugar in each cereal and its calories.
r = 0.564
Calories: mean 107.0, SD 19.5
Sugar: mean 7.0 grams, SD 4.4 grams
What is the slope of the regression line? What is the y-intercept? Write the regression equation and interpret it.
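When only summary statistics are available, the least squares line can be built from the standard relations slope = r × (SD of y) / (SD of x) and intercept = (mean of y) - slope × (mean of x). These formulas are not shown in the transcript, so treat the sketch below as an assumption about the intended method. The function name `line_from_summary` is hypothetical, and sugar is taken as x with calories as y.

```python
def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Least squares (intercept, slope) from the correlation, means, and SDs."""
    slope = r * sd_y / sd_x                 # units of y per unit of x
    intercept = mean_y - slope * mean_x     # line passes through (mean_x, mean_y)
    return intercept, slope


# Cereal data from the slide: x = sugar (g), y = calories
intercept, slope = line_from_summary(
    r=0.564, mean_x=7.0, sd_x=4.4, mean_y=107.0, sd_y=19.5
)

# Roughly: predicted calories = 89.5 + 2.50 * sugar
print(round(intercept, 1), round(slope, 2))
```

Read in context, each additional gram of sugar goes with about 2.5 more calories on average, and a cereal with no sugar is predicted to have about 89.5 calories.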
21
Urban Planning
We want to estimate the costs per person associated with traffic delays, using the 2002 Urban Mobility Report (70 cities in 2000).
Annual cost per person: mean $298.96, SD $180.83
Average speed per person: mean 54.34 mph, SD 4.494 mph
r = -0.90
Write an equation to model this situation. What does the slope mean?
22
What to Watch Out for in Regression
Interpreting beyond the data (extrapolating)
Influential points
Lurking variables
Linear regression that is not “linear”, and what to do about it
23
Extrapolation
We cannot assume that a linear relationship in the data holds beyond the range of the data. A prediction made for an x-value outside that range is called an extrapolation.
24
Extrapolation (cont.)
A regression of mean age at first marriage for men vs. year, fit to the first four decades of the 20th century, does not hold for later years: [figure not reproduced in this transcript]
25
Influential Outliers
We say that a point is influential if omitting it from the analysis gives a very different model.
26
Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election… [scatterplot not reproduced in this transcript]
27
Lurking Variable
No matter how straight the line, how strong the association, or how high the R-squared value, there is no way to conclude from regression alone that one variable causes the other. There is always the possibility that some third variable is driving both of the variables being observed.
28
What to Do When the Relationship Is Not Straight
Re-express the data with logs, square roots, or reciprocals. We will look primarily at square roots and logarithms.
Example: take the square root of the response variable, plot the re-expressed data in a scatterplot, and examine the residual plot.
Example: re-express the data using a combination of logarithms, log(x) and log(y).
Alternatively, fit a curve to the data, which is more difficult.
29
The Ladder of Powers
Power | Name | Comment
2 | Square of data values | Try with unimodal distributions that are skewed to the left.
1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression.
½ | Square root of data values | Counts often benefit from a square root re-expression.
“0” | We’ll use logarithms here | Measurements that cannot be negative often benefit from a log re-expression.
-1/2 | Reciprocal square root | An uncommon re-expression, but sometimes useful.
-1 | The reciprocal of the data | Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
30
Plan B: Attack of the Logarithms (cont.)
31
Why Not Just a Curve?
If there’s a curve in the scatterplot, why not just fit a curve to the data?
32
Why Not Just a Curve? (cont.)
The mathematics and calculations for “curves of best fit” are considerably more difficult than for “lines of best fit.” Besides, straight lines are easy to understand. We know how to think about the slope and the y-intercept.
33
Example: Data collected in the study of water pollution from commercial and domestic waste
Day | Oxygen Demand
1 | 109
2 | 149
3 | (value missing in source)
5 | 191
7 | 213
10 | 224
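The slide does not say which re-expression to try on these data. As one possible exploration, the sketch below compares the correlation of oxygen demand with day against its correlation with log10(day). Choosing to take the logarithm of the explanatory variable is an assumption made here for illustration, and the Day 3 row is left out because its value is missing in the source.

```python
import math

# (day, oxygen demand) pairs from the slide; Day 3 is omitted because
# its reading is missing in the source and should not be guessed.
data = [(1, 109), (2, 149), (5, 191), (7, 213), (10, 224)]


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


days = [d for d, _ in data]
demand = [y for _, y in data]
log_days = [math.log10(d) for d in days]   # re-expressed explanatory variable

r_raw = correlation(days, demand)       # about 0.945 on the raw scale
r_log = correlation(log_days, demand)   # about 0.998 after the log re-expression

print(round(r_raw, 3), round(r_log, 3))
```

A higher correlation alone does not settle the choice of model. With only five points, the scatterplot and residual plot of the re-expressed data should be checked as well.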