Unit 2 Exploring Data: Comparisons and Relationships Topic 10 Least Squares Regression I (page 205)
Lists you will need in your calculator for this topic: FOUND GSAL INCM PRIVF PRIVT PUBF PUBT TUIT
OVERVIEW In previous topics you studied scatterplots as visual displays of the relationship between two quantitative variables and the correlation coefficient as a numerical measure of the linear association between them. With this topic you will begin to investigate least squares regression as a formal mathematical model often used to describe the relationship between quantitative variables.
Do the Preliminaries (page 206)
Denominate numbers are numbers written with their units attached. Always include the units so that the numbers have meaning.
Essential Question What is the best way to summarize the relationship between two quantitative variables? Where do we use predictions in our lives and how do we decide on what we are predicting?
Activity 10-1 Air Fares (pages 206 to 209)
(a) My prediction for airfare from Baltimore would be $ (your prediction). [explain] (your explanation)
(b) Another variable that might be useful for predicting the airfare to a certain destination would be (your variable).
[Scatterplot: airfare in $ vs. distance in miles, n = 12; Atlanta is the point (576, 178). Make your own scatterplot of the data from page 207.]
(c) Based on this scatterplot, does it seem that knowing the distance to a destination would be useful for predicting the airfare? YES [explain] There appears to be a fairly strong positive association between distance and airfare … as the distance increases, the airfare increases.
The simplest model for predicting one variable from another assumes that a straight line summarizes the relationship between the variables.
(d) [Scatterplots: airfare in $ vs. distance in miles, each with a candidate line.]
One idea is to draw YOUR line so it connects the leftmost and rightmost points.
Another idea is to draw YOUR line so it divides all the points in half … or maybe like this.
Finally, another idea is to draw YOUR line so it passes through as many points as possible … or maybe like this.
I like this one! But you pick your own. Use a ruler to draw YOUR line on the scatterplot in YOUR notes.
(e) Based on my line, if distance is 300 miles, then the airfare is approximately $149. OR … if x1 = 300, then y1 = 149.
(f) Based on my line, if distance is 1500 miles, then the airfare is approximately $226. OR … if x2 = 1500, then y2 = 226.
The equation of a line can be represented as ŷ = a + bx, where ŷ denotes the variable being predicted (reason for the hat) (response variable - vertical axis), x denotes the variable being used for the prediction (explanatory variable - horizontal axis), a is the value of the y-intercept of the line, and b is the value of the slope of the line. In this case, x represents distance and y represents airfare.
(g) slope = b
(h) y-intercept = a
(i) airfare = $________ + $0.06/mile * distance
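Parts (e) through (i) amount to finding the line through two points read off the hand-drawn line. The sketch below assumes the readings shown on the slides ($149 at 300 miles and $226 at 1500 miles); the function name `line_through` is made up for illustration, and your own line will give different numbers.

```python
def line_through(p1, p2):
    """Return (intercept a, slope b) of the line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points need different x values")
    b = (y2 - y1) / (x2 - x1)  # slope: rise over run, in dollars per mile
    a = y1 - b * x1            # intercept: solve y1 = a + b*x1 for a
    return a, b

# Assumed readings from the hand-drawn line in parts (e) and (f).
a, b = line_through((300, 149), (1500, 226))
print(f"slope b = ${b:.2f} per mile")
print(f"y-intercept a = ${a:.2f}")
print(f"airfare = ${a:.2f} + ${b:.2f}/mile * distance")
```

The slope rounds to $0.06 per mile, which agrees with the value on the slide. The intercept depends on how much the slope is rounded before it is used.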
Which line is the “best”? … read the bottom of page 209 and top of page 210! The equation of the least squares line or regression line, also known as the Line of Best Fit, is ŷ = a + bx, where the slope coefficient b and the intercept coefficient a are determined from the sample data: b = r (s_y / s_x) and a = ȳ − b x̄.
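The standard least squares formulas build the slope from the correlation and the two standard deviations, b = r (s_y / s_x), and the intercept from the two means, a = ȳ − b x̄. A minimal sketch follows. The data are hypothetical (they are NOT the page 207 airfares), and `least_squares` is an illustrative name rather than a calculator command.

```python
from statistics import mean, stdev

def least_squares(xs, ys):
    """Return (a, b, r) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation: sum of products of z-scores, divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope coefficient
    a = y_bar - b * x_bar    # intercept coefficient
    return a, b, r

# Hypothetical distances (miles) and airfares ($), for illustration only.
dist = [200, 400, 600, 800, 1000]
fare = [110, 120, 160, 170, 200]

a, b, r = least_squares(dist, fare)
print(f"airfare = ${a:.2f} + ${b:.3f}/mile * distance   (r = {r:.3f})")
```

This mirrors the hand calculation in Activity 10-2: find the means, standard deviations, and correlation first, then combine them.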
Essential Question How can you make predictions with a regression line?
Activity 10-2 Air Fares (continued) (pages 210 to 214)
(a) Enter the data from Activity 10-1 on page 207 to make lists named AIRF & DIST.
(b) [Table: mean, standard deviation, and correlation for airfare (y) and distance (x).]
(c) b = $0.12 per mile, a = $81.40
airfare = $81.40 + $0.12/mile * distance
The intercept is the cost when distance is 0 miles.
(d) (e) (f) [Please be sure to follow the directions in the book.] a = __________________ b = __________________ r² = __________________ r = __________________
airfare = $____ + $0.12/mile * distance
One of the primary uses of regression is for prediction.
(g) If distance is 300 miles, then the regression line predicts airfare to be … airfare = $____ + $0.12/mile * distance = $____ + $0.12/mile * (300 miles) = $____
(h) If distance is 1500 miles, then the regression line predicts airfare to be … airfare = $____ + $0.12/mile * distance = $____ + $0.12/mile * (1500 miles) = $263.27
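Parts (g) and (h) plug a distance into the regression equation. In the sketch below the slope of $0.12 per mile comes from the slides. The intercept of $83.27 is an assumption, worked backward from the $263.27 prediction at 1500 miles (263.27 − 0.12 × 1500), so check it against your own calculator output.

```python
INTERCEPT = 83.27  # dollars; ASSUMED, inferred from the $263.27 prediction at 1500 miles
SLOPE = 0.12       # dollars per mile; the rounded slope used on the slides

def predict_airfare(distance_miles, a=INTERCEPT, b=SLOPE):
    """Predicted airfare in dollars: a + b * distance."""
    return a + b * distance_miles

for d in (300, 1500):
    print(f"{d} miles -> ${predict_airfare(d):.2f}")
```

Storing the equation in a function plays the same role as storing it in Y1 on the calculator: one definition, many predictions.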
(i) [Scatterplot: airfare in $ vs. distance in miles.]
(j) Create a scatterplot on your calculator of airfare vs. distance, then graph the least squares line by entering: LinReg(a+bx) ʟDIST, ʟAIRF, Y1
(k) 900 miles ➙ airfare is $ (your guess). (your guess using the trace button on your calculator)
(l) 900 miles ➙ airfare is $____. (the regression line prediction)
STOP At this point you should know 3 ways to find this value!
1. Use the regression equation and enter 900 miles for distance: airfare = $____ + $0.117/mile * distance = $____ + $0.117/mile * (900 miles) = $____
2. After storing the regression equation in Y1, enter: Y1(900) = $____
3. After storing the regression equation in Y= and graphing the equation, press 2nd TRACE, then 1:value, then enter 900 and press ENTER: Y = $____
Take note that $0.117 was used instead of $0.12.
(m) 2842 miles ➙ airfare is $____. I would not take this seriously, because … I believe the fare is excessive and does not appear to be reasonable. The airfare was actually $198 at that time.
Extrapolation is predicting y values for x values beyond those contained in the data. There is no reason to believe that a relationship between two variables remains roughly linear beyond the range of values contained in a data set; therefore extrapolation is not advisable.
Interpolation is predicting y values for x values between those contained in the data.
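One way to keep the two definitions straight is to compare the x value you want to predict for against the smallest and largest x values in the data. The sketch below does exactly that. The distances are hypothetical stand-ins for the page 207 list, and `prediction_kind` is an illustrative name.

```python
def prediction_kind(x, observed_xs):
    """Say whether predicting at x stays inside the range of the observed x values."""
    lo, hi = min(observed_xs), max(observed_xs)
    return "interpolation" if lo <= x <= hi else "extrapolation"

# Hypothetical distances in miles, NOT the textbook data.
observed = [200, 400, 600, 800, 1000]

for miles in (900, 2842):
    print(f"{miles} miles: {prediction_kind(miles, observed)}")
```

A prediction flagged as extrapolation can still be computed; the flag is only a reminder that the linear pattern has not been observed that far out.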
(n) To get this table with your calculator you can: a) enter each value one at a time, or b) press 2nd WINDOW, set TblStart = 900 and ∆Tbl = 1, then press 2nd GRAPH to see the values. [Table: distance, airfare]
(o) Do you notice a pattern in these predictions? YES Each prediction is $0.12 higher than the preceding prediction. Does this number look familiar? YES [explain] This number is the slope coefficient of the regression line equation.
(p) Airfare will rise $12 for each additional 100 miles that a destination is farther away. This demonstrates that one can interpret the slope coefficient of the least squares line as the predicted change in the y-variable (airfare) for a one-unit change in the x-variable (distance).
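The pattern in parts (o) and (p) can be checked by building the prediction table and subtracting neighboring rows. The sketch uses the rounded slope of $0.12 per mile from the slides. The intercept of $83.27 is an assumption, but it cancels in every difference, so the conclusion does not depend on it.

```python
def predict(distance, a=83.27, b=0.12):
    """Predicted airfare in dollars; the intercept a is an assumed value."""
    return a + b * distance

# Same rows the calculator table shows with TblStart = 900 and step 1.
table = [(d, predict(d)) for d in range(900, 906)]

# Difference between each prediction and the one before it.
steps = [round(table[i + 1][1] - table[i][1], 2) for i in range(len(table) - 1)]
print("one-mile steps:", steps)

# A 100-mile change in distance changes the prediction by 100 times the slope.
per_100_miles = round(predict(1000) - predict(900), 2)
print("100-mile step:", per_100_miles)
```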
Assignment Activity 10-5: Cars Fuel Efficiency (continued) (page 220)
Assignment Activity 10-7: College Tuitions (continued) (page 221)
Essential Question What is the proportion of variability and what is its meaning?
Activity 10-3 Air Fares (continued) (pages 214 to 218)
(a) 576 miles ➙ airfare is $150.88. (the regression line prediction)
[Scatterplot: airfare in $ vs. distance in miles, with Atlanta (576, 178) and its predicted value marked.]
(b) The actual airfare to Atlanta was $178. $178 (actual value) − $150.88 (predicted value) = $27.12
[Scatterplot: $27.12 is the difference.]
Statistical modeling thinks of each data point as being composed of two parts: (1) the part that is explained by the model, called the fit, (2) and the “leftover” part, called the residual. The FITTED VALUE for an observation is the y-value that the regression line would predict for the x-value of that observation. The RESIDUAL is the difference between the actual y-value and the fitted value. The residual measures the vertical distance from the observed y-value to the regression line.
RESIDUAL = Actual - Fitted
(c) [Table: destination, distance, airfare, fitted, residual, deviation; rows for Atlanta, Boston, Chicago, Dallas, Detroit, Denver, Miami, New Orleans, New York, Orlando, Pittsburgh, St. Louis.]
Actual − Fitted = RESIDUAL: $____ − $____ = $11.30
Actual − RESIDUAL = Fitted: $____ − (−$61.10) = $____
(d) The city with the largest residual (in absolute value) is St. Louis; its distance is 737 miles and its airfare is $98. The regression line erred in predicting its airfare: it was an overestimate by $71.77 (|−71.77| = 71.77).
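The two rearrangements of RESIDUAL = Actual − Fitted can be tried on the only two cities whose numbers appear in these notes: Atlanta (actual $178, fitted $150.88) and St. Louis (actual $98, residual −$71.77). The function names below are illustrative.

```python
def residual(actual, fitted):
    """Residual = actual value minus fitted value."""
    return actual - fitted

def fitted_from_residual(actual, resid):
    """Rearranged: fitted value = actual value minus residual."""
    return actual - resid

def line_error(resid):
    """A negative residual means the line predicted too high (an overestimate)."""
    if resid < 0:
        return "overestimate"
    if resid > 0:
        return "underestimate"
    return "exact"

atlanta_resid = residual(178, 150.88)
st_louis_fitted = fitted_from_residual(98, -71.77)

print(f"Atlanta residual: ${atlanta_resid:.2f} ({line_error(atlanta_resid)})")
print(f"St. Louis fitted value: ${st_louis_fitted:.2f} ({line_error(-71.77)})")
```

Atlanta sits above the line (positive residual), while St. Louis sits below it (negative residual).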
(e) For the observations with positive residual values, their actual airfare was greater than the predicted airfare.
(f) For the observations with negative residual values, their points fall below the regression line.
(g) Find the deviation from the mean for Dallas, then record the value in the table. airfare mean = $____ The deviation from the mean for Dallas is: $____ − $____ = $____ (Actual − MEAN = Deviation)
(h) There are 4 cities where the overall mean airfare results in a closer prediction to the actual airfare than the regression line. The cities are: Atlanta, Detroit, Pittsburgh, St. Louis.
(i) Most cities have a smaller residual than their deviation from the mean. Does this suggest that predictions from the regression line are generally better than the airfare mean? YES [explain] The least squares regression equation takes the explanatory (x) variable into account.
What does it really mean that the least squares regression equation takes the explanatory (x) variable into account? When the deviation from the airfare mean was calculated, only the airfare (response or y) variable was taken into account. The distance (explanatory or x) variable is not even in the calculation.
Sum the residual column: sum = 0
(j) Please note that you MUST have already calculated the linear regression for airfare vs. distance. Press the following keys: 2nd LIST MATH 5:sum( Then enter the list RESID: sum(ʟRESID²)
sum of squared residuals = $² 14,… (prediction errors using regression line)
Now that’s weird! Dollars squared.
(k) Press the following keys: 2nd LIST MATH 5:sum( The calculation will look like this … sum((ʟAIRF − ȳ)²) or, with the mean airfare typed in, sum((ʟAIRF − ____)²)
sum of squared deviations in airfare from overall mean = $² 38,… (prediction errors using the overall mean)
There it is again!
(l) Now let’s get rid of that weird dollar squared! Take the answer from part (j) and divide it by the answer from part (k): $² 14,… [part (j)] ÷ $² 38,… [part (k)] = ____
This ratio of the sum of squared residuals to the sum of squared deviations is the proportion of the variability in the response variable that is left unexplained (residual) by the regression model. Subtracting this value from 1 gives the proportion of variability in the response variable that is explained by the regression model.
(m) Square the correlation coefficient between distance and airfare from Activity 10-2 part (b): r² = (____)² = ____ Does this value look familiar? YES [explain] This is the same as the proportion of variability in the response variable that is explained by the regression model.
The proportion of variability in the y-variable explained by the regression model with the x-variable is more efficiently calculated as the square of the correlation coefficient, written r². This proportion provides a measure of how closely the points fall to the least squares line and thus also provides an indication of how confident one can be of predictions made with the line.
The proportion of the variability in airfares explained by the regression line with distance is … r² = 63.2%.
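The chain of steps in parts (j) through (m) can be followed on a small data set: sum the squared residuals, sum the squared deviations from the mean, divide, subtract from 1, and compare with r². The data below are hypothetical (NOT the page 207 airfares), so the result will not be 63.2%.

```python
from statistics import mean

# Hypothetical distances (miles) and airfares ($), for illustration only.
dist = [200, 400, 600, 800, 1000]
fare = [110, 120, 160, 170, 200]

x_bar, y_bar = mean(dist), mean(fare)
sxx = sum((x - x_bar) ** 2 for x in dist)
syy = sum((y - y_bar) ** 2 for y in fare)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dist, fare))

b = sxy / sxx            # least squares slope
a = y_bar - b * x_bar    # least squares intercept

fitted = [a + b * x for x in dist]
residuals = [y - f for y, f in zip(fare, fitted)]

sse = sum(e ** 2 for e in residuals)  # part (j): sum of squared residuals
sst = syy                             # part (k): sum of squared deviations from the mean
unexplained = sse / sst               # part (l): the units ($ squared) cancel
r_squared = 1 - unexplained           # proportion of variability explained
r = sxy / (sxx * syy) ** 0.5          # correlation coefficient

print(f"sum of residuals = {sum(residuals):.6f}")
print(f"SSE = {sse:.1f}, SST = {sst:.1f}, unexplained = {unexplained:.4f}")
print(f"1 - SSE/SST = {r_squared:.4f}, r squared = {r ** 2:.4f}")
```

The last line should show the same number twice, which is the point of part (m). The first line shows that the residuals from a least squares line sum to zero, apart from rounding.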
Assignment Activity 10-6: Governors’ Salaries (continued) (pages 220 & 221)
Essential Question What are the limitations of the predictions of a regression line?
Activity 10-4 College Tuitions (continued) (pages 218 & 219)
(a) [Use the lists named PUBF and PUBT.] [Get a scatterplot and regression line on your calculator.] Public: r = ____ r² = ____
tuition = −$13,… + $9.59/yr * (founding year) (ŷ = a + bx)
(b) [Use the lists named PRIVF and PRIVT.] [Get a scatterplot and regression line on your calculator.] Private: r = ____ r² = ____
tuition = $84,… − $37.10/yr * (founding year) (ŷ = a + bx)
(c) Are the equations similar for the public and private schools? NO
Public: tuition = −$13,… + $9.59/yr * (founding year)
Private: tuition = $84,… − $37.10/yr * (founding year)
Are the r² values similar for the public and private schools? YES
Public: r = ____ r² = ____
Private: r = ____ r² = ____
[Scatterplots with regression lines: public and private.]
(d) The private line appears to be doing a better job of summarizing the relationship between tuition and founding year. [explain] The points follow a linear relationship more closely.
(e) public: school founded in 1900 ➙ tuition is $5,… (regression line prediction)
private: school founded in 1900 ➙ tuition is $14,… (regression line prediction)
Judging from the scatterplot, the private school prediction is more reasonable. [explain] The points fall closer to the line in the area of 1900 on this scatterplot.
Linear regression models are NOT appropriate for all sets of data. The correlation coefficient and r² values do not necessarily attest to how well a linear model describes the association. It is important to look at visual displays of the data.
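A small made-up example shows why a large r² is not enough on its own. The points below lie exactly on the curve y = x², yet a straight line fitted to them still gets an r² of about 0.95. The signs of the residuals give the curve away. The data and variable names are hypothetical and are not taken from the tuition lists.

```python
from statistics import mean

# Hypothetical curved data: y is exactly x squared.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx            # least squares slope
a = y_bar - b * x_bar    # least squares intercept
r_squared = sxy ** 2 / (sxx * syy)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]

print(f"r squared = {r_squared:.3f}")
print("residual signs:", " ".join(signs))
```

Randomly scattered residuals would mix plus and minus signs with no order. Here they run plus, then minus, then plus again, a pattern that a scatterplot or residual plot makes obvious at a glance.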
WRAP-UP This topic has led you to study a formal mathematical model for describing the relationship between two quantitative variables. In studying least squares regression, you have encountered a variety of related terms and concepts. These ideas include the use of regression in prediction, the danger of extrapolation, the interpretation of the slope coefficient, the concepts of fitted values and residuals, and the interpretation of r² as the proportion of variability explained by the regression line. Understanding all of these ideas is important to applying regression techniques thoughtfully and appropriately.
In the next topic you will continue your study of least squares regression. You will explore the distinction between outliers and influential observations, discover the utility of residual plots, and consider transformations of variables.
More on Correlation and Variability
r tells how well the data fit the line, hence the name, “line of best fit.” r² gives the percent of the variability of the y’s accounted for in the model. For example, if r² = 60%, then 40% is not accounted for in the model. There could be other variations … what are they?
Here is the deal on Residuals
Residuals: try to put the points in a rectangle (like the viewing screen on the calculator). Are they randomly scattered or are they in a pattern? The BEST two indicators of a good fitting line are r² and residuals.
Additional Notes: r works only for linear models. r² works for all models. r² gets better with higher order models.
Assignment Activity 10-17: Incorrect Conclusions (page 226)
Assignment Activity 10-12: Turnpike Tolls (page 224)
Assignment Activity 10-18: Airfares (continued) (page 226)
Note: The original regression equation: airfare = $____ + $0.12 * distance
Compare all answers to the original regression equation! [PK Hint: Use the lists AIRF and DIST to transform the data accordingly.] [PK Help: Ask me for an easy way to adjust these lists.]
Your topic is due!
Statistical Scrabble Activity: You will use your class data from Topic 2.
Quiz on Topic 10: Least Squares Regression I