Chapter 8: Linear Regression


Chapter 8: Linear Regression
By Dara Lee and Michelle Smith, Period 1

Linear Model
The linear model is just the equation of a straight line through the data. The points in a scatterplot don't all line up, but a straight line can summarize the general pattern. The model can help us understand how the two variables are associated.

Residuals
An estimate made from a model is called the predicted value (ŷ).
The difference between the observed value (y) and the predicted value (ŷ) is called the residual (e): Residual = Observed - Predicted, or e = y - ŷ.
A negative residual means the predicted value is too big (the model overestimates); a positive residual means the predicted value is too small (the model underestimates).
To see whether a linear model is appropriate, plot the residuals: the plot should show only scatter, with no interesting features, no direction, no shape, no bends, and no outliers. Residuals help us see whether the model makes sense.
If r = 0 there is no linear relationship; r = -1 or r = +1 means the data fall exactly on one straight line.
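As a quick illustration of these definitions, here is a minimal Python sketch (the data are made up for illustration, not the Burger King menu items) that fits a line, computes predicted values and residuals, and shows how their signs are read:

import numpy as np

# Hypothetical data, just to illustrate residuals
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a least-squares line y-hat = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)          # slope, intercept

y_hat = b0 + b1 * x                   # predicted values
residuals = y - y_hat                 # e = observed - predicted

# negative residual: the prediction was too big;
# positive residual: the prediction was too small
for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    print(f"x={xi:.1f}  y={yi:.1f}  predicted={yh:.2f}  residual={e:+.2f}")

# Residuals from a least-squares fit sum to essentially zero
print("sum of residuals:", round(residuals.sum(), 10))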

Line of Best Fit
The line of best fit is the line for which the sum of the squared residuals is the smallest. It is also known as the "least squares" line. Squaring the residuals makes them all positive for summation and also emphasizes the largest residuals. The smaller the sum, the better the fit.
Equation of the line: ŷ = b0 + b1x
Slope: b1 = r(sy/sx)
Intercept: b0 = ȳ - b1x̄
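To make the slope and intercept formulas concrete, here is a minimal Python sketch (again with made-up data) that computes b1 = r(sy/sx) and b0 = ȳ - b1x̄ directly and checks the result against a least-squares fit:

import numpy as np

# Hypothetical data, just to exercise the formulas
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.2, 4.8, 7.1, 8.0, 10.4, 11.9])

x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations
r = np.corrcoef(x, y)[0, 1]               # correlation coefficient

b1 = r * s_y / s_x                        # slope: b1 = r(sy/sx)
b0 = y_bar - b1 * x_bar                   # intercept: b0 = y-bar - b1*x-bar
print(f"by formula : y-hat = {b0:.3f} + {b1:.3f}x")

slope, intercept = np.polyfit(x, y, 1)    # direct least-squares fit
print(f"np.polyfit : y-hat = {intercept:.3f} + {slope:.3f}x")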

Correlation and the Line
The equation for a line that passes through the origin can be written with just a slope and no intercept: y = mx.
If we standardize both variables, the points' coordinates aren't written as (x, y); their coordinates are z-scores, (zx, zy), and the regression line for these standardized points passes through the origin with slope r: predicted zy = r·zx.
For every horizontal change of one standard deviation sx, the predicted value changes vertically by r·sy. Moving one standard deviation away from the mean in x moves our estimate r standard deviations away from the mean in y. In general, moving any number of standard deviations away from the mean in x moves the estimate r times that number of standard deviations away from the mean in y.
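Here is a minimal Python sketch (hypothetical data) showing that when both variables are converted to z-scores, the least-squares slope equals r and the intercept is essentially zero, so the standardized line passes through the origin:

import numpy as np

# Hypothetical data to illustrate the standardized regression line
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([11.0, 15.0, 14.0, 20.0, 24.0, 29.0])

# Convert each variable to z-scores (mean 0, sample SD 1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(zx, zy, 1)

# slope of the standardized regression equals r; intercept is ~0
print(f"r = {r:.4f}   slope = {slope:.4f}   intercept = {intercept:.2e}")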

How Big Can Predicted Values Get?
Each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was, because |r| ≤ 1: the predicted z-score for y is r times the z-score for x. For example, if r = 0.6 and an x-value sits 2 standard deviations above its mean, the predicted y sits only 1.2 standard deviations above its mean. This property of the linear model is called regression to the mean, and the line is called the regression line.

R²: The Variation Accounted For
r is the correlation between the two variables; the greater the absolute value of the correlation, the stronger the association.
The squared correlation, R², gives the fraction of the data's variation accounted for by the model, and 1 - R² is the fraction of the original variation left in the residuals.
An R² of 0 means that none of the variance in the data is accounted for by the model; all of it is still in the residuals. An R² of 1 means the residuals are all zero and the points fall exactly on the line.
(As before, squaring the residuals ensures that all are positive so that they can be added when finding the line of best fit; the smaller the sum, the better the fit.)
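A minimal Python sketch (hypothetical data) computing R² both ways: as the square of the correlation, and as one minus the fraction of the variation left in the residuals:

import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

sse = np.sum(residuals ** 2)          # variation left in the residuals
sst = np.sum((y - y.mean()) ** 2)     # total variation in y

r = np.corrcoef(x, y)[0, 1]
print("r squared  :", r ** 2)
print("1 - SSE/SST:", 1 - sse / sst)  # same value, up to rounding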

Assumptions and Conditions
Quantitative Variables Condition: both variables must be quantitative, not categorical.
Straight Enough Condition: the scatterplot must look reasonably straight. Linearity can be checked again after the regression, by examining the residuals.
Outlier Condition: no point should stand apart from the overall pattern. To spot outliers, check the residuals; outlying points tend to have large residuals, and outliers can dramatically change a regression model.
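As a rough sketch of the Outlier Condition check, the residuals can be standardized and unusually large ones flagged. The data below are hypothetical, and the cutoff of about 2 standard deviations is just a common rule of thumb, not something from the slides:

import numpy as np

# Hypothetical data with one point that doesn't fit the pattern
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 25.0, 16.2])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Standardize the residuals and flag any beyond ~2 SDs (rule of thumb)
z = residuals / residuals.std(ddof=1)
for xi, yi, zi in zip(x, y, z):
    flag = "  <-- possible outlier" if abs(zi) > 2 else ""
    print(f"x={xi:.0f}  y={yi:5.1f}  standardized residual={zi:+.2f}{flag}")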

Chapter 8 Problem #33
Classified ads in the Ithaca Journal offered several used Toyota Corollas for sale. Listed below are the ages of the cars and the advertised prices.

Age (yr)   Prices Advertised ($)
1          12995, 10950
2          10495
3          10995, 10995
4          6995, 7990
5          8700, 6995
6          5990, 4995
9          3200, 2250, 3995
11         2900, 2995
13         1750

a) Find the equation of the regression line.
b) Explain the meaning of the slope of the line.
c) Explain the meaning of the intercept of the line.
d) If you want to sell a 7-year-old Corolla, what price seems appropriate?
e) You have a chance to buy one of two cars. They are about the same age and appear to be in equally good condition. Would you rather buy the one with a positive residual or a negative residual? Explain.
f) You see a "For Sale" sign on a 10-year-old Corolla stating the asking price as $1500. What is the residual?
g) Would this regression model be useful in establishing a fair price for a 20-year-old car? Explain.

Chapter 8 Problem #33
a) Predicted price = 12319.6 - 924 × years
b) Every extra year of age decreases the average advertised value by about $924.
c) The intercept predicts that an average new (0-year-old) Corolla would be advertised at about $12,319.60.
d) About $5,851.60.
e) The one with a negative residual: its price is below the value the model predicts for its age.
f) About -$1,579.60.
g) No. Not far past age 13 the model starts predicting negative prices, so the relationship cannot stay linear, and age 20 is far outside the data.

ANSWER EXPLANATIONS
a) First, calculate the mean age of the used Corollas. There are 17 cars, so add all 17 ages (1 + 1 + 2 + … + 13 = 102) and divide by 17; the mean age is 6 years. Second, calculate the mean of the advertised prices: 12995 + 10950 + … + 1750 = 115185, divided by 17, gives about $6,775.59.
Third, calculate the sample standard deviations of age and price. For each variable, subtract the mean from each data value, square the differences, and add up the squares. Divide that sum by the number of values minus one (17 - 1 = 16) and take the square root. For age, the sum of squared deviations is 220, so the standard deviation is √(220/16) ≈ 3.71 years; for price, the standard deviation is about $3,624.4.
Fourth, multiply each age deviation by the matching price deviation and add up the products; the sum is about -203,280. Divide that sum by (17 - 1) times the standard deviation of age times the standard deviation of price to get the correlation coefficient. You should get r ≈ -0.945.
Fifth, plug these numbers into the formula for the regression line. Remember: ŷ = b0 + b1x, with slope b1 = r(sy/sx) and intercept b0 = ȳ - b1x̄.
So b1 = -0.945 × (3624.4/3.71) ≈ -924, and therefore b0 = 6775.59 - (-924)(6) ≈ 12319.6, giving the equation ŷ = 12319.6 - 924x.
b) The slope is -924: every extra year of age decreases the average advertised value by about $924.
c) The intercept is the prediction at age 0: ŷ = 12319.6 - 924(0) = 12319.6, so the model prices an average new Corolla at about $12,319.60.
d) For a 7-year-old Corolla, 12319.6 - 924(7) = 5851.6, so a price around $5,851.60 seems appropriate.
e) You would want to buy the car with a negative residual. Remember that residual = observed - predicted; a negative residual means the model overestimated the price, so the car is being offered for less than the model predicts. A positive residual means the asking price is above the model's prediction, and if you only brought enough money based on the model, you wouldn't be able to buy the car.
f) For a 10-year-old Corolla the model predicts 12319.6 - 924(10) = 3079.6. Residual = observed - predicted = 1500 - 3079.6 = -1579.6.
g) No. The model predicts a price of $0 at about 13.3 years and negative prices after that, so the relationship cannot remain linear, and looking so far ahead to age 20, away from all the other data values, is extrapolation.
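A minimal Python sketch that reproduces these answers from the 17 cars listed in the problem; the printed values may differ slightly from the hand calculation because nothing is rounded along the way:

import numpy as np

# Ages and advertised prices of the 17 used Corollas (Problem 33)
age = np.array([1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 9, 9, 9, 11, 11, 13], dtype=float)
price = np.array([12995, 10950, 10495, 10995, 10995, 6995, 7990, 8700, 6995,
                  5990, 4995, 3200, 2250, 3995, 2900, 2995, 1750], dtype=float)

r = np.corrcoef(age, price)[0, 1]
b1 = r * price.std(ddof=1) / age.std(ddof=1)   # slope = r * (s_price / s_age)
b0 = price.mean() - b1 * age.mean()            # intercept

print(f"r = {r:.4f}")
print(f"predicted price = {b0:.1f} + ({b1:.1f}) * age")   # about 12319.6 - 924*age

print("d) 7-year-old price :", b0 + b1 * 7)               # about 5852
print("f) residual at $1500:", 1500 - (b0 + b1 * 10))     # about -1580
print("g) 20-year-old price:", b0 + b1 * 20)              # negative: extrapolation fails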

Chapter 8 Problem #37
Here are the data used when the association between the amount of fat and the calories in hamburgers was examined.

Fat (g)    19   31   34   35   39   39   43
Calories  410  580  590  570  640  680  660

When a scatterplot was made, the equation of the regression line was calculated to be:
Predicted calories = 211 + 11.06 × fat (the slope is in calories per gram of fat)

a) Explain why you cannot use that model to estimate the fat content of a burger with 600 calories.
b) Using an appropriate model, estimate the fat content of a burger with 600 calories.

Chapter 8 Problem #37
a) The regression was for predicting calories from fat, not the other way around.
b) Predicted fat = -15 + 0.083 × calories, which predicts about 34.8 grams of fat for a 600-calorie burger.

ANSWER EXPLANATIONS
a) Remember that the original regression was set up to predict calories from fat content, NOT fat content from calories. Because least-squares regression minimizes errors in the response variable only, you cannot simply solve the calories equation backwards; estimating fat from calories requires a new regression with fat as the response.
b) First, calculate the mean of the calories: 410 + … + 660, all divided by 7, gives 590. Second, calculate the mean of the fat content: 19 + … + 43, all divided by 7, gives about 34.286.
Third, calculate the sample standard deviations of the calories and the fat content. For each variable, subtract the mean from each data value, square the differences, and add the squares. Divide that sum by the number of values minus one (7 - 1 = 6) and take the square root. The standard deviation of the calories is about 89.8; the standard deviation of the fat content is about 7.80.
Fourth, multiply each calorie deviation by the matching fat deviation and add up the products. Divide that sum by (7 - 1) times the standard deviation of the calories times the standard deviation of the fat content to get the correlation coefficient. You should get r ≈ 0.9606.
Fifth, plug these numbers into the formula for the regression line, this time with calories as x and fat as y. Remember: ŷ = b0 + b1x, with b1 = r(sy/sx) and b0 = ȳ - b1x̄.
So b1 = 0.9606 × (7.80/89.8) ≈ 0.083, and therefore b0 = 34.286 - (0.083)(590) ≈ -15, giving the equation ŷ = -15 + 0.083x.
Sixth, plug in 600 for x: -15 + 0.083(600) ≈ 34.8 grams of fat.
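A minimal Python sketch fitting the appropriate model (fat as the response, calories as the predictor). The second 39 g fat value is inferred from the stated mean fat of 34.286 g, and the unrounded coefficients give a prediction of roughly 35 g, close to the rounded hand answer of 34.8 g:

import numpy as np

# Fat (g) and calories for the seven burgers (Problem 37);
# the second 39 g value is inferred from the stated mean fat of 34.286 g
fat = np.array([19, 31, 34, 35, 39, 39, 43], dtype=float)
cal = np.array([410, 580, 590, 570, 640, 680, 660], dtype=float)

r = np.corrcoef(cal, fat)[0, 1]
b1 = r * fat.std(ddof=1) / cal.std(ddof=1)   # slope for predicting fat from calories
b0 = fat.mean() - b1 * cal.mean()            # intercept

print(f"predicted fat = {b0:.1f} + {b1:.4f} * calories")   # about -15 + 0.083*cal
print("fat for a 600-calorie burger:", b0 + b1 * 600)      # about 35 g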