Chapter 15: Describing Relationships: Regression, Prediction, and Causation.


1 Chapter 15 Describing Relationships: Regression, Prediction, and Causation

2 Chapter 15 Linear Regression
Objective: to quantify the linear relationship between an explanatory variable and a response variable, so that we can predict the average response for all subjects with a given value of the explanatory variable.
Regression equation: ŷ = a + bx
– x is the value of the explanatory variable
– ŷ is the predicted average value of the response variable
– a and b are just the intercept and slope of a straight line
– r and b are not the same thing, but their signs will agree

3 Chapter 15 Thought Question 1
How would you draw a line through the points? How do you determine which line ‘fits best’?

4 Chapter 15 Linear Equations

5 Chapter 15 The Linear Model
Remember from Algebra that a straight line can be written as y = mx + b.
In Statistics we use a slightly different notation: ŷ = b₀ + b₁x.
We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.

6 Chapter 15 Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

7 Chapter 15 Residuals
The model won’t be perfect, regardless of the line we draw. Some points will be above the line and some will be below.
The estimate made from a model is the predicted value (denoted as ŷ).

8 Chapter 15 Residuals (cont.)
The difference between the observed value and its associated predicted value is called the residual.
To find the residuals, we always subtract the predicted value from the observed one: residual = y − ŷ.

9 Chapter 15 Residuals (cont.)
A negative residual means the predicted value is too big (an overestimate).
A positive residual means the predicted value is too small (an underestimate).

10 Chapter 15 “Best Fit” Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So, we can’t assess how well the line fits by adding up all the residuals.
Similar to what we did with deviations, we square the residuals and add the squares. The smaller the sum, the better the fit.
The line of best fit is the line for which the sum of the squared residuals is smallest.

11 Chapter 15 Least Squares
Used to determine the “best” line.
We want the line to be as close as possible to the data points in the vertical (y) direction, since that is what we are trying to predict.
Least squares: use the line that minimizes the sum of the squares of the vertical distances of the data points from the line.
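The least-squares criterion above can be sketched in a few lines of Python. The data and the two candidate lines below are made up for illustration; only the criterion itself (comparing sums of squared vertical distances) comes from the slides.

```python
# Sum of squared residuals for a candidate line yhat = a + b*x.
def sum_sq_residuals(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical values, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Compare two candidate lines: the steeper one fits these points far
# better (smaller sum of squared residuals), so least squares prefers it.
steep = sum_sq_residuals(0.0, 2.0, xs, ys)
flat = sum_sq_residuals(1.0, 1.5, xs, ys)
print(steep, flat)
```

The line of best fit is simply the choice of a and b that makes this sum as small as possible over all possible lines.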

12 Chapter 15 The Linear Model (cont.)
We write b₁ and b₀ for the slope and intercept of the line. The b’s are called the coefficients of the linear model.
The coefficient b₁ is the slope, which tells us how rapidly ŷ changes with respect to x.
The coefficient b₀ is the intercept, which tells where the line hits (intercepts) the y-axis.

13 Chapter 15 The Least Squares Line
In our model, we have a slope (b₁):
– The slope is built from the correlation and the standard deviations: b₁ = r·(sy/sx), where sy and sx are the standard deviations of y and x.
– Our slope is always in units of y per unit of x.
– The slope has the same sign as the correlation coefficient.

14 Chapter 15 The Least Squares Line (cont.)
In our model, we also have an intercept (b₀):
– The intercept is built from the means and the slope: b₀ = ȳ − b₁·x̄, where ȳ and x̄ are the means of y and x.
– Our intercept is always in units of y.
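A minimal sketch, on made-up data, of building the slope and intercept from the correlation, standard deviations, and means exactly as the two slides above describe:

```python
from math import sqrt

# Toy data (hypothetical values, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sample standard deviations and the correlation coefficient r.
s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x       # slope: built from the correlation and the SDs
b0 = y_bar - b1 * x_bar  # intercept: built from the means and the slope
print(b1, b0)
```

Note that b1 comes out in units of y per unit of x, and b0 in units of y, as the slides state.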

15 Chapter 15 Example
Fill in the missing information in the table below:

16 Chapter 15 Interpretation of the Slope and Intercept
The slope indicates the amount by which ŷ changes when x changes by one unit.
The intercept is the value of ŷ when x = 0. It is not always meaningful.

17 Chapter 15 Example
The regression line for the Burger King data is ŷ = 6.8 + 0.97x, where x is protein (g) and ŷ is predicted fat (g). Interpret the slope and the intercept.
Slope: for every one-gram increase in protein, the predicted fat content increases by 0.97 g.
Intercept: a BK meal that has 0 g of protein is predicted to contain 6.8 g of fat.

18 Chapter 15 Thought Question 2
From a long-term study on several families, researchers constructed a scatterplot of the cholesterol level of a child at age 50 versus the cholesterol level of the father at age 50. You know the cholesterol level of your best friend’s father at age 50. How could you use this scatterplot to predict what your best friend’s cholesterol level will be at age 50?

19 Chapter 15 Predictions
In predicting a value of y based on some given value of x:
1. If there is not a linear correlation, the best predicted y-value is ȳ.
2. If there is a linear correlation, the best predicted y-value is found by substituting the x-value into the regression equation.

20 Chapter 15 Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well:
– The equation is ŷ = 6.8 + 0.97x.
– The predicted fat content for a BK Broiler chicken sandwich that contains 30 g of protein is 6.8 + 0.97(30) = 35.9 grams of fat.
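Since the slide gives the fitted equation explicitly, the prediction can be checked directly. This tiny sketch just evaluates the line (the function name is ours, not from the slides):

```python
# Regression equation from the slide: predicted fat (g) = 6.8 + 0.97 * protein (g).
def predicted_fat(protein_g):
    return 6.8 + 0.97 * protein_g

# A BK Broiler chicken sandwich with 30 g of protein:
print(round(predicted_fat(30), 1))  # 35.9 g of fat
```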

21 Chapter 15 Prediction via Regression Line: Husband and Wife Ages
(Hand et al., A Handbook of Small Data Sets, London: Chapman and Hall)
– The regression equation is ŷ = 3.6 + 0.97x, where ŷ is the average age of all husbands who have wives of age x.
– For all women aged 30, we predict the average husband age to be 3.6 + (0.97)(30) = 32.7 years.
– Suppose we know that an individual wife’s age is 30. What would we predict her husband’s age to be?

22 Chapter 15 The Least Squares Line (cont.)
Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
– Quantitative Variables Condition
– Straight Enough Condition
– Outlier Condition

23 Chapter 15 Guidelines for Using the Regression Equation
1. If there is no linear correlation, don’t use the regression equation to make predictions.
2. When using the regression equation for predictions, stay within the scope of the available sample data.
3. A regression equation based on old data is not necessarily valid now.
4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.

24 Chapter 15 Definitions
– Marginal change: refers to the slope; the amount the response variable changes when the explanatory variable changes by one unit.
– Outlier: a point lying far away from the other data points.
– Influential point: an outlier that has the potential to change the regression line.

25 Chapter 15 Residuals Revisited
Residuals help us to see whether the model makes sense.
When a regression model is appropriate, nothing interesting should be left behind.
After we fit a regression model, we usually plot the residuals in the hope of finding…nothing.

26 Chapter 15 Residual Plot Analysis
If a residual plot does not reveal any pattern, the regression equation is a good representation of the association between the two variables.
If a residual plot reveals some systematic pattern, the regression equation is not a good representation of the association between the two variables.

27 Chapter 15 Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring:

28 Chapter 15 Coefficient of Determination (R²)
Measures the usefulness of the regression prediction.
R² (or r², the square of the correlation) measures the percentage of the variation in the values of the response variable (y) that is explained by the regression line:
– r = 1: R² = 1: the regression line explains all (100%) of the variation in y
– r = 0.7: R² = 0.49: the regression line explains about half (49%) of the variation in y
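The claim that R² is the fraction of y’s variation explained by the line can be verified numerically. The data below are made up; the sketch fits the least-squares line, computes the explained fraction 1 − SSE/SST, and checks that it matches the squared correlation.

```python
from math import sqrt

# Toy data (hypothetical values, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
syy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y (SST)

b1 = sxy / sxx                # least-squares slope
b0 = y_bar - b1 * x_bar       # least-squares intercept
r = sxy / sqrt(sxx * syy)     # correlation coefficient

# Fraction of the variation in y explained by the regression line ...
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy
# ... equals the square of the correlation:
print(round(r_squared, 6), round(r ** 2, 6))
```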

29 Chapter 15 R² (cont.)
Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.

30 Chapter 15 A Caution: Beware of Extrapolation
Sarah’s height was plotted against her age.
Can you predict her height at age 42 months?
Can you predict her height at age 30 years (360 months)?

31 Chapter 15 A Caution: Beware of Extrapolation
Regression line: ŷ = 71.95 + 0.383x (height in cm, age in months)
Height at age 42 months? ŷ ≈ 88 cm.
Height at age 30 years? ŷ ≈ 209.8 cm.
– She is predicted to be 6' 10.5" at age 30.
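Sarah’s two predictions can be reproduced from the regression line on the slide (the function name is ours; units assumed to be centimeters and months):

```python
# Regression line from the slide: height (cm) = 71.95 + 0.383 * age (months).
def predicted_height_cm(age_months):
    return 71.95 + 0.383 * age_months

# Within the range of Sarah's observed ages, the prediction is sensible:
print(round(predicted_height_cm(42), 1))   # about 88 cm

# Extrapolating to age 30 years (360 months) gives an absurd answer:
print(round(predicted_height_cm(360), 1))  # about 209.8 cm (6' 10.5")
```

The arithmetic is fine in both cases; what fails at 360 months is the assumption that the linear pattern continues far beyond the observed data.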

32 Chapter 15 Correlation Does Not Imply Causation
Even very strong correlations may not correspond to a real causal relationship.

33 Chapter 15 Evidence of Causation
A properly conducted experiment establishes the connection.
Other considerations:
– A reasonable explanation for a cause and effect exists
– The connection happens in repeated trials
– The connection happens under varying conditions
– Potential confounding factors are ruled out
– The alleged cause precedes the effect in time

34 Chapter 15 Evidence of Causation
An observed relationship can be used for prediction without worrying about causation as long as the patterns found in past data continue to hold true.
We must make sure that the prediction makes sense.
We must be very careful of extreme extrapolation.

35 Chapter 15 Reasons Two Variables May Be Related (Correlated)
– The explanatory variable causes a change in the response variable
– The response variable causes a change in the explanatory variable
– The explanatory variable may be a contributing cause, but not the sole cause, of changes in the response variable
– Confounding variables may exist
– Both variables may result from a common cause, such as both variables changing over time
– The correlation may be merely a coincidence

36 Chapter 15 Response Causes Explanatory
Explanatory: Hotel advertising dollars
Response: Occupancy rate
– Positive correlation? More advertising leads to increased occupancy rate?
– Actual correlation is negative: lower occupancy leads to more advertising

37 Chapter 15 Explanatory Is Not the Sole Contributor
Explanatory: Consumption of barbecued foods
Response: Incidence of stomach cancer
– Barbecued foods are known to contain carcinogens, but other lifestyle choices may also contribute

38 Chapter 15 Common Response (both variables change due to a common cause)
Explanatory: Divorce among men
Response: Percent abusing alcohol
– Both may result from an unhappy marriage

39 Chapter 15 Both Variables Are Changing Over Time
Both divorces and suicides have increased dramatically since 1900. Are divorces causing suicides? Are suicides causing divorces?
The population has increased dramatically since 1900, causing both to increase.
– Better to investigate: has the rate of divorce or the rate of suicide changed over time?

40 Chapter 15 The Relationship May Be Just a Coincidence
We will see some strong correlations (or apparent associations) just by chance, even when the variables are not related in the population.

41 Chapter 15 Coincidence (?): Vaccines and Brain Damage
A required whooping cough vaccine was blamed for seizures that caused brain damage.
– This led to reduced production of the vaccine (due to lawsuits).
A study of 38,000 children found no evidence for the accusations (reported in the New York Times):
– “people confused association with cause-and-effect”
– “virtually every kid received the vaccine…it was inevitable that, by chance, brain damage caused by other factors would occasionally occur in a recently vaccinated child”

42 Chapter 15 Key Concepts
– Least squares regression equation
– R²
– Correlation does not imply causation
– Confirming causation
– Reasons variables may be correlated

