Published by Meagan Harvey. Modified over 9 years ago.
1
Copyright © 2014 Pearson Education, Inc. All rights reserved.
Chapter 4 Regression Analysis: Exploring Associations between Variables
2
Learning Objectives
Be able to write a concise and accurate description of an association between two continuous variables based on a scatterplot.
Understand how to use a regression line to summarize a linear association between two continuous variables.
Interpret the intercept and slope of a regression line in context and know how to use the regression line to predict mean values of the response variable.
Critically evaluate a regression model.
3
4.1 Visualizing Variability with a Scatterplot
4
Scatterplots
A scatterplot is used to investigate whether two numerical variables have a positive association, a negative association, or no association.
Example: In states where women tend to marry at an older age, men also tend to marry at an older age.
5
Positive Trend
Older cars tend to have more miles than newer cars.
Newer cars tend to have fewer miles than older cars.
There is a positive association between car age and miles the car has been driven.
6
Negative Trend
Countries with higher literacy rates tend to have fewer births per woman.
Countries with lower literacy rates tend to have more births per woman.
There is a negative association between literacy rate and births per woman.
7
No Trend
There is no trend between the speed and age of a marathon runner.
Knowing the age of a marathon runner does not help predict the runner’s speed.
There is no association between a marathon runner’s age and speed.
8
Strength of Association
If, for each value of x, there is a small spread of y values, then there is a strong association between x and y.
If, for each value of x, there is a large spread of y values, then there is a weak or no association between x and y.
If the association between x and y is strong, then x is a good predictor of y; if it is weak, then x is a poor predictor of y.
9
[Figure: Strength of Association]
10
Linear Trends
A trend is linear if there is a line such that the points in general do not stray far from the line.
Linear trends are the easiest to work with.
Example: There is a positive linear association between the number of searches for “Vampire” and the number of searches for “Zombie”.
11
Other Shapes
Nonlinear associations also occur, but they are covered in more advanced statistics courses.
Only use the techniques from this chapter when there is a linear trend.
12
Summary of Analysis of the Scatterplot
Look to see if there is a trend or association.
Determine the strength of the trend. Is the association strong or weak?
Look at the shape of the trend. Is it linear? Is it nonlinear?
13
Writing Clear Descriptions Based on Association
Good: People who have higher salaries tend to travel farther on vacation.
Good: A person who has a high salary is predicted to travel far on vacation.
Bad: Because they have higher salaries, they travel farther.
Bad: A person with a high salary will travel farther on vacation.
14
4.2 Measuring Strength of Association with Correlation
15
The Correlation Coefficient r
The correlation coefficient is a number, r, that measures the strength of the linear association between two variables.
-1 ≤ r ≤ 1
If r is close to 1, then there is a strong positive linear association.
If r is close to -1, then there is a strong negative linear association.
If r is close to 0, then there is a weak linear association or none at all.
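The slides compute r with StatCrunch. As a rough illustration of what the number measures, here is a minimal Python sketch of the standard Pearson formula. The function name `correlation` and the car data are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired samples xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: car age (years) and odometer reading (thousands of miles).
age = [1, 2, 3, 4, 5, 6]
miles = [12, 25, 31, 48, 55, 70]

r = correlation(age, miles)
print(round(r, 3))  # close to 1: strong positive linear association
```

Points that fall exactly on a rising line give r = 1, and points exactly on a falling line give r = -1.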
16
[Figure: Positive Correlation]
17
[Figure: Weak or No Correlation]
18
[Figure: Negative Correlation]
19
Interpreting Correlation
The correlation between the daily numbers of swimsuits and ski jackets purchased in an apparel store is r = -0.96.
There is a strong negative correlation between daily swimsuit purchases and daily ski jacket purchases.
On days with strong swimsuit sales, one predicts that ski jacket sales would be weak.
This does not mean that people who buy swimsuits are causing potential ski jacket buyers not to buy.
20
Using StatCrunch to Find r
Enter the data.
Stat → Regression → Simple Linear
Select the variables.
Calculate.
21
Switching x and y
r for life expectancy, Women vs. Men: r = 0.977
r for life expectancy, Men vs. Women: r = 0.977
Switching x and y has no effect on r.
22
Correlation, Arithmetic, and Units
Multiplying all x’s or all y’s by a positive constant does not change r. (A negative constant reverses the sign of r but not its size.)
Adding the same constant to all x’s or all y’s does not change r.
Changing units such as in→cm or ºF→ºC does not change r.
r is unitless.
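These properties can be checked numerically. The sketch below uses hypothetical temperature and snow-depth data and a hand-written `correlation` helper (both assumptions, not from the slides) to confirm that r survives switching x and y and changing units.

```python
from math import isclose, sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired samples."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: daily high temperature (°F) and snow depth (inches).
temp_f = [10, 20, 25, 30, 35, 40]
snow_in = [30, 24, 20, 15, 9, 4]

r = correlation(temp_f, snow_in)
temp_c = [(t - 32) * 5 / 9 for t in temp_f]  # °F -> °C
snow_cm = [2.54 * s for s in snow_in]        # in -> cm

same_when_switched = isclose(r, correlation(snow_in, temp_f))
same_after_units = isclose(r, correlation(temp_c, snow_cm))
# Multiplying by a negative constant flips the sign of r.
sign_flips = isclose(-r, correlation([-t for t in temp_f], snow_in))
print(same_when_switched, same_after_units, sign_flips)
```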
23
Correlation and Linearity and Outliers
Only use linear correlation to interpret the data when there is a linear relationship.
An outlier can strongly influence the correlation.
24
4.3 Modeling Linear Trends
25
Least Squares Regression Line
The regression line is the “best fit” line for the data.
The line minimizes the sum (equivalently, the average) of the squared vertical distances between the points and the line.
It is useful only for data that follow a linear trend.
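To make “best fit” concrete, the following sketch computes the least squares intercept and slope from the usual closed-form formulas and checks that nudging either one increases the sum of squared vertical distances. The data and function names are hypothetical.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the
    sum of squared vertical distances to the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.0, 5.6, 6.5, 7.9, 9.0]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Any other line does at least as badly as the least squares line.
worse_intercept = sum_squared_residuals(xs, ys, a + 0.1, b)
worse_slope = sum_squared_residuals(xs, ys, a, b + 0.1)
print(best < worse_intercept, best < worse_slope)
```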
26
StatCrunch and the Regression Line
Enter the data.
Stat → Regression → Simple Linear
Select the x and y variables.
Select: Plot the Fitted Line
Calculate.
27
Using the Regression Line
Predict the revenue per day when the company spends $500 per month on ads.
Predicted Rev = 1.8 + 1.2(5) = 7.8 (ad spending and revenue are both measured in hundreds of dollars)
The company’s daily revenue is predicted to be $780 when it spends $500 per month on ads.
28
Interpreting the Slope
The slope is the coefficient in front of x in the regression line equation.
Rise/Run means that if x is increased by 1, then y is predicted to change, on average, by the slope value.
The slope is only meaningful if the data follow a linear model.
29
Interpreting the Slope
The slope is 1.2.
If x is increased by 1, y has an average increase of 1.2.
For every additional $100 per month the company spends on ads, its daily revenue is, on average, $120 higher.
30
Interpreting the y-intercept
The y-intercept is the predicted value of y when x is 0.
Use the y-intercept to interpret the data only when:
It makes sense to have a value of 0 for x.
The calculated y-intercept value is meaningful.
The data include values equal to or close to 0.
31
Interpreting the y-intercept
The y-intercept is 1.8.
If the company spends no money, $0, on advertising, it is predicted to have an average daily revenue of $180.
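The advertising example can be reproduced in a few lines. This sketch uses the intercept 1.8 and slope 1.2 from the slides; the function name `predicted_revenue` is hypothetical, and both quantities are assumed to be in hundreds of dollars, as the $500 → 5 substitution implies.

```python
def predicted_revenue(ad_spend_hundreds):
    """Predicted daily revenue (hundreds of dollars) from monthly
    ad spending (hundreds of dollars), using the line from the slides."""
    intercept = 1.8
    slope = 1.2
    return intercept + slope * ad_spend_hundreds

# Prediction: $500 per month on ads.
at_500 = round(predicted_revenue(5) * 100)

# y-intercept: predicted daily revenue with $0 spent on ads.
at_0 = round(predicted_revenue(0) * 100)

# Slope: average change in daily revenue per additional $100 of ads.
per_100 = round((predicted_revenue(6) - predicted_revenue(5)) * 100)

print(at_500, at_0, per_100)
```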
32
Why Not to Use the y-intercept
A sample of high school freshmen and sophomores resulted in a regression equation that relates age in years (x) to height in inches: predicted height = -9.2 + 4.9x
The y-intercept is -9.2. A height of -9.2 inches is meaningless.
The sample only included teenagers. The age of 0 years is too far from the ages in the sample.
The slope is meaningful. On average, high school freshmen and sophomores are 4.9 inches taller for each additional year of age.
33
Correlation is Not Causation
A strong correlation is not evidence of a cause-and-effect relationship.
Do not use the words “causes”, “makes”, “will”, “because”, etc. when making conclusions based on regression analysis.
Do use the words “predict”, “tends”, and “on average”.
34
More on the Regression Line
Unlike r, the regression equation does change when x and y are switched.
If the linear model is a “good fit” for the data, then the mean value of y for a given x will lie close to the regression line.
35
4.4 Evaluating the Linear Model
36
Nonlinear Data
If you can’t imagine a line, don’t try to find one.
If the association is not linear, don’t attempt to find or interpret r or the equation of the least squares regression line.
37
Slope and Causation
Predicted Salary = 22,000 + 8,000 (College Years)
Wrong: Each year in college results in an additional salary increase of $8,000.
Wrong: A person with one more year of college education will earn an extra $8,000.
Correct: On average, people with one more year of college education tend to earn an extra $8,000.
38
Beware of Outliers
Outliers can have a strong effect on both the correlation and the equation of the regression line.
An outlier that strongly affects the regression line is called an influential point.
When there is an influential point present, perform regression analysis both with and without the influential point.
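The “with and without” comparison can be sketched as follows. The five base points and the single outlier are hypothetical values invented to show how one point far from the others can change the fitted slope.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data that follow a clear upward linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

# One hypothetical point far from the rest, in both x and y.
outlier_x, outlier_y = 15, 3.0

a_without, b_without = least_squares_line(xs, ys)
a_with, b_with = least_squares_line(xs + [outlier_x], ys + [outlier_y])

print(f"without outlier: predicted y = {a_without:.2f} + {b_without:.2f}x")
print(f"with outlier:    predicted y = {a_with:.2f} + {b_with:.2f}x")
```

Reporting both fits lets the reader see how much of the conclusion depends on that one point.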
39
[Figure: Example of an Influential Point]
40
Regression of Aggregate Data
Using aggregate data for regression means that each point represents the mean of all the y-values with a given x-value.
When using aggregate data, be sure to include the word “mean” in all interpretations.
41
Aggregate Data
There is a weak correlation between math SAT scores and critical reading SAT scores.
There is a strong correlation between states’ mean math SAT scores and states’ mean critical reading SAT scores.
42
Don’t Extrapolate
Only use the regression line to predict y-values for x-values that are within or near the range of the data.
Predicted Height = 31.78 + 2.45 (Age)
Predict for a 50-year-old: 31.78 + 2.45(50) = 154.28 inches
The predicted height of a 50-year-old man is over 12 feet, which is clearly unreasonable.
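One way to guard against extrapolation in code is to refuse predictions outside the observed x-range. The sketch below uses the height equation from the slide, but the age range of 2 to 12 years and the tolerance of 1 year are assumptions made for illustration; the slides do not state the range of the data.

```python
def predicted_height(age_years, data_range=(2, 12), tolerance=1):
    """Predicted height in inches from age in years.

    data_range and tolerance are hypothetical: they stand in for the
    (unstated) range of ages in the sample used to fit the line.
    """
    low, high = data_range
    if not (low - tolerance <= age_years <= high + tolerance):
        raise ValueError(
            f"age {age_years} is outside the data range {data_range}; "
            "the regression line should not be used here"
        )
    return 31.78 + 2.45 * age_years

inside = predicted_height(10)  # within the assumed range
print(f"age 10: {inside:.2f} inches")

try:
    predicted_height(50)       # far outside the assumed range
except ValueError as err:
    print("refused:", err)
```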
43
Coefficient of Determination r²
r² measures how much of the variation in the response variable, y, can be explained by the explanatory variable, x.
r² is used to help determine which explanatory variable would be best for making predictions about the response variable.
44
Coefficient of Determination Example
60.5% of the variation in the value of cars can be explained by the age of the car.
The other 39.5% cannot be explained by the age of the car.
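The “explained variation” reading of r² can be checked directly: for a least squares line, 1 minus the ratio of leftover (residual) variation to total variation equals the square of r. The car ages and values below are hypothetical and are not the data behind the 60.5% figure.

```python
from math import isclose, sqrt
from statistics import mean

# Hypothetical data: car age (years) and value (thousands of dollars).
age = [1, 2, 3, 4, 5, 6, 7, 8]
value = [24, 22.5, 19, 18.5, 15, 15.5, 11, 9]

x_bar, y_bar = mean(age), mean(value)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, value))
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in value)  # total variation in y

# Least squares line and the variation it leaves unexplained.
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(age, value))

r = sxy / sqrt(sxx * syy)
r_squared = 1 - sse / syy  # fraction of variation in y explained by x

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"{100 * r_squared:.1f}% of the variation in value is explained by age")
```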
45
Chapter 4 Case Study
46
Scatterplot of City Government Income vs. Private Meter Income Without Brinks
Weak positive linear association
Predicted Collection = 688497 + 145.5 (City Income)
47
Are Brinks Employees Stealing from Parking Meters?
New York City contracted Brinks to collect parking meter money.
The city suspects that employees are keeping some of it.
There are data on the monthly meter collection of honest (not Brinks) collectors vs. the city’s total income for that month.
48
Predicted vs. Actual Brinks Collection
Predicted Collection = 688497 + 145.5 (City Income)
One month, City Income was $7016 and Brinks collected $1,330,143.
Predicted: 688497 + 145.5(7016) = $1,709,325
Discrepancy: 1,709,325 - 1,330,143 = $379,182
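The arithmetic on this slide is easy to verify. The sketch below uses the slide's equation and figures; only the function and variable names are hypothetical.

```python
def predicted_collection(city_income):
    """Predicted monthly meter collection from the city's income,
    using the regression line fitted to the non-Brinks months."""
    return 688497 + 145.5 * city_income

actual_brinks = 1_330_143                    # what Brinks collected that month
predicted = round(predicted_collection(7016))
discrepancy = predicted - actual_brinks      # shortfall relative to the prediction

print(f"predicted:   ${predicted:,}")
print(f"discrepancy: ${discrepancy:,}")
```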
49
Comparing Brinks vs. Honest Employees
Conclusion: Income when Brinks is working is clearly lower than when the honest employees are working.
50
Chapter 4 Guided Exercise 1
51
Does the Cost of a Flight Depend on the Distance?
How much would it cost to fly 500 miles?
Use a complete regression analysis.
52
Create a Scatterplot
Since the cost tends to increase as mileage increases and since there is no apparent strong curvature, the linear model is appropriate.
53
The Regression Line
Interpret the slope: 0.08. For every additional mile, on average, the price goes up by $0.08.
Interpret the y-intercept: 163. This is the predicted price for a 0-mile flight. The y-intercept is meaningless here.
54
Answer the Question
How much would it cost to fly 500 miles?
Predicted Cost = 162.60 + 0.0796 (miles)
162.60 + 0.0796(500) = 202.40
A 500-mile flight is predicted to cost $202.40.
55
Chapter 4 Guided Exercise 2
56
Test Scores: Slope
The summary statistics for the midterm and final exam scores are:
Midterm: Mean = 75, Standard Dev. = 10
Final: Mean = 75, Standard Dev. = 10
r = 0.7, n = 20
First find the slope: b = r × (Standard Dev. of Final / Standard Dev. of Midterm) = 0.7 × (10 / 10) = 0.7
57
Test Scores: y-intercept
Midterm: Mean = 75, Standard Dev. = 10
Final: Mean = 75, Standard Dev. = 10
r = 0.7, n = 20, b = 0.7
Then find the y-intercept a from the equation: a = (Mean of Final) - b × (Mean of Midterm) = 75 - 0.7(75) = 22.5
58
Test Scores: Regression Line
Midterm: Mean = 75, Standard Dev. = 10
Final: Mean = 75, Standard Dev. = 10
r = 0.7, n = 20, b = 0.7, a = 22.5
Write out the following equation: Predicted = a + bx
Predicted Final Score = 22.5 + 0.7(Midterm Score)
Use the equation to predict the final score for a midterm score of 95.
Predicted Final = 22.5 + 0.7(95) = 89
This is less than 95 since the slope is less than 1.
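The whole guided exercise can be reproduced from the summary statistics alone, using the standard relations slope = r × (s_y / s_x) and intercept = (mean of y) - slope × (mean of x). The function name `line_from_summary` is hypothetical; the numbers are the ones given on the slides.

```python
def line_from_summary(x_mean, x_sd, y_mean, y_sd, r):
    """Return (intercept a, slope b) of the regression line of y on x,
    computed from means, standard deviations, and the correlation r."""
    b = r * y_sd / x_sd       # slope
    a = y_mean - b * x_mean   # intercept: line passes through the two means
    return a, b

# Midterm is x, final is y.
a, b = line_from_summary(x_mean=75, x_sd=10, y_mean=75, y_sd=10, r=0.7)
predicted_final = a + b * 95  # predicted final score for a midterm score of 95

print(f"Predicted Final Score = {a:.1f} + {b:.1f}(Midterm Score)")
print(f"Midterm 95 -> predicted final {predicted_final:.1f}")
```

Because the slope is less than 1, a midterm score 20 points above the mean maps to a predicted final score only 14 points above the mean.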