Published by Σαπφειρη Αρβανίτης. Modified over 6 years ago.
1
Week 5 Lecture 2 Chapter 8. Regression Wisdom
2
Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How has the percentage of people who smoke changed since the danger became clear during the last half of the 20th century?
3
Percentage of Men Smokers (18 – 24 years of age) from 1965 through 2009
The scatterplot shows the percentage of smokers among men 18–24 years of age, as estimated by surveys, from 1965 through 2009. The percent of men age 18–24 who are smokers decreased dramatically between 1965 and 1990, but the trend has not been consistent since then. The association between percent of men age 18–24 who smoke and year is very strong from 1965 to 1990, but is erratic after 1990. Because the relationship is not straight, a linear model is not appropriate for the trend in the percent of males age 18–24 who are smokers.
The regression equation is: male smoking % = Year
R-sq = 0.7047 (70.47%)
4
Checking the Assumptions of Regression Model
The residuals appear to be approximately normally distributed.
5
Checking the Assumptions of Regression Model
Plot: Residuals vs. Predictor Variable (Year). The nonlinearity is more prominent in this plot. The residuals are not randomly scattered around the zero line, and they are not evenly spread out. Instead, they form a curved pattern, so the linear regression model is not appropriate.
6
Checking the Assumptions of Regression Model
No regression analysis is complete without a display of the residuals to check that the linear model is reasonable. Residuals often reveal subtleties that were not clear from a plot of the original data (e.g. a scatterplot of y vs. x). Sometimes they reveal violations of the regression conditions that require our attention. It is good to look at both a histogram of the residuals (or a histogram of standardized residuals, or the normal QQ plot of residuals) and a scatterplot of the residuals vs. the predictor variable.
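As a rough illustration of the residual check described above, the following minimal Python sketch fits a least-squares line to curved data and lists the residuals against the predictor. The data are made up for the illustration (they are not the CDC smoking figures), and `fit_line` is a hypothetical helper name, not something from the lecture or from StatCrunch.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up curved data (assumption): a steep early drop that levels off.
x = list(range(11))
y = [50 - 3 * xi + 0.15 * xi ** 2 for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Residuals vs. predictor: positive, then negative, then positive again
# is the curved pattern that signals a straight line is the wrong model.
for xi, ri in zip(x, residuals):
    print(f"x={xi:2d}  residual={ri:+.2f}")
```

With these numbers the residuals are positive at both ends of the x-range and negative in the middle, even though the fitted line itself looks like a reasonable downward trend.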
7
Percentage of Both Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
The Centers for Disease Control and Prevention (CDC) tracks cigarette smoking in the US. How have the percentages of men and women who smoke changed since the danger became clear during the last half of the 20th century?
8
Scatterplot for Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
Smoking rates for both men and women in the US have decreased significantly over the time period from 1965 to 2009. Smoking rates are generally lower for women than for men. The trend in the smoking rates for women seems a bit straighter than the trend for men. The apparent curvature in the scatterplot for the men could possibly be due to just a few points, and not an indication of a serious violation of the linearity condition.
9
Scatterplot for Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
StatCrunch Command: Graph > Scatter Plot
X-variable: Year
Y-Variable: Smoking %
Group by: Sex
Grouping Options: Color points by group
Overlay polynomial order: 1
Group properties: Color scheme: Alternate – 7 colors
Click Compute
10
Men and Women Smokers (18 – 24 years of age) from 1965 through 2009
Graph on the left: Not taking group into account
Graph on the right: Points identified by group (male or female)
11
Men and Women Smokers (18 – 24 years of age) from 1965 through 2009: Not taking group into account
Smoking % = Year
Sample size: 34
R (correlation coefficient) =
R-sq =
12
Analysis of Residual Points
The residuals appear to fall into two groups.
13
Analysis of Residual Points
An examination of residuals often leads us to discover groups of observations that are different from the rest. A histogram of the residuals might show multiple modes. When we discover there is more than one group in a regression, we may decide to analyze the groups separately, using a different model for each group.
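To see how residuals from a pooled fit can expose groups, here is a minimal Python sketch. The two series are invented for the illustration (they are not the survey data), and `fit_line` is a hypothetical helper. Both groups share the same slope but sit at different levels, so a single line fitted to everything leaves all of one group above the line and all of the other below it.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data (assumption): same downward trend, different starting levels.
years = list(range(10))
group_a = [50 - t for t in years]
group_b = [40 - t for t in years]

# Pooled fit that ignores the grouping variable.
x = years + years
y = group_a + group_b
slope, intercept = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
res_a = residuals[:len(years)]
res_b = residuals[len(years):]
print("pooled residuals, group A:", [round(r, 2) for r in res_a])
print("pooled residuals, group B:", [round(r, 2) for r in res_b])

# Separate fits, one model per group.
fit_a = fit_line(years, group_a)
fit_b = fit_line(years, group_b)
print("group A (slope, intercept):", fit_a)
print("group B (slope, intercept):", fit_b)
```

A histogram of the pooled residuals here would have two separate clusters, which is the kind of signal that suggests fitting each group on its own.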
14
Outliers
Any point that stands away from the others can be called an outlier and deserves your special attention. Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
15
High Leverage Points
A data point that has an x-value far from the mean of the x-values is called a high leverage point.
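The slides describe leverage informally. One common way to put a number on it in simple linear regression is the leverage value h_i = 1/n + (x_i - mean of x)^2 / (sum of squared deviations of x). That formula is a standard result but is not given in the slides, and the x-values below are made up, so treat this Python sketch as an illustration only.

```python
def leverages(x):
    """Leverage of each point in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up x-values (assumption): five clustered points and one far away.
x = [1, 2, 3, 4, 5, 20]
h = leverages(x)

for xi, hi in zip(x, h):
    print(f"x={xi:2d}  leverage={hi:.3f}")

# In simple regression the leverages always add up to 2
# (one for the intercept, one for the slope).
print("sum of leverages:", round(sum(h), 6))
```

The point at x = 20 gets a leverage close to 1, while the clustered points share the remainder. Note that leverage depends only on the x-values, not on y.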
16
Influential Observations
A data point is influential if omitting it from the analysis gives a very different model.
Example: Relationship between murder rate and poverty level for 51 observations (the 50 states plus DC).
Note: DC is far from the rest of the data and does not follow the direction of the overall pattern.
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = Poverty Rate
Sample size: 51
R (correlation coefficient) =
R-sq =
Estimate of error standard deviation:
17
Omitting the Observation for DC
Example: Relationship between murder rate and poverty level for the 50 states (excluding DC).
Dependent Variable: Murder Rate
Independent Variable: Poverty Rate
Murder Rate = Poverty Rate
Sample size: 50
R (correlation coefficient) =
R-sq =
18
High Leverage Point BUT Not An Influential Observation
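The difference between an influential point and a high leverage point that is not influential can be sketched by refitting with and without the point. The Python below uses invented numbers (not the murder rate and poverty data), and `fit_line` is a hypothetical helper. In case A the far point falls well off the pattern of the others; in case B it has the same extreme x-value but lies on the same line.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up data (assumption): five points exactly on y = 2x.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 4, 6, 8, 10]
slope_base, intercept_base = fit_line(base_x, base_y)

# Case A: add (20, 5), far in x AND off the pattern -> influential.
slope_a, intercept_a = fit_line(base_x + [20], base_y + [5])

# Case B: add (20, 40), far in x but on the pattern -> high leverage only.
slope_b, intercept_b = fit_line(base_x + [20], base_y + [40])

print(f"without extra point: slope={slope_base:.3f}")
print(f"case A (off the line): slope={slope_a:.3f}")
print(f"case B (on the line):  slope={slope_b:.3f}")
```

Omitting the case A point changes the slope from nearly 0 back to 2, so that point is influential. Omitting the case B point changes nothing, even though both points have the same x-value and therefore the same leverage.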
19
Restricted-range Problem
When one of the variables is restricted (you only look at some of the values), the correlation can be surprisingly low. We will visit an example from the web, from David Lane: The demo video is found here:
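The effect of restricting the range can be sketched with a small Python example. The data are made up (a straight-line trend plus an alternating wobble) and are unrelated to the David Lane demonstration; `pearson_r` is a hypothetical helper. The same points are used twice, once over the full x-range and once over a narrow slice of it.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Made-up data (assumption): y = x with a wobble of +3 / -3 on alternate points.
x = list(range(20))
y = [xi + (3 if xi % 2 == 0 else -3) for xi in x]

r_full = pearson_r(x, y)

# Keep only the observations whose x-value lies in a narrow middle band.
kept = [(xi, yi) for xi, yi in zip(x, y) if 7 <= xi <= 12]
x_restricted = [xi for xi, _ in kept]
y_restricted = [yi for _, yi in kept]
r_restricted = pearson_r(x_restricted, y_restricted)

print(f"full range       r = {r_full:.2f}")
print(f"restricted range r = {r_restricted:.2f}")
```

The scatter around the trend is the same in both cases. Over the narrow band the trend accounts for much less of the total variation in y, so the correlation comes out lower.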
20
Working with Summary Statistics
The first graph shows what appears to be a strong, positive, linear association between weight (in pounds) and height (in inches) for men. The second graph shows that if, instead of data on individuals, we only had the mean weight for each height value, we would see an even stronger association, because the points are less scattered. Summary statistics vary less than individual values, so such a plot can give a false impression of how well a line summarizes the data. Predictions for individuals based on it may then be overestimates or underestimates.
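A minimal Python sketch of this effect, using invented heights and weights rather than the data behind the graphs (`pearson_r` is a hypothetical helper): three individuals are recorded at each height, and the correlation is computed once for the individuals and once for the mean weight at each height.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Made-up data (assumption): at each height, three men whose weights sit
# 10 lb below, exactly on, and 10 lb above a straight-line trend.
heights = list(range(64, 75))
individual_h = []
individual_w = []
for h in heights:
    trend = 130 + 4 * (h - 64)
    for offset in (-10, 0, 10):
        individual_h.append(h)
        individual_w.append(trend + offset)

# Summary version: one mean weight per height.
mean_w = []
for h in heights:
    weights_at_h = [w for hh, w in zip(individual_h, individual_w) if hh == h]
    mean_w.append(sum(weights_at_h) / len(weights_at_h))

r_individuals = pearson_r(individual_h, individual_w)
r_means = pearson_r(heights, mean_w)

print(f"individuals: r = {r_individuals:.2f}")
print(f"means:       r = {r_means:.2f}")
```

Averaging removes the person-to-person scatter, so the means line up almost perfectly even though individual weights at a given height still differ by 20 lb. The offsets here are deliberately symmetric, which is why the means fall exactly on the trend; real data would show the same effect less cleanly.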