Class 7: Thurs., Sep. 30

Outliers and Influential Observations
Outlier: Any really unusual observation.
–Outlier in the X direction (called a high leverage point): has the potential to influence the regression line.
–Outlier relative to the overall pattern of the scatterplot: an observation that deviates from the overall pattern of the relationship between Y and X; it typically has a residual that is large in absolute value.
Influential observation: a point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.
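A small simulated illustration of that last point (this is not the Philadelphia data; the numbers, seed, and sample size are made up for the sketch): a single point far out in the X direction with an unusual Y value can noticeably change the fitted slope.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 + 3 * x + rng.normal(0, 1, 30)   # true slope is 3

def fitted_slope(x, y):
    """Slope from a simple linear regression of y on x."""
    return sm.OLS(y, sm.add_constant(x)).fit().params[1]

# One extra point that is an outlier in the x direction with an unusual y value
x_out = np.append(x, 50.0)
y_out = np.append(y, 20.0)

print(fitted_slope(x, y))          # close to 3
print(fitted_slope(x_out, y_out))  # pulled well below 3 by the single added point
```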

Housing Prices and Crime Rates A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection through gains in tax revenue from higher property values. The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. The data are in philacrimerate.JMP. House Price = average house price for sales during the most recent year; Crime Rate = rate of crimes per 1,000 population.

Which points are influential? Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.

Excluding Observations from Analysis in JMP To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation. To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.

Formal measures of leverage and influence Leverage: "hat values" (JMP calls them Hats). Influence: Cook's Distance (JMP calls it Cook's D Influence). To obtain them in JMP, click Analyze, Fit Model, and put the Y variable in Y and the X variable in the Model Effects box. Click Run Model. After the model is fit, click the red triangle next to Response, click Save Columns, and then click Hats for the leverages and Cook's D Influence for the Cook's Distances. To sort observations by Cook's Distance or leverage, click Tables, Sort, and put the variable you want to sort by in the By box.
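Outside JMP, the same diagnostics can be computed with, for example, Python's statsmodels. This is a minimal sketch assuming the data have been exported to a CSV file called philacrimerate.csv with columns HousePrice and CrimeRate (the file name and column names are assumptions, not part of the JMP file as described above).

```python
import pandas as pd
import statsmodels.formula.api as smf

phila = pd.read_csv("philacrimerate.csv")        # assumed export of philacrimerate.JMP
fit = smf.ols("HousePrice ~ CrimeRate", data=phila).fit()

influence = fit.get_influence()
phila["hat"] = influence.hat_matrix_diag         # leverages ("Hats" in JMP)
phila["cooks_d"] = influence.cooks_distance[0]   # Cook's Distances ("Cook's D Influence")

# Analogous to Tables > Sort in JMP: largest Cook's Distance first
print(phila.sort_values("cooks_d", ascending=False).head())
```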

Center City Philadelphia has both high influence (Cook's Distance much greater than 1) and high leverage (hat value > 3*2/99 ≈ 0.06). No other observations have high influence or high leverage.

Rules of Thumb for High Leverage and High Influence High Leverage: Any observation with a leverage (hat value) greater than (3 × number of coefficients in the regression model)/n has high leverage, where the number of coefficients in the regression model is 2 for simple linear regression and n is the number of observations. High Influence: Any observation with a Cook's Distance greater than 1 has high influence.
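Continuing the hypothetical snippet above, the two rules of thumb translate directly into a couple of comparisons:

```python
n = len(phila)
p = 2  # coefficients in simple linear regression: intercept and slope

high_leverage = phila["hat"] > 3 * p / n   # leverage rule of thumb
high_influence = phila["cooks_d"] > 1      # Cook's Distance rule of thumb

# Observations flagged by either rule deserve a closer look
print(phila[high_leverage | high_influence])
```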

What to Do About Suspected Influential Observations? See the flowchart handout. Does removing the observation change the substantive conclusions? If not, you can say something like: "Observation X has high influence relative to all other observations, but we tried refitting the regression without Observation X and our main conclusions didn't change."

If removing the observation does change the substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
–If yes, omit the observation and proceed.
–If no, does the observation have high leverage (i.e., is it an outlier in the explanatory variable)?
–If it does have high leverage, omit the observation and proceed, and report that the conclusions only apply to a limited range of the explanatory variable.
–If it does not have high leverage, not much can be said; more data (or clarification of the influential observation) are needed to resolve the question.

General Principles for Dealing with Influential Observations General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.

The Question of Causation The community that ran this regression would like to increase property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection through gains in tax revenue from higher property values. In the linear fit without Center City Philadelphia, the slope is about -2,288.7: each additional crime per 1,000 population is associated with an average house price that is roughly $2,288.7 lower. The community concludes that if it can cut its crime rate from 30 down to 20 incidents per 1,000 population, it will increase its average house price by about $2,288.7 × 10 = $22,887. Is the community's conclusion justified?
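As a sketch of how the community's figure could be reproduced with the hypothetical data frame from the earlier snippets (the community-name column, here called "Community", is an assumption):

```python
# Refit without Center City Philadelphia, then compute the predicted change in
# mean house price for a drop in crime rate from 30 to 20 per 1,000 population.
refit = smf.ols("HousePrice ~ CrimeRate",
                data=phila[phila["Community"] != "Center City Philadelphia"]).fit()
predicted_change = refit.params["CrimeRate"] * (20 - 30)
print(predicted_change)   # about $22,887 according to the slide
```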

Potential Outcomes Model Let Y_i^30 denote what the house price for community i would be if its crime rate were 30, and let Y_i^20 denote what the house price for community i would be if its crime rate were 20. X (crime rate) causes a change in Y (house price) for community i if Y_i^30 ≠ Y_i^20. A decrease in crime rate from 30 to 20 causes an increase in house price for community i if Y_i^20 > Y_i^30.

Association is Not Causation A regression model tells us how the mean of Y given X is associated with changes in X. A regression model does not tell us what would happen if we actually changed X. Possible explanations for an observed association between Y and X:
1. Y causes X
2. X causes Y
3. There is a confounding variable Z that is associated with changes in both X and Y.
Any combination of the three explanations may apply to an observed association.

Y Causes X Perhaps it is changes in house price that cause changes in crime rate. When house prices increase, the residents of a community have more to lose by engaging in criminal activities; this is called the economic theory of crime.

Confounding Variables Confounding variable for the causal relationship between X and Y: A variable Z that is associated with both X and Y. Example of confounding variable in Philadelphia crime rate data: Level of education. Level of education may be associated with both house prices and crime rate. The effect of crime rate on house price is confounded with the effect of education on house price. If we just look at data on house price and crime rate, we can’t distinguish between the effect of crime rate on house price and the effect of education on house price.

Note on Confounding Variables and Lurking Variables The book’s distinction between lurking variable and confounding variable is confusing and the term “lurking variable” is not standard in statistics, whereas “confounding variable” is. So I will just use the term confounding variable in the rest of the course.

Examples of Confounding Variables Many studies have found that people who are active in their religion live longer than nonreligious people. Potential confounding variables?

Weekly Wages (Y) and Education (X) in March 1988 CPS Will getting an extra year of education cause an increase of $50.41 on average in your weekly wage? What are some potential confounding variables?

Math enrollment data: The residual plot vs. time indicates that there is a confounding variable associated with time. It turns out that one of the schools (say the engineering school) in the university changed its program to require that entering students take another mathematics course. The variable of whether the engineering school requires its students to take another mathematics course is a confounding variable.

Establishing Causation Best method is an experiment, but many times that is not ethically or practically possible (e.g., smoking and cancer, education and earnings).

Main strategy for learning about causation when we can't do an experiment: Consider all the confounding variables you can think of. Try to take them into account (we'll see how to do this when we study multiple regression in Chapter 11) and see whether the association between Y and X remains once the known confounding variables have been accounted for.
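For instance, if a measured confounder were available (say a hypothetical Education column giving the percent of adults with a college degree; this column is not part of the data set described above), the multiple regression strategy would look like the following sketch, continuing the earlier hypothetical snippets:

```python
# Compare the crime-rate coefficient before and after adjusting for the
# hypothetical confounder.
adjusted = smf.ols("HousePrice ~ CrimeRate + Education", data=phila).fit()
print(fit.params["CrimeRate"])       # association ignoring education
print(adjusted.params["CrimeRate"])  # association after accounting for education
```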

Other Criteria for Establishing Causation When We Can't Do An Experiment
1. The association is strong.
2. The association is consistent.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.