
Modeling: Variable Selection Request: “Estimate the annual maintenance costs attributable to annual mileage on a car. Dollars per thousand miles driven will suffice.” This sounds like a regression problem! Let’s sample some cars, and look at their costs and mileage over the past year.

The Results
(Regression output for Costs on Mileage.)
This all looks fine. And it’s wrong!

Here’s What the Computer Sees
What it doesn’t see is the age bias in the data: the cars to the left are mostly older cars, and the cars to the right are mostly newer. An un(age)biased chart would have some lower points on the left and some higher points on the right … and the regression line would be steeper.

Specification Bias
… arises when you leave out of your model a potential explanatory variable that (1) has its own effect on the dependent variable, and (2) covaries systematically with an included explanatory variable. The included variable then plays a double role, and its coefficient is a biased estimate of its pure effect. That’s why, when we seek to estimate the pure effect of one explanatory variable on the dependent variable, we should use the most complete model possible.
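A small simulation can make the bias concrete. The numbers below are invented for illustration (they are not the motorpool data): Age both raises Costs and covaries negatively with Mileage, so the short model's Mileage slope absorbs part of Age's effect.

```python
import numpy as np

# Hypothetical illustration of specification bias: Age raises maintenance
# Costs AND covaries with Mileage (older cars here are driven less), so a
# model that omits Age biases the Mileage slope.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(1, 10, n)                   # years
mileage = 15 - age + rng.normal(0, 1, n)      # thousand miles/yr; older -> fewer
costs = 30 * mileage + 50 * age + rng.normal(0, 20, n)  # true effect: $30/k-mile

# Short model: Costs ~ Mileage (Age omitted)
X_short = np.column_stack([np.ones(n), mileage])
b_short = np.linalg.lstsq(X_short, costs, rcond=None)[0]

# Full model: Costs ~ Mileage + Age
X_full = np.column_stack([np.ones(n), mileage, age])
b_full = np.linalg.lstsq(X_full, costs, rcond=None)[0]

print(f"Mileage slope, Age omitted:  {b_short[1]:.1f}")  # biased, far below 30
print(f"Mileage slope, Age included: {b_full[1]:.1f}")   # near the true 30
```

With Age omitted, the Mileage coefficient is pulled far below its true value of 30, which is exactly the "line not steep enough" pattern described above.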

Seeing the Man Who Isn’t There
Yesterday, upon the stair, / I met a man who wasn’t there. / He wasn’t there again today. / I wish, I wish he’d go away... — “Antigonish” (1899), Hughes Mearns
When doing a regression study in order to estimate the pure effect of some variable on the dependent variable, the first challenge in the real (non-classroom) world is deciding which variables to collect data for. The “man who isn’t there” can do you harm. Let’s return to the motorpool example, with Mileage as the only explanatory variable, and look at the residuals, i.e., the errors our current model makes in predicting for the individuals in the sample.

Learning from our Mistakes
Take the “residuals” output (for each car: Mileage, Costs, predicted Costs, residual, and Age). Sort the observations from largest to smallest residual, and see if something differentiates the observations near the top of the list from those near the bottom. If so, consider adding that differentiating variable to your model!
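The sorting step can be sketched in a few lines. The mini-dataset below is made up for illustration (costs are driven by both Mileage and an unmodeled Age effect), not the actual motorpool sample:

```python
import numpy as np

# Invented motorpool-style numbers: Mileage in thousand miles/yr,
# Age in years (the candidate variable we have NOT modeled yet).
mileage = np.array([ 8, 12, 15, 20, 25, 30])
age     = np.array([ 9,  7,  8,  2,  3,  1])
costs   = 30 * mileage + 40 * age          # Age secretly matters

# Fit the current one-variable model, Costs ~ Mileage, and get residuals.
X = np.column_stack([np.ones(len(mileage)), mileage])
b = np.linalg.lstsq(X, costs, rcond=None)[0]
residuals = costs - X @ b

# Sort from largest to smallest residual and eyeball the Age column.
order = np.argsort(residuals)[::-1]
for i in order:
    print(f"residual {residuals[i]:7.1f}   age {age[i]}")
```

If the cars near the top of the sorted list tend to be older, Age is the differentiating variable and belongs in the model; here the residuals correlate positively with Age.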

We Can Do This Repeatedly
Our new model regresses Costs on Mileage and Age. The significance levels are 2.9541% (constant), 0.0011% (Mileage), and 0.2841% (Age); the coefficient of determination is 81.22% (adjusted: 78.09%). After sorting on the new residuals, 3 of the top 4 and 5 of the top 7 cars (those with the greatest positive residuals) are Hondas, while 3 of the bottom 4 and 5 of the bottom 7 cars (those with the greatest negative residuals) are Fords. This might suggest adding “Make” as another new variable.
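A categorical variable like Make enters the regression as a dummy (indicator) variable. The sketch below uses simulated numbers, not the slide's data; following the sorted-residual pattern above, the hypothetical Hondas run above prediction by about $80/year:

```python
import numpy as np

# Sketch of adding "Make" as a dummy variable: honda = 1 for Honda, 0 for Ford.
# All numbers are invented; the true Honda premium is set to +$80/year.
rng = np.random.default_rng(1)
n = 60
mileage = rng.uniform(5, 30, n)
age = rng.uniform(1, 10, n)
honda = rng.integers(0, 2, n).astype(float)
costs = 30*mileage + 40*age + 80*honda + rng.normal(0, 15, n)

X = np.column_stack([np.ones(n), mileage, age, honda])
b, *_ = np.linalg.lstsq(X, costs, rcond=None)
print(f"Estimated Honda effect: {b[3]:.1f} dollars/year")  # near +80
```

The dummy's coefficient is read as the average cost difference between Hondas and Fords, holding Mileage and Age fixed.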

Why Not Just Include the Kitchen Sink?
– Spurious correlation: the Dow and the length of women’s skirts.
– Collinearity: for example, age and odometer miles are likely highly correlated, and the computer can’t decide what to attribute to each. Large standard errors on the coefficients lead to large significance levels, i.e., no evidence that either variable belongs. Yet if either is included alone, there is strong evidence that it belongs.
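The collinearity symptom is easy to reproduce. In this simulation (invented numbers), odometer miles are almost an exact multiple of age, and the standard error of the Age coefficient explodes when both variables are included:

```python
import numpy as np

# Collinearity sketch: odometer moves almost in lockstep with age, so the
# regression cannot apportion the effect between them.
rng = np.random.default_rng(2)
n = 100
age = rng.uniform(1, 10, n)
odometer = 12*age + rng.normal(0, 0.5, n)    # nearly a multiple of age
costs = 20*age + 2*odometer + rng.normal(0, 30, n)

def coef_se(X, y):
    """OLS coefficient standard errors: sqrt of diag of s^2 (X'X)^-1."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

se_both = coef_se(np.column_stack([np.ones(n), age, odometer]), costs)
se_age  = coef_se(np.column_stack([np.ones(n), age]), costs)
print(f"SE of Age coefficient, odometer included: {se_both[1]:.2f}")
print(f"SE of Age coefficient, alone:             {se_age[1]:.2f}")
```

With both collinear variables in the model, the Age standard error is many times larger than when Age stands alone, so neither coefficient looks significant even though the pair clearly predicts Costs.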