Modeling: Variable Selection


Modeling: Variable Selection Request: “Estimate the annual maintenance costs attributable to annual mileage on a car. Dollars per thousand miles driven will suffice.” This sounds like a regression problem! Let’s sample some cars, and look at their costs and mileage over the past year.

The Results This all looks fine. And it’s wrong!

Here’s What the Computer Sees What it doesn’t see is the age bias in the data: the cars to the left are mostly older cars, and the cars to the right are mostly newer ones. An un(age)biased chart would have some lower points on the left and some higher points on the right … and the regression line would be steeper.

Specification Bias … arises when you leave out of your model a potential explanatory variable that (1) has its own effect on the dependent variable, and (2) covaries systematically with an included explanatory variable. The included variable then plays a double role, and its coefficient is a biased estimate of its pure effect. That’s why, when we seek to estimate the pure effect of one explanatory variable on the dependent variable, we should use the most complete model possible.
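The double role of the included variable is easy to see in a small simulation. This is only a sketch with invented numbers, not the motorpool data: the true mileage effect (25 dollars per thousand miles), the age effect (60 dollars per year), and the negative age-mileage relationship are all assumptions made for illustration. Regressing cost on mileage alone recovers a badly biased slope; the full model recovers the pure effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical fleet: older cars are driven less (covariance with mileage)
# but cost more to maintain (their own effect on the dependent variable).
age = rng.uniform(1, 10, n)                        # years
mileage = 25 - 1.5 * age + rng.normal(0, 2, n)     # thousand miles per year
cost = 400 + 25 * mileage + 60 * age + rng.normal(0, 40, n)

# "Short" regression: cost on mileage only, with age omitted.
X_short = np.column_stack([np.ones(n), mileage])
b_short, *_ = np.linalg.lstsq(X_short, cost, rcond=None)

# "Full" regression: cost on both mileage and age.
X_full = np.column_stack([np.ones(n), mileage, age])
b_full, *_ = np.linalg.lstsq(X_full, cost, rcond=None)

print("mileage slope, age omitted :", round(b_short[1], 2))  # badly biased
print("mileage slope, age included:", round(b_full[1], 2))   # near the true 25
```

Because age pushes cost up while pulling mileage down, the omitted-variable bias here is severe enough to flip the sign of the estimated mileage effect.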

Seeing the Man Who Isn’t There

Yesterday, upon the stair,
I met a man who wasn’t there.
He wasn’t there again today.
I wish, I wish he’d go away...
Antigonish (1899), Hughes Mearns

When doing a regression study to estimate the pure effect of some variable on the dependent variable, the first challenge in the real (non-classroom) world is deciding which variables to collect data on. The “man who isn’t there” can do you harm. Let’s return to the motorpool example, with Mileage as the only explanatory variable, and look at the residuals, i.e., the errors our current model makes in predicting for the individual cars in the sample.

Learning from Our Mistakes Take the “residuals” output:

Costs   predicted   residual   Mileage
 643     725.06      -82.06      18.2
 613     689.39      -76.39      16.4
 673     762.70      -89.70      20.1
 531     530.90        0.10       8.4
 518     554.67      -36.67       9.6
 594     604.20      -10.20      12.1
 722     699.30       22.70      16.9
 861     780.53       80.47      21.0
 842     851.85       -9.85      24.6
 706     742.89      -36.89      19.1
 795     647.79      147.21      14.3
 776     691.38       84.62      16.5
 815     725.06       89.94      18.2
 571     616.09      -45.09      12.7
 673     711.19      -38.19      17.5

Sort the observations from largest to smallest residual, and see if something differentiates the observations near the top of the list from those near the bottom:

Costs   predicted   residual   Mileage
 795     647.79      147.21      14.3
 815     725.06       89.94      18.2
 776     691.38       84.62      16.5
 861     780.53       80.47      21.0
 722     699.30       22.70      16.9
 531     530.90        0.10       8.4
 842     851.85       -9.85      24.6
 594     604.20      -10.20      12.1
 518     554.67      -36.67       9.6
 706     742.89      -36.89      19.1
 673     711.19      -38.19      17.5
 571     616.09      -45.09      12.7
 613     689.39      -76.39      16.4
 643     725.06      -82.06      18.2
 673     762.70      -89.70      20.1

If something does (the slide’s sorted table also shows each car’s Age), consider adding that differentiating variable to your model!
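The sort-and-inspect step can be reproduced directly from the fifteen observations on the slide. A minimal sketch in plain NumPy, using only the Costs and Mileage columns:

```python
import numpy as np

# The 15 motorpool observations from the slide: annual maintenance cost ($)
# and annual mileage (thousands of miles).
cost = np.array([643, 613, 673, 531, 518, 594, 722, 861, 842, 706,
                 795, 776, 815, 571, 673])
mileage = np.array([18.2, 16.4, 20.1, 8.4, 9.6, 12.1, 16.9, 21.0, 24.6, 19.1,
                    14.3, 16.5, 18.2, 12.7, 17.5])

# Fit Cost = b0 + b1 * Mileage by least squares and compute the residuals.
X = np.column_stack([np.ones(len(cost)), mileage])
b, *_ = np.linalg.lstsq(X, cost, rcond=None)
residual = cost - X @ b

# Sort the observations from largest to smallest residual, as the slide says.
order = np.argsort(residual)[::-1]
for i in order:
    print(f"{cost[i]:4d}  {mileage[i]:5.1f}  {residual[i]:8.2f}")
```

The printed list matches the slide’s sorted table: the 795-dollar car tops the list with a residual near 147, and the 673-dollar, 20.1-thousand-mile car sits at the bottom near -90.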

We Can Do This Repeatedly Our new model adds Age:

Regression: Costs on constant, Mileage, Age

                                 constant    Mileage        Age
coefficient                      180.9150    26.6788    71.1309
std error of coef                 73.2707     3.7041    19.0376
t-ratio                            2.4691     7.2024     3.7363
significance                      2.9541%    0.0011%    0.2841%
beta-weight                                   1.0377     0.5383

standard error of regression      52.2696
coefficient of determination       81.22%
adjusted coef of determination     78.09%

After sorting on the new residuals, 3 of the top 4 and 5 of the top 7 cars (those with the greatest positive residuals) are Hondas, while 3 of the bottom 4 and 5 of the bottom 7 cars (those with the greatest negative residuals) are Fords. This suggests adding “Make” as another explanatory variable.
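A coefficient table like the one above (estimates, standard errors, t-ratios, R-squared) takes only a few lines of linear algebra. The underlying fleet data are not reproduced in the transcript, so this sketch simulates a sample to resemble the slide’s fitted model; the sample size, the age distribution, and the noise level are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

# Hypothetical fleet data, generated to resemble the slide's fitted model:
# Cost = 180.9 + 26.68*Mileage + 71.13*Age + noise (sd roughly 52).
mileage = rng.uniform(8, 25, n)
age = rng.integers(1, 6, n).astype(float)
cost = 180.9 + 26.68 * mileage + 71.13 * age + rng.normal(0, 52, n)

# Multiple regression with a coefficient table, as in the printout above.
X = np.column_stack([np.ones(n), mileage, age])
b, *_ = np.linalg.lstsq(X, cost, rcond=None)
resid = cost - X @ b
s2 = resid @ resid / (n - X.shape[1])                 # residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))    # standard errors
t = b / se                                            # t-ratios
r2 = 1 - (resid @ resid) / np.sum((cost - cost.mean()) ** 2)

for name, bi, si, ti in zip(["constant", "Mileage", "Age"], b, se, t):
    print(f"{name:8s}  coef={bi:9.4f}  se={si:8.4f}  t={ti:7.4f}")
print(f"coefficient of determination: {100 * r2:.2f}%")
```

Each t-ratio is the coefficient divided by its standard error; the significance column in the printout is just the two-sided p-value of that t-ratio.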

Why Not Just Include the Kitchen Sink?

Spurious correlation: some variables move together by coincidence rather than causation; the Dow and the length of women’s skirts is the classic example.

Collinearity: for example, age and odometer miles are likely highly correlated. The computer can’t decide how much of the effect to attribute to each, so the coefficients’ standard errors are large and the significance levels are large, which reads as no evidence that either variable belongs. Yet if either is included alone, there is strong evidence that it belongs.
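The collinearity symptom can be demonstrated numerically. In this hypothetical sketch (the 12-thousand-miles-per-year relationship and every coefficient are invented), odometer miles are nearly a deterministic function of age, so including both variables inflates the standard errors even though age alone is strongly significant:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Hypothetical fleet: odometer miles are almost exactly 12k per year of age.
age = rng.uniform(1, 10, n)
odometer = 12 * age + rng.normal(0, 1, n)      # thousands of miles
cost = 300 + 40 * age + rng.normal(0, 50, n)

def ols_se(X, y):
    """Coefficients and standard errors for y on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

b_age, se_age = ols_se(age.reshape(-1, 1), cost)
b_both, se_both = ols_se(np.column_stack([age, odometer]), cost)

# Age alone: tight standard error, large t-ratio. Age plus odometer: the
# two collinear variables split the credit and the standard errors balloon.
print("se of age coef, alone        :", round(se_age[1], 2))
print("se of age coef, with odometer:", round(se_both[1], 2))
```

With age alone the t-ratio is huge; with both variables included the standard errors grow by an order of magnitude, producing exactly the “no evidence either belongs” symptom described above.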