Advanced Quantitative Techniques Lab 7

Low Birth Weight Example
The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of particular importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy. (This dataset is from a famous study that led to important clinical recommendations.)

LIST OF VARIABLES:
ID: Identification Code
BWT: Birth Weight in Grams
LOW: Low Birth Weight (0 = Birth Weight >= 2500 g, 1 = Birth Weight < 2500 g)
AGE: Age of the Mother in Years
LWT: Weight in Pounds at the Last Menstrual Period
RACE: Race (1 = White, 2 = Black, 3 = Other)
SMOKE: Smoking Status During Pregnancy (1 = Yes, 0 = No)
PTL: History of Premature Labor (0 = None, 1 = One, etc.)
HT: History of Hypertension (1 = Yes, 0 = No)
UI: Presence of Uterine Irritability (1 = Yes, 0 = No)
FTV: Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.)
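A version of this dataset ships with Stata as lbw.dta (the Hosmer and Lemeshow data). If you do not have the lab's copy, the following sketch (assuming the shipped version matches the handout's variable list) loads and inspects it:

* Load the low birth weight data shipped with Stata
webuse lbw, clear
* Inspect variable names, labels, and summary statistics
describe
summarize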

Model Building
Step 1: Without looking at the data, record your expectations: what factors are likely to explain birth weight? Make a "wish list" of independent variables.
Step 2: Reconcile the wish list with the available data. Take note of variables that you cannot measure because they are not available (to gauge omitted variable bias). List those variables here.
Step 3: Create a list of the variables in your wish list that are available in the data (or have close proxies). Add any other variables that might reasonably be predictors of birth weight (you should test most variables), but eliminate variables that have no possible predictive power or that are circular. The variables that you keep are your candidate independent variables.

Step 4: Perform basic checks of the candidate variables
Any missing values or out-of-range data problems?

Create a dummy variable for race. Race cannot be included "as is" because it is a nominal variable; you need the dummy-variable transformation. In light of theory, I coded black = 1 and all other races = 0. Be sure to check that you coded this correctly.

gen black = .
replace black = 1 if race == 2
replace black = 0 if race == 1 | race == 3
sum bwt age lwt smoke ht ui ftv black
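To verify the coding (a quick check, not in the original handout), cross-tabulate the new dummy against race; every race == 2 observation should fall in the black == 1 column.

* Check the dummy against the original variable
tab race black, missing
* An equivalent one-line construction (black2 is a hypothetical name)
gen black2 = (race == 2) if !missing(race)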

Step 5: Build a correlation matrix that includes your dependent variable and your candidate independent variables
What did your check of the correlation matrix find? Which variables seem most highly correlated with birth weight? Does it look like you need to worry about multicollinearity? Do not include variables that you eliminated in Step 3 in the correlation matrix.

corr bwt age lwt smoke ht ui ftv black

pwcorr bwt age lwt smoke ht ui ftv black, obs sig

The most important difference between correlate and pwcorr is the way missing data are handled. With correlate, an observation (case) is dropped if any variable in the list has a missing value (listwise deletion). With pwcorr, an observation is dropped only if it has a missing value for the specific pair of variables being correlated (pairwise deletion). As a result, pwcorr is preferable in most cases because it uses all available data for each pair.
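A minimal illustration of the difference (the variable choice here is just for illustration): the obs option on pwcorr prints the number of observations used for each pair, which will exceed correlate's single listwise sample size whenever values are missing. In a dataset with no missing values, the two reports coincide.

* Listwise deletion: one common sample for all pairs
corr bwt lwt ftv
* Pairwise deletion: sample size reported per pair
pwcorr bwt lwt ftv, obs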

Step 6: Rank your independent variables based on logic, reasoning, or theory
Write down the order of entry based on your best guess given your knowledge of the field (this protects against specification error). If you are not sure, you can use the correlation results as a guide, but try to let reasoning and logic drive the order of entry.

Step 7: Add your first independent variable to the regression model
Show your bivariate model. Did it accord with your expectations? (A sketch appears below.)

Step 8: Check for regression violations for this bivariate model
Did you find any major violations?
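For example, if theory ranks the mother's weight at her last menstrual period (lwt) first (an illustrative choice, not one prescribed by the handout), the bivariate model and a quick residual check might look like this:

* Bivariate model with the first-ranked predictor
regress bwt lwt
* Residual-versus-fitted plot for the bivariate model
rvfplot, yline(0)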

Step 9: Sequentially build up the model, adding variables in the order you specified
(Do not check regression assumptions at each stage.) Add variables one by one. As you add variables:
- Drop variables that are insignificant, unless there is a strong theoretical reason to keep them.
- If a new variable is insignificant and makes an existing variable insignificant, just drop the new one.
- If the new variable is significant but adding it makes an old variable insignificant, keep both: theory led you to think the old variable was important, so keep it.
- Keep track of variables that are not significant; this is important to document.
Briefly document what you kept and what you dropped. (A sketch of this sequential process follows.)
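One way to keep the sequence organized (a sketch; the entry order shown is illustrative, not prescribed by the handout) is to store each fitted model and compare the results side by side:

* Add predictors one by one, storing each model (order is illustrative)
regress bwt lwt
estimates store m1
regress bwt lwt smoke
estimates store m2
regress bwt lwt smoke ht
estimates store m3
* Compare coefficients, significance stars, and fit across the sequence
estimates table m1 m2 m3, star stats(N r2 r2_a)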

regress bwt age lwt smoke ht ui ftv black, beta

Step 10: Recheck model assumptions for your final model
(You do NOT need to check assumptions for each variable you add; do this only for the bivariate model and your final model.) Discuss your final model, review the coefficient table in detail, and review the other key statistics. Also, briefly discuss whether the final model satisfied the regression assumptions overall. If not, what are some options for improving the model fit? (One possible battery of checks is sketched below.)
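After fitting the final model, the diagnostics used in the rest of this lab can be run as one battery (a sketch; each check is explained in the slides that follow):

* Refit the final model
regress bwt age lwt smoke ht ui ftv black
* Residuals versus fitted values (homoscedasticity, Diagnostics 3)
rvfplot, yline(0)
* Variance inflation factors (multicollinearity, Diagnostics 4)
vif
* Residuals against a predictor (linearity, Diagnostics 5)
predict res_final, residuals
scatter res_final lwt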

predict pr
list pr bwt in 1/10
predict res, residual
list res in 1/10

Residual-versus-predictor plot

regress bwt age lwt smoke ht ui ftv black, beta
rvpplot age

regress bwt age lwt smoke ht ui ftv black

Studentized Residuals
Studentized residuals are a type of standardized residual that can be used to identify outliers.

predict r, rstudent
sort r
list id r in 1/10
list id r in -10/l
list r id bwt age lwt smoke ht ui ftv black if abs(r) > 2
display 189*0.05

By chance alone we expect about 5% of observations to have |r| > 2 and about 1% to have |r| > 3, so finding roughly 189 * 0.05, or nine or ten, observations beyond |r| > 2 is unremarkable.

Leverage
Leverage is a measure of how far an observation's predictor values deviate from the mean of those variables. Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k is the number of predictors (7 in our example) and n is the number of observations (189 in our example).

predict lev, leverage
display (2*7+2)/189
list bwt age lwt smoke ht ui ftv black id lev if lev > .08465608

Cook's D
Cook's D is an overall measure of influence that combines information on the residual and the leverage. The lowest value that Cook's D can assume is zero, and the higher the Cook's D, the more influential the point. The conventional cut-off for undue influence from a single observation, as measured by Cook's D, is 4/n.

predict d, cooksd
list id bwt age lwt smoke ht ui ftv black d if d > 4/189

DFITS
DFITS is similar to Cook's D except that it scales differently; the two give similar answers. DFITS can be either positive or negative, with values close to zero corresponding to points with little or no influence. The conventional cut-off for DFITS is |DFITS| > 2*sqrt(k/n).

predict dfit, dfits
list id bwt age lwt smoke ht ui ftv black dfit if abs(dfit) > 2*sqrt(7/189)

We find that id = 226 is an observation that has both a large residual and large leverage. Such points are potentially the most influential. To gauge its influence, refit the model without it and compare the coefficients:

regress bwt age lwt smoke ht ui ftv black if id != 226
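Stata's leverage-versus-squared-residual plot shows both quantities at once and is one way to spot such points (a sketch; run it after fitting the full model, before excluding any observation; the mlabel(id) option just tags each point with its id for convenience):

regress bwt age lwt smoke ht ui ftv black
* Points toward the upper right combine high leverage with a large residual
lvr2plot, mlabel(id)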

Diagnostics 3: Checking Homoscedasticity of Residuals
A commonly used graphical method is to plot the residuals against the predicted (fitted) values. If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values. If the variance of the residuals is non-constant, the residual variance is said to be "heteroscedastic." We do this with the rvfplot command; the yline(0) option puts a reference line at y = 0.

rvfplot, yline(0)

We see that the band of data points gets a little narrower towards the right end, which is an indication of heteroscedasticity. In our case, the narrowing in the error bandwidth is minor.
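The visual impression can be checked with a formal test as well (an addition to the handout's commands): after the most recent regress, estat hettest runs the Breusch-Pagan test, whose null hypothesis is constant error variance.

* Breusch-Pagan / Cook-Weisberg test (H0: constant variance)
estat hettest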

Diagnostics 4: Checking for Multicollinearity
Multicollinearity will arise if we have put in too many variables that measure the same thing. A common rule of thumb is that VIF values should be below 10.

vif

Diagnostics 5: Checking Linearity (Bivariate Regression)
We will try to illustrate some of the techniques that you can use. For a bivariate relationship, overlay the scatterplot, the linear fit, and a lowess smoother; a lowess curve that departs clearly from the straight line signals nonlinearity.

twoway (scatter bwt lwt) (lfit bwt lwt) (lowess bwt lwt)

Diagnostics 5: Checking Linearity (Multiple Regression)
In a multiple regression, the most straightforward check is to plot the residuals against each of the predictor variables in the regression model. If there is a clear nonlinear pattern, there is a problem of nonlinearity. Otherwise, each plot should show just a random scatter of points.

scatter res age
scatter res lwt
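A complementary check (an addition to the handout's commands) is Stata's augmented component-plus-residual plot, which is often better than a raw residual plot at revealing the shape of a nonlinear relationship; run it after the most recent regress:

* Augmented component-plus-residual plot for lwt, with a lowess smoother
acprplot lwt, lowess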