1
Advanced Quantitative Techniques
Lab 7
2
Low Birth Weight Example
The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy. (This dataset is from a famous study that led to important clinical recommendations.)
3
LIST OF VARIABLES:
Identification Code - ID
Birth Weight in Grams - BWT
Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g) - LOW
Age of the Mother in Years - AGE
Weight in Pounds at the Last Menstrual Period - LWT
Race (1 = White, 2 = Black, 3 = Other) - RACE
Smoking Status During Pregnancy (1 = Yes, 0 = No) - SMOKE
History of Premature Labor (0 = None, 1 = One, etc.) - PTL
History of Hypertension (1 = Yes, 0 = No) - HT
Presence of Uterine Irritability (1 = Yes, 0 = No) - UI
Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.) - FTV
4
Model Building
Step 1: Without looking at the data, record your expectations: what factors are likely to explain birth weight? Make a "wish list" of independent variables.
Step 2: Reconcile the wish list with the available data. Take note of variables you cannot measure because they are not in the dataset (to gauge omitted variable bias). List those variables here.
Step 3: Create a list of the wish-list variables that are available in the data (or have close proxies). Add any other variables that might reasonably predict birth weight (you should test most variables), but eliminate variables that have no plausible predictive power or that are circular. The variables you keep are your candidate independent variables.
5
Step 4: Perform basic checks of the candidate variables
Any missing values or out-of-range data problems? Create a dummy variable for race. In light of theory, I coded black = 1 and other races = 0. Be sure to check that you coded this correctly. Race cannot be included "as is" because it is a nominal variable; you need the dummy variable transformation.

gen black = .
replace black = 1 if race == 2
replace black = 0 if race == 1 | race == 3
sum bwt age lwt smoke ht ui ftv black
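The slide asks you to verify the recoding; one quick way to do that (this crosstab is not shown on the slide, just a suggested check) is:

tab race black, missing
* every observation with race==2 should fall under black==1, and races 1 and 3 under black==0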
6
Step 5: Build a correlation matrix that includes your dependent variable and candidate independent variables. What did your check of the correlation matrix find? Which variables seem most highly correlated with birth weight? Does it look like you need to worry about multicollinearity? Do not include variables that you eliminated in Step 3 in the correlation matrix.

corr bwt age lwt smoke ht ui ftv black
7
pwcorr bwt age lwt smoke ht ui ftv black, obs sig
The most important difference between correlate and pwcorr is how missing data are handled. With correlate, an observation is dropped if any variable in the list has a missing value (listwise deletion). With pwcorr, an observation is dropped only if it has a missing value for the specific pair of variables being correlated (pairwise deletion). As a result, pwcorr is usually the better choice.
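A minimal sketch of the difference, assuming we deliberately set one value to missing so the contrast is visible (the variable and observation chosen here are arbitrary and used for illustration only):

preserve
replace age = . in 1      // introduce one missing value for illustration only
corr bwt age lwt          // listwise deletion: observation 1 is dropped from every correlation
pwcorr bwt age lwt, obs   // pairwise deletion: observation 1 is dropped only from pairs involving age
restore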
8
Step 6: Rank your independent variables based on logic, reasoning, or theory. Write down the order of entry based on your best guess given your knowledge of the field (protection against specification error). If you are not sure, you can use the correlation results as a guide, but try to let reasoning and logic drive the order of entry.
Step 7: Add your first independent variable to the regression model. Show your bivariate model. Did it accord with your expectations?
Step 8: Check for regression violations for this bivariate model (see the sketch below). Did you find any major violations?
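A minimal sketch of Steps 7 and 8, assuming lwt (weight at last menstrual period) is the variable you ranked first; your own ordering from Step 6 may differ:

regress bwt lwt
* basic checks of the bivariate model
rvfplot, yline(0)                                          // residuals vs. fitted values
twoway (scatter bwt lwt) (lfit bwt lwt) (lowess bwt lwt)   // linearity check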
9
Step 9: Sequentially build up the model, adding variables in the order you specified (do not check regression assumptions at each stage). Add variables one by one. As you add variables:
- Drop a new variable that is insignificant unless there is a strong theoretical reason to keep it.
- If an insignificant new variable makes an existing variable insignificant, just drop the new one.
- If the new variable is significant but adding it makes an old variable insignificant, keep both; theory led you to think the old variable was important, so keep it.
- Keep track of variables that are not significant; this is important to document.
Briefly document what you kept and what you dropped (a sketch of the sequence follows below).
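A minimal sketch of the sequential build-up, assuming an illustrative order of entry (lwt, then smoke, then ht, and so on); the order you actually use should come from Step 6:

regress bwt lwt
regress bwt lwt smoke
regress bwt lwt smoke ht
regress bwt lwt smoke ht ui
* ...continue adding candidates one at a time, applying the keep/drop rules above,
* and document which variables were kept and which were dropped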
10
regress bwt age lwt smoke ht ui ftv black, beta
11
Step 10: Recheck the model assumptions for your final model (you do NOT need to check assumptions for each variable you add; do this only for the bivariate model and your final model). Discuss your final model, review the coefficient table in detail, and review the other key statistics. Also, briefly discuss whether the final model satisfied the regression assumptions overall. If not, what are some options for improving the model fit?
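A minimal sketch of rechecking assumptions on the final model, collecting the diagnostic commands shown on the following slides (the residual variable name resfin is just for this sketch):

regress bwt age lwt smoke ht ui ftv black
rvfplot, yline(0)          // homoscedasticity: residuals vs. fitted values
vif                        // multicollinearity
predict resfin, residual
scatter resfin lwt         // linearity: residuals vs. a predictor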
12
predict pr                 // fitted (predicted) values from the most recent regression
list pr bwt in 1/10
predict res, residual      // residuals
list res in 1/10
13
Residual plot

regress bwt age lwt smoke ht ui ftv black, beta
rvpplot age                // residual-versus-predictor plot for age
14
regress bwt age lwt smoke ht ui ftv black
15
Studentized Residuals
Studentized residuals are a type of standardized residual that can be used to identify outliers.

predict r, rstudent
sort r
list id r in 1/10          // the 10 most negative studentized residuals
list id r in -10/l         // the 10 most positive studentized residuals
list r id bwt age lwt smoke ht ui ftv black if abs(r) > 2
display 189*0.05

Under normality, we expect roughly 5% of observations (about 9 or 10 here) to exceed 2 in absolute value and roughly 1% to exceed 3.
16
Leverage

Leverage measures how far an observation's predictor values deviate from the mean of the predictors.

predict lev, leverage

Generally, a point with leverage greater than (2k+2)/n should be carefully examined, where k = number of predictors (7 in our example) and n = number of observations (189 in our example).

display (2*7+2)/189
list bwt age lwt smoke ht ui ftv black id lev if lev > (2*7+2)/189
17
Cook's D

Cook's D is an overall measure that combines information on the residual and the leverage. The lowest value Cook's D can assume is zero, and the higher the Cook's D, the more influential the point. The conventional cut-off for undue influence from a single observation, as measured by Cook's D, is 4/n.

predict d, cooksd
list id bwt age lwt smoke ht ui ftv black d if d > 4/189
18
DFITS

DFITS is similar to Cook's D except that the two are scaled differently; they give similar answers. DFITS values can be either positive or negative, with values close to zero corresponding to points with little or no influence. The cut-off point for DFITS is 2*sqrt(k/n).

predict dfit, dfits
list id bwt age lwt smoke ht ui ftv black dfit if abs(dfit) > 2*sqrt(7/189)
19
We find that id = 226 is an observation with both a large residual and large leverage. Such points are potentially the most influential. Re-run the model without it to see how much the estimates change:

regress bwt age lwt smoke ht ui ftv black if id != 226
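One way to spot points like this graphically (this plot is not shown on the slide, but it is a standard Stata diagnostic) is the leverage-versus-squared-residual plot; influential observations such as id = 226 sit toward the upper right:

regress bwt age lwt smoke ht ui ftv black
lvr2plot, mlabel(id)       // label each point with its id to identify influential observations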
21
Diagnostics 3: Checking Homoscedasticity of Residuals
rvfplot, yline(0)

A commonly used graphical method is to plot the residuals versus the predicted (fitted) values. If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values. If the variance of the residuals is non-constant, the residuals are said to be "heteroscedastic." We do this with the rvfplot command; the yline(0) option puts a reference line at y = 0. We see that the pattern of the data points gets a little narrower toward the right end, which is an indication of heteroscedasticity. In our case the narrowing of the error bandwidth is minor.
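As a complement to the graphical check (this test is not on the slide), Stata also offers a formal test of constant variance after regress:

estat hettest              // Breusch-Pagan / Cook-Weisberg test; a small p-value indicates heteroscedasticity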
22
Diagnostics 4: Checking for Multicollinearity
Multicollinearity will arise if we have put in too many variables that measure the same thing. Check it with the vif command after running the regression; as a rule of thumb, VIF values should be below 10.

vif
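A minimal sketch, assuming vif is run immediately after fitting the final model:

regress bwt age lwt smoke ht ui ftv black
vif                        // values above 10 (or 1/VIF below 0.10) signal problematic multicollinearity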
23
Diagnostics 5: Checking Linearity
Bivariate regression: we will illustrate some of the techniques you can use. Overlay a scatterplot with the linear fit and a lowess smoother:

twoway (scatter bwt lwt) (lfit bwt lwt) (lowess bwt lwt)
24
Diagnostics 5: Checking Linearity
Multiple regression: the most straightforward approach is to plot the residuals against each of the predictor variables in the model. If there is a clear nonlinear pattern, there is a nonlinearity problem; otherwise, each plot should show just a random scatter of points.

scatter res age
scatter res lwt