Advanced Quantitative Techniques November 3, 2016

Lab 8

Today!
How to build a good model by playing with variables
Get comfortable throwing out assumptions and changing our minds
End early and get excited about the long weekend (and the election of our first female president)

Low Birth Weight Example
The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables were thought to be important: age, weight of the subject at her last menstrual period, race, and number of physician visits during the first trimester of pregnancy. (This dataset is from a famous study which led to important clinical recommendations.)

Open Stata
Download LBW.dta
Take a look at your variables
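A first look at the data might go something like the sketch below. It assumes LBW.dta has been saved in Stata's current working directory and that the variable names are lowercase, as they appear in the commands later in this lab.

```stata
* Assumption: LBW.dta sits in the current working directory.
use LBW.dta, clear

* Variable names, storage types, and labels.
describe

* Means, standard deviations, and min/max for every variable.
summarize

* Compact overview of the coded variables (unique values, range).
codebook race smoke ht ui, compact

* Eyeball the first few observations.
list in 1/5
```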

LIST OF VARIABLES:
Variable | Abbreviation
Identification Code | ID
Birth Weight in Grams | BWT
Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g) | LOW
Age of the Mother in Years | AGE
Weight in Pounds at the Last Menstrual Period | LWT
Race (1 = White, 2 = Black, 3 = Other) | RACE
Smoking Status During Pregnancy (1 = Yes, 0 = No) | SMOKE
History of Premature Labor (0 = None, 1 = One, etc.) | PTL
History of Hypertension (1 = Yes, 0 = No) | HT
Presence of Uterine Irritability (1 = Yes, 0 = No) | UI
Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.) | FTV

Model Building
Step 1: Without looking at the data, record expectations: what factors are likely to explain birth weight (make a “wish list” of independent variables)?
Step 2: Reconcile the “wish list” with the available data. Take note of variables that you can’t measure because they aren’t available (to gauge omitted variable bias). List those variables here.
Step 3: Create a list of the variables in your wish list that are available in the data (or have close proxies). Add any other variables that might reasonably be predictors of birth weight (you should test most variables), but eliminate variables that have no possible predictive power or that are circular. The variables that you keep are your candidate independent variables.

Step 4: Perform basic checks of the candidate variables
Any missing values or out-of-range data problems?
Create a dummy variable for race. In light of theory, I made black = 1 and all other races = 0. Be sure to check that you coded this correctly. Race cannot be included “as is” because it is a nominal variable, so you need the dummy variable transformation.
gen black=.
replace black=1 if race==2
replace black=0 if race==1|race==3
sum bwt age lwt smoke ht ui ftv black
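One way to check the coding is sketched below. It assumes LBW.dta is loaded and that black has already been created as a 0/1 dummy from race; the particular range checks are illustrative choices based on the variable list, not part of the original lab.

```stata
* Cross-tabulate the original and recoded variables. Every race==2 row
* should fall under black==1 and every other race under black==0.
tabulate race black, missing

* Stop with an error if any nonmissing race value is coded inconsistently.
assert black == (race == 2) if !missing(race)

* Range checks on the 0/1 indicators and on the outcome.
assert inlist(smoke, 0, 1) if !missing(smoke)
assert inlist(ht, 0, 1) if !missing(ht)
assert inlist(ui, 0, 1) if !missing(ui)
assert bwt > 0 if !missing(bwt)
```

If every assert passes, Stata prints nothing; a failure reports how many observations contradict the condition.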

Step 5: Build a correlation matrix which includes your dependent variable and candidate independent variables. What did your check of the correlation matrix find? Which variables seem most highly correlated with birth weight? Does it look like you need to worry about multicollinearity? Leave the variables that you eliminated in Step 3 out of the correlation matrix.
corr bwt age lwt smoke ht ui ftv black
The correlation is a number between -1 and +1 that measures how close the relationship between two variables is to being linear (i.e., forming a straight line if the two were graphed against each other). A correlation matrix is simply a grid of the correlations among all the variables.
Correlation = +1 means the variables are perfectly positively correlated (they go up and down in perfect synchronization; e.g., dollars of sales and sales tax). Correlation = -1 means perfect negative correlation (one goes up and the other goes down; e.g., items sold and inventory). Values close to 0 mean either that there is no relation or that the relation isn't linear.
Correlation is independent of scale; measuring one variable in millions and the other in millionths won't affect it.
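For reference, the quantity that corr reports is the Pearson sample correlation. A standard way to write it is given below; the notation (n paired observations x_i and y_i with means \bar{x} and \bar{y}, and constants a, b, c, d) is assumed here and does not come from the lab. The second line states the scale independence mentioned above.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1

r_{(a x + b),\,(c y + d)} = r_{xy} \quad \text{for any constants } a > 0,\; c > 0,\; b,\; d
```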

Or, a pairwise correlation
pwcorr bwt age lwt smoke ht ui ftv black, obs sig
The most important difference between correlate and pwcorr is the way in which missing data are handled. With correlate, an observation or case is dropped if any variable has a missing value; in other words, correlate uses listwise (also called casewise) deletion. pwcorr uses pairwise deletion, meaning that an observation is dropped only if there is a missing value for the pair of variables being correlated.
pwcorr versus corr
They differ in the way they deal with missing values. To compute a correlation you just need two variables, so if you ask for a matrix of correlations you could look at each pair of variables separately and include all observations that contain valid values for that pair (pwcorr). Alternatively, you could say that the entire list of variables defines your sample; in that case you would first remove all observations that contain a missing value on any of the variables in the list (corr).
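The difference is easiest to see on a tiny made-up dataset. The sketch below uses five invented observations on hypothetical variables x, y, and z (not the LBW data), with one missing value in y and one in z.

```stata
* Set the current data aside so the toy example does not overwrite it.
preserve
clear

* Five invented observations; "." marks a missing value.
input x y z
1 2 .
2 4 1
3 5 3
4 9 2
5 . 6
end

* Listwise deletion: only the 3 rows complete on x, y, and z are used.
corr x y z

* Pairwise deletion: x-y uses 4 rows, x-z uses 4 rows, y-z uses 3 rows.
* The obs option prints the number of observations behind each entry.
pwcorr x y z, obs

* Bring back the data that was in memory before the example.
restore
```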

If you’re not sure if you have missing values…
pwcorr bwt age lwt smoke ht ui ftv black if !missing(bwt,age,lwt,smoke,ht,ui,ftv,black)
You can use the -if- condition to let -pwcorr- behave like -corr-.
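To find out directly whether any candidate variable has missing values, something like the following could be run first. It assumes LBW.dta is loaded and black has been created as in Step 4; nmiss is a throwaway helper variable introduced only for this illustration.

```stata
* Report the count of missing values for each variable that has any.
misstable summarize bwt age lwt smoke ht ui ftv black

* Count, for each observation, how many of the listed variables are missing.
egen nmiss = rowmiss(bwt age lwt smoke ht ui ftv black)
tabulate nmiss

* Number of complete cases, i.e. the sample that corr would use.
count if nmiss == 0

drop nmiss
```

If every observation has nmiss equal to 0, corr and pwcorr will return the same matrix.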

Step 6: Rank your independent variables based on logic/reasoning or theory. Write down the order of entry based on your best guess given your knowledge of the field (protection against specification error). If you are not sure, you can use the correlation results as a guide, but try to let reasoning and logic drive the order of entry.
Step 7: Add your first independent variable to the regression model. Show your bivariate model. Did it accord with your expectations?
Step 8: Check for regression violations for this bivariate model. Did you find any major violations?
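Steps 7 and 8 might look roughly like this in Stata. The choice of lwt as the first variable is purely an assumption for illustration (your own ranking from Step 6 should decide it), and resid1 is a hypothetical variable name.

```stata
* Bivariate model: birth weight on the first-ranked predictor
* (lwt is used here only as an example).
regress bwt lwt

* Residuals versus fitted values: look for curvature or a funnel shape.
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest

* Save the residuals and compare them with a normal distribution.
predict resid1, residuals
qnorm resid1
drop resid1
```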

Step 9: Sequentially build up the model, adding variables one by one in the order you specified (don’t check regression assumptions at each stage). As we add variables:
Drop variables that are insignificant unless there is a strong theoretical reason to keep them.
If a new variable is insignificant and makes an existing variable insignificant, just drop the new one.
If the new variable is significant but adding it makes an old variable insignificant, keep both. Theory led you to think the old variable was important, so keep it.
Keep track of variables which are not significant. This is important to document.
Briefly document what you kept and what you dropped.

regress bwt age lwt smoke ht ui ftv black, beta
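The command above fits all candidates at once. A sequential build could be sketched as follows; the order of entry (lwt, then smoke, then black) and the names m1 to m3 are assumptions made only for illustration.

```stata
* Fit the models one variable at a time and store each set of results.
regress bwt lwt
estimates store m1

regress bwt lwt smoke
estimates store m2

regress bwt lwt smoke black
estimates store m3

* Side-by-side coefficients and standard errors, with sample size,
* R-squared, and adjusted R-squared for each stage.
estimates table m1 m2 m3, b(%9.2f) se stats(N r2 r2_a)
```

The combined table makes it easy to see whether adding a variable changes the size or significance of the coefficients already in the model, which is what the keep-or-drop rules in Step 9 depend on.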

Step 10: Recheck the model assumptions for your final model. (You do NOT need to check assumptions for each variable you add; only do this for the bivariate model and your final model.) Discuss your final model, and review the coefficient table and the other key statistics in detail. Also, briefly discuss whether the final model satisfied the regression assumptions overall. If not, what are some options for improving the model fit?
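A possible set of checks for the final model is sketched below. The specification shown (lwt, smoke, black) is a hypothetical final model, not a result from the lab; substitute whatever variables you kept.

```stata
* Hypothetical final model; replace the variable list with your own.
regress bwt lwt smoke black

* Variance inflation factors as a check on multicollinearity.
estat vif

* Residuals versus fitted values for nonlinearity and unequal variance.
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity.
estat hettest

* Ramsey RESET test for omitted nonlinear terms.
estat ovtest
```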

Reminder on reading Stata Output
SS: Sum of Squares associated with the three sources of variance: Model, Residual, and Total.
MS: Mean Squares, the SS divided by the respective degrees of freedom. The MS values represent the model, residual (error), and total (sample) variance respectively.
F-statistic: the MS Model divided by the MS Residual; the numbers in parentheses are the respective df.
Prob > F: the p-value associated with the F-statistic. It tests the hypothesis that all the model coefficients (other than the constant) are 0.
R-squared: the proportion of variance in y explained by the independent variables.
Adjusted R-squared: R-squared with a penalty for adding extraneous variables. It is never greater than R-squared and increases only if the addition of one more explanatory variable improves the model more than what would be expected by chance.
Root MSE: the square root of the MS Residual. This is the standard deviation of the residuals.
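These relationships can be verified from the results that regress stores in e(). The sketch below assumes LBW.dta is loaded and uses an arbitrary example model; each line recomputes a statistic from the sums of squares and prints Stata's stored value next to it.

```stata
* Any regression will do; this model is only an example.
quietly regress bwt lwt smoke

* MS = SS divided by its degrees of freedom.
display "MS Model    = " (e(mss)/e(df_m))
display "MS Residual = " (e(rss)/e(df_r))

* F = MS Model / MS Residual.
display "F           = " ((e(mss)/e(df_m))/(e(rss)/e(df_r))) "  stored: " (e(F))

* R-squared = SS Model / SS Total, where SS Total = SS Model + SS Residual.
display "R-squared   = " (e(mss)/(e(mss)+e(rss))) "  stored: " (e(r2))

* Adjusted R-squared = 1 - (1 - R-squared)*(n - 1)/(residual df).
display "Adj R-sq    = " (1-(1-e(r2))*(e(N)-1)/e(df_r)) "  stored: " (e(r2_a))

* Root MSE = square root of MS Residual.
display "Root MSE    = " (sqrt(e(rss)/e(df_r))) "  stored: " (e(rmse))
```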

More monkeying with data
If you’d like to stick around, we can run through another dataset…
Download NYC_Community District
CD: community district
Disamen: neighborhood disamenities (waste transfer sites, wastewater facilities, etc.)
Garbage: waste management facilities
Percent persons with income below the poverty level
Percent white alone persons
Percent Hispanic or Latino persons
Percent black or African American persons
Percent Asian alone persons
Percent foreign-born persons
Owned units
Gini coefficient
Parks: number of parks
