Topics Introduction to Stata – Files / directories – Stata syntax – Useful commands / functions Logistic regression analysis with Stata – Estimation – GOF – Coefficients – Checking assumptions
Introduction to Stata Note: we did this interactively for the larger part …
Stata file types
.ado – Programs that add commands to Stata
.do – Batch files that execute a set of Stata commands
.dta – Data files in Stata's format
.log – Output saved as plain text by the log using command
The working directory The working directory is the default directory for any file operations such as using and saving data, or logging output. A path that contains spaces needs straight double quotes: cd "d:\my work\"
Saving output to log files Syntax for the log command – log using filename [, append replace [smcl|text]] To close a log file – log close
Using and saving datasets Load a Stata dataset – use d:\myproject\data.dta, clear Save – save d:\myproject\data, replace Using change directory – cd d:\myproject – use data, clear – save data, replace (Stata commands are case-sensitive: type use, not Use)
Entering data Data in other formats – You can use SPSS to convert data – You can use the infile and insheet commands to import data in ASCII format Entering data by hand – Type edit or just click on the data-editor button
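Example: a minimal sketch of importing a comma-separated text file. The file name mydata.csv is hypothetical and is assumed to sit in the working directory with variable names in its first row. insheet is the command named above; import delimited is its replacement in Stata 13 and later.

```stata
* Hypothetical file: mydata.csv in the working directory, names in row 1
insheet using mydata.csv, comma names clear

* Stata 13 and later: import delimited does the same job
import delimited using mydata.csv, varnames(1) clear

* Check that variable names and storage types came through as expected
describe
```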
Do-files You can create a text file that contains a series of commands Use the do-editor to work with do-files Example I
Adding comments // or * denote comments that Stata should ignore. Stata also ignores whatever follows /// and treats the next line as a continuation of the current command. Example II
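A short sketch of the three comment styles, using the auto dataset that ships with Stata (an assumption, not the data used in class). Run it from a do-file: // and /// are not understood in the Command window.

```stata
* A line starting with an asterisk is a comment
sysuse auto, clear

// A whole-line comment can also start with two slashes
summarize price mpg   // a comment may follow a command after a space

regress price mpg weight ///  text after three slashes is ignored
    if foreign == 0, robust
```

The regress command spans two lines; Stata reads them as one command because of the ///.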
A recommended structure
capture log close // if a log file is open, close it; otherwise disregard
set more off // don't pause when output scrolls off the page
cd d:\myproject // change directory to your working directory
log using myfile, replace text // log results to file myfile.log
… here you put the rest of your Stata commands …
log close // close the log file
Serious data analysis Ensure replicability: use do-files and log files Document your do-files – What is obvious today is baffling in six months Keep a research log – A diary that includes a description of every program you run Develop a system for naming files
Serious data analysis New variables should be given new names Use labels and notes Double check every new variable ARCHIVE
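A minimal sketch of these habits, using the auto dataset shipped with Stata. The variable names price1000 and expensive and the label name yesno are made up for illustration.

```stata
sysuse auto, clear

* A new variable gets a new name; the original price is left untouched
generate price1000 = price / 1000
label variable price1000 "Price in thousands of dollars"

* Value labels for a new indicator variable
generate byte expensive = price > 6000 if !missing(price)
label define yesno 0 "No" 1 "Yes"
label values expensive yesno

* A note documents how the variable was built
notes expensive: 1 if price > 6000; created for illustration

* Double-check the new variable against its source
tabulate expensive, missing
summarize price if expensive == 1
```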
Stata syntax examples
The Stata syntax
regress y x1 x2 if x3 < 20, cluster(x4)
1. regress = Command – What action do you want to perform?
2. y x1 x2 = Names of variables, files or other objects – On what things is the command performed?
3. if x3 < 20 = Qualifier on observations – On which observations should the command be performed?
4. , cluster(x4) = Options – What special things should be done in executing the command?
Examples tabulate smoking race if agemother > 30, row Example of the if qualifier – sum agemother if smoking == 1 & weightmother < 100
Elements used for logical statements
Operator   Definition                  Example
==         Equal to                    if male == 1
!=         Not equal to                if male != 1
>          Greater than                if age > 20
>=         Greater than or equal to    if age >= 21
<          Less than                   if age < 66
<=         Less than or equal to       if age <= 65
&          And                         if age == 21 & male == 1
|          Or                          if age <= 21 | age >= 65
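The operators can be tried out on the auto dataset shipped with Stata (an assumption; any dataset works). Parentheses make the order of & and | explicit.

```stata
sysuse auto, clear

* command + variables + if qualifier + option
tabulate foreign rep78 if price > 5000, row

* Two conditions that must both hold
summarize mpg if foreign == 1 & weight < 2500

* Either condition may hold; missing values are excluded explicitly
count if (rep78 <= 2 | rep78 >= 5) & !missing(rep78)
```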
Missing values Automatically excluded when Stata fits models; they are stored as the largest positive values Beware! – The expression age > 65 is therefore also true for observations with missing age – To be safe, type: age > 65 & age != .
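A quick check of this pitfall, sketched with the auto dataset shipped with Stata, in which rep78 has some missing values:

```stata
sysuse auto, clear

* Missing rep78 counts as "greater than 4" because missing is stored as a very large value
count if rep78 > 4

* Excluding the system missing value explicitly
count if rep78 > 4 & rep78 != .

* missing() also catches the extended missing values .a to .z
count if rep78 > 4 & !missing(rep78)
```

The first count is larger than the other two by the number of observations with missing rep78.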
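Selecting variables and observations – drop varlist (removes the listed variables) – keep varlist (keeps only the listed variables) – drop if age < 65 (removes observations that meet the condition)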
Creating new variables generate command – generate age2 = age * age – generate accepts many functions; see help functions – Sometimes the command egen is a useful alternative, for instance – egen meanage = mean(age)
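A sketch of generate next to egen, using the auto dataset shipped with Stata. The new variable names are made up for illustration.

```stata
sysuse auto, clear

* generate works observation by observation
generate weight2 = weight * weight

* egen computes a statistic across observations
egen meanprice = mean(price)

* Combined with bysort, egen computes the statistic within groups
bysort foreign: egen meanprice_grp = mean(price)

* The group mean can then be used in an ordinary generate
generate price_dev = price - meanprice_grp
```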
Useful functions
Function   Definition        Example
+          Addition          gen y = a+b
-          Subtraction       gen y = a-b
/          Division          gen density = population/area
*          Multiplication    gen y = a*b
^          Take to a power   gen y = a^3
ln         Natural log       gen lnwage = ln(wage)
exp        Exponential       gen y = exp(b)
sqrt       Square root       gen agesqrt = sqrt(age)
Replace command replace has the same syntax as generate but is used to change values of a variable that already exists – gen age_dum = . – replace age_dum = 0 if age < 5 – replace age_dum = 1 if age >= 5 & age != .
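The same pattern sketched on the auto dataset shipped with Stata. The variable names heavy and heavy2 and the cut-off of 3000 are made up for illustration.

```stata
sysuse auto, clear

* Start with an all-missing variable, then fill it in step by step
generate byte heavy = .
replace heavy = 0 if weight < 3000
replace heavy = 1 if weight >= 3000 & !missing(weight)

* One-line alternative: a logical expression evaluates to 0 or 1
generate byte heavy2 = weight >= 3000 if !missing(weight)

* Both approaches should give the same result; assert stops with an error if not
assert heavy == heavy2
```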
Recode Change values of existing variables – Change 1 to 2 and 3 to 4: recode origvar (1=2)(3=4), gen(myvar1) – Change missings to 1: recode origvar (.=1), gen(myvar2) – The name in gen() must be new; naming an existing variable there gives an error
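A recode sketch on the auto dataset shipped with Stata. The new variable names and the code 9 for missing are made up for illustration.

```stata
sysuse auto, clear

* Collapse five categories into three and keep the original variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Give missing values an explicit code in a new variable
recode rep78 (. = 9), generate(rep78_nomiss)

* Cross-tabulate old against new to check the recoding
tabulate rep78 rep3, missing
```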
Logistic regression
Logistic regression Let's use a set of data collected by the state of California from 1200 high schools measuring academic achievement. Our dependent variable is called hiqual. Our predictor variable will be avg_ed, a continuous measure of the average education (ranging from 1 to 5) of the parents of the students in the participating high schools.
OLS in Stata
Logistic regression in Stata
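A minimal sketch of the two models, assuming apilog.dta (the file named in the exercise) is in the working directory and contains hiqual and avg_ed as described above.

```stata
* Assumption: apilog.dta sits in the current working directory
use apilog, clear

* OLS on a 0/1 outcome (linear probability model), for comparison
regress hiqual avg_ed

* Logistic regression; coefficients are on the log-odds scale
logit hiqual avg_ed

* Replay the same model with odds ratios instead of coefficients
logit, or
```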
Multiple predictors
MODEL FIT Consider model fit using: 1)The likelihood ratio test 2)The pseudo-R2 (proportional change in log-likelihood) 3)The classification table
Model fit: the likelihood ratio test
Model fit: LR test
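A sketch of a likelihood ratio test between two nested models. It assumes apilog.dta is in the working directory; the extra predictors yr_rnd and meals are assumed to exist in that file and are used only for illustration.

```stata
use apilog, clear

* Larger model first, so the smaller one can be fitted on the same sample
logit hiqual avg_ed yr_rnd meals
estimates store full

* Restricted model on exactly the observations used above
logit hiqual avg_ed if e(sample)
estimates store restricted

* LR test: does the full model fit significantly better?
lrtest full restricted
```

Fitting both models on the same observations matters, because lrtest compares log-likelihoods and these are only comparable for identical samples.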
Pseudo R2: proportional change in LL
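The pseudo-R2 reported by logit can be reproduced from the stored log-likelihoods. A sketch, assuming apilog.dta is in the working directory:

```stata
use apilog, clear
logit hiqual avg_ed

* Log-likelihood of the constant-only model and of the fitted model
display "LL null model   = " e(ll_0)
display "LL fitted model = " e(ll)

* Proportional change in log-likelihood
display "Pseudo R2       = " 1 - e(ll)/e(ll_0)

* The value Stata stores after logit
display "Stored by Stata = " e(r2_p)
```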
Classification Table
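A sketch of the postestimation command for the classification table, assuming apilog.dta is in the working directory. The cut-off of 0.3 in the second call is an arbitrary illustration.

```stata
use apilog, clear
logit hiqual avg_ed

* Classification table with the default cut-off of 0.5
estat classification

* Same table with a different cut-off for predicting a positive outcome
estat classification, cutoff(0.3)
```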
Hosmer & Lemeshow The test divides the sample into subgroups and checks whether observed and predicted frequencies are about equal in these groups The test should not be significant (a non-significant result indicates no difference between observed and predicted)
Goodness of fit: Hosmer & Lemeshow Legend: average probability in the j-th group
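For reference, the Hosmer & Lemeshow statistic is usually written as below. This is the standard textbook form, not necessarily the notation of the original slide. Here g is the number of groups, n_j the number of observations in group j, O_j the observed number of positive outcomes in group j, and \bar{\pi}_j the average predicted probability in the j-th group.

```latex
\hat{C} \;=\; \sum_{j=1}^{g}
  \frac{\left(O_j - n_j \bar{\pi}_j\right)^2}
       {n_j \,\bar{\pi}_j \left(1 - \bar{\pi}_j\right)}
```

Under the null hypothesis of a well-fitting model, the statistic is approximately chi-squared distributed with g - 2 degrees of freedom.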
First logistic regression
Then postestimation command
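A sketch of the two steps, assuming apilog.dta is in the working directory. Ten groups is the conventional choice, not a requirement.

```stata
use apilog, clear

* Step 1: fit the logistic regression
logit hiqual avg_ed

* Step 2: Hosmer & Lemeshow test with 10 groups; table shows observed and expected counts per group
estat gof, group(10) table
```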
Including interaction term helps...
... as you can see here Ok now
Interpreting coefficients
Interpreting coefficients: significance
Interpretation of coefficients: direction
Interpretation of coefficients: Magnitude
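Magnitude is easier to read from odds ratios or predicted probabilities than from log-odds. A sketch, assuming apilog.dta is in the working directory and avg_ed runs from 1 to 5 as described earlier:

```stata
use apilog, clear
logit hiqual avg_ed

* Odds ratio: multiplicative change in the odds per one-unit increase in avg_ed
logit, or

* Predicted probability of hiqual = 1 at avg_ed = 1, 2, 3, 4, 5
margins, at(avg_ed = (1(1)5))

* Average marginal effect of avg_ed on the predicted probability
margins, dydx(avg_ed)
```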
Ok now
Multicollinearity
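One possible check, sketched under assumptions: apilog.dta is in the working directory, and yr_rnd and meals are assumed predictor names used only for illustration. Variance inflation factors depend only on the predictors, so a common workaround is to obtain them from an OLS regression with the same right-hand side.

```stata
use apilog, clear

* Pairwise correlations among the predictors
correlate avg_ed yr_rnd meals

* VIFs from an auxiliary OLS regression with the same predictors
regress hiqual avg_ed yr_rnd meals
estat vif
```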
Influential observations
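A sketch of common influence diagnostics after logit, assuming apilog.dta is in the working directory. The new variable names phat, dbeta, dx2 and lev are made up for illustration.

```stata
use apilog, clear
logit hiqual avg_ed

* Predicted probability, Pregibon's delta-beta, change in Pearson chi-squared, leverage
predict phat, pr
predict dbeta, dbeta
predict dx2, dx2
predict lev, hat

* Plot influence against the predicted probability
scatter dbeta phat

* List the five observations with the largest influence
gsort -dbeta
list hiqual avg_ed phat dbeta in 1/5
```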
To do Perform a logistic regression analysis Use apilog.dta Awards = dependent variable