Published by Norah Skinner. Modified over 9 years ago.
1
Automating Your Work: An Introduction to Programming in Stata
Shawna N. Smith
29 July 2009
2
…but why?
GSS Mental Health Replication Study
– Respondents received one of four different vignettes: depression, schizophrenia, alcohol abuse, normal troubles
– 38 outcomes [binary]
– Two waves of data: 1996 & 2006
First question: Is there a survey year difference?
– 4 vignettes x 38 outcomes = 152 potential differences
3
Roadmap
Writing effective do-files [Review]
Automation
– Macros
– Using stored info
– foreach and forvalues loops
– Ado-files {brief preview}
4
The Workflow of Data Analysis: Principles and Practices
By J. Scott Long
Much of this talk is from Chapter 4: Automating your work
For example files: type findit workflow and follow the instructions
5
[aside] Writing effective do-files
Robust: To be robust, a do-file must produce exactly the same result when run at a later time or on another computer.
Legible: To be legible, a do-file must be documented and formatted so that it is easy to understand what is being done.
6
Robust
Self-contained
Include version control
Exclude directory information
– Never hardcode your directory! Rather, set your working directory before you start your work
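The three rules above can be sketched as a do-file skeleton. This is only an illustration, not a template from the talk: the log name wf-example and the use of Stata's bundled auto dataset are assumptions chosen so that the file runs on any machine.

```stata
* wf-example.do: hypothetical self-contained do-file (all names illustrative)
capture log close                   // close any log left open by earlier work
log using wf-example, replace text
version 10                          // version control: run under Stata 10 rules
clear all
set more off

* No directory information: files are read from and written to the
* working directory, which is set before the do-file is run.
sysuse auto, clear                  // bundled example data (assumption)
summarize price mpg

log close
```

Because nothing in the file depends on earlier commands or on a particular folder, it should give the same results when rerun later or on another computer.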
7
Legible
Use comments
Use alignment and indentation
Use short lines [<80 characters]
Limit the use of abbreviations
8
Automating Your Work
9
Macros
A macro assigns a string of text or a number to an abbreviation.
Two types of macros: local & global
Global
– Persists until you delete it or exit Stata
– Can lead to do-files that unintentionally depend on a global macro created by another do-file
– Such do-files are not robust and can lead to unpredictable results
Local
– Can only be used within the do-file or ado-file in which it is defined
– When that program ends, the local macro disappears
Macros are the simplest tool for automating your work.
10
Syntax
local local-name "string"
– local rhs "var1 var2 var3"
– display "The local rhs contains: `rhs'"
local local-name = expression
– local ncases = 198
– display "The local ncases equals: `ncases'"
With the equals sign, expression is limited to 80 characters; without, "string" is limited to 67,784 characters. (These limits depend on the Stata version and flavor.) It is usually better to use "string".
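The difference between the two forms can be seen in a short sketch. The macro names total and formula are made up for this illustration.

```stata
* Expression form: the right-hand side is evaluated before it is stored
local total = 2 + 2
display "total contains: `total'"

* String form: the text is stored exactly as typed
local formula "2 + 2"
display "formula contains: `formula'"

* A stored string is pasted in wherever the macro is referenced,
* so display evaluates it here as if 2 + 2 had been typed
display `formula'
```

The first display shows 4 and the second shows the text 2 + 2. The last line shows 4 because the contents of formula are substituted before display runs.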
11
Here is a simple example. I want to estimate the model:
    . logit y var1 var2 var3
I can create the macro rhs with the names of the independent or right-hand-side variables:
    . local rhs "var1 var2 var3"
Then, I can write the logit command as:
    . logit y `rhs'
where the ` and ' indicate that I want to insert the contents of the macro rhs. That is, the command
    . logit y `rhs'
works exactly the same as
    . logit y var1 var2 var3
12
13
Macros can be combined to specify a sequence of nested models. First, I create macros for four groups of independent variables:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
Next, I specify four nested models. The first model includes only the first set of variables and is specified as:
    . local model_1 "`set1_age'"
The macro model_2 combines the content of the local model_1 with the variables in local set2_educ:
    . local model_2 "`model_1' `set2_educ'"
The next two models are specified the same way:
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"
14
Next, I check the variables in each model:
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
Using these locals, I estimate a series of logits:
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'
15
The whole thing:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
    . local model_1 "`set1_age'"
    . local model_2 "`model_1' `set2_educ'"
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'
16
Automating Your Work
17
Saved results
Stata commands send results to your log file but also save those results in memory.
Drukker's Dictum: Never type anything that you can obtain from a saved result.
This information can be moved into macros and matrices, and used in many ways.
18
Consider a simple example using -prvalue-. Use -prvalue- to calculate the discrete change for a one-SD change in age, centered on the mean. [The old way…]
    . sum age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60

    . di 42.53785 + (8.072574/2)
    46.574137
    . di 42.53785 - (8.072574/2)
    38.501563
    . qui prvalue, x(age=38.501563) rest(mean) save label(SD-)
    . prvalue, x(age=46.574137) rest(mean) dif label(SD+)
    :::
19
A simpler [& more robust] way:
    . local c "age"
    . sum `c'

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       753    42.53785    8.072574         30         60

    . return list
    scalars:
        r(N)     = 753
        r(sum_w) = 753
        r(mean)  = 42.53784860557769    // scalar for mean of age
        r(Var)   = 65.16645121641095
        r(sd)    = 8.072574014303674    // scalar for sd of age
        r(min)   = 30
        r(max)   = 60
        r(sum)   = 32031
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
    :::
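The saved-results step can be tried without the talk's data or the user-written -prvalue- command. The sketch below assumes Stata's bundled auto dataset and uses mpg in place of age; only the r() results and the two locals follow the slide.

```stata
sysuse auto, clear                  // bundled data; stands in for the talk's dataset
local c "mpg"
quietly summarize `c'
return list                         // lists r(mean), r(sd), and the other results

* Endpoints of a one-SD change centered on the mean
local sdup = r(mean) + (r(sd)/2)
local sddn = r(mean) - (r(sd)/2)
display "`c': from `sddn' to `sdup'"
```

Changing the variable only requires editing the local c line. Nothing is retyped from the output.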
20
Question: I discover a problem with my age variable & decide to change my local c to income. Which parts of the above code do I need to change if: [1] I 'hardcoded' my numbers; & [2] I used the locals & scalars?
21
Automating Your Work
22
foreach and forvalues loops
Loops let you execute a group of commands multiple times.
By combining macros with loops, you can speed up tasks ranging from creating variables to estimating models.
Loops can be used in many ways that make your workflow faster and more accurate. For example:
– Creating interaction variables
– Using the same command for multiple variables
– Using information returned by Stata for other purposes
23
Syntax: foreach
    foreach local-name in | of list-type list {
        commands referring to `local-name'
    }
– foreach name in var1 var2 var3 {
– foreach var of varlist var1-var10 {
24
Syntax: forvalues
    forvalues lname = range {
        commands referring to `lname'
    }
– forvalues nage = 40(5)80 {
– forvalues n = 1(.1)100 {

Syntax        Meaning                              Example     Generates
#1(#d)#2      From #1 to #2 in steps of #d         1(2)10      1, 3, 5, 7, 9
#1/#2         From #1 to #2 in steps of 1          1/10        1, 2, 3, ..., 10
#1 #t to #2   From #1 to #2 in steps of (#t-#1)    1 4 to 15   1, 4, 7, 10, 13
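The three range forms in the table can be checked with a small sketch. The macro name seen is made up for this illustration; it collects the values each loop visits.

```stata
* Collect the values that each range form generates
local seen
forvalues i = 1(2)10 {
    local seen `seen' `i'
}
display "1(2)10    -> `seen'"

local seen
forvalues i = 1/10 {
    local seen `seen' `i'
}
display "1/10      -> `seen'"

local seen
forvalues i = 1 4 to 15 {
    local seen `seen' `i'
}
display "1 4 to 15 -> `seen'"
```

Typing local seen with nothing after the name empties the macro before each loop.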
25
Here is a simple example that illustrates the key features of loops. I have a four-category ordinal variable y with values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that equal 1 if y is less than the indicated value, else 0. I can create the variables with three generate commands:
    . generate y_lt2 = y<2 if y<.
    . generate y_lt3 = y<3 if y<.
    . generate y_lt4 = y<4 if y<.
26
I can do the same thing with a foreach loop:
    1> foreach cutpt in 2 3 4 {
    2>     generate y_lt`cutpt' = y<`cutpt' if y<.
    3> }
The first time through, the local cutpt is assigned the first value in the list. Next, the generate command is run, where `cutpt' is replaced by the value assigned to cutpt. The first time through the loop, line 2 is evaluated as:
    . generate y_lt2 = y<2 if y<.
Next, the closing brace } is encountered, which sends us back to the foreach command in line 1. In the second pass, foreach assigns cutpt to the second value in the list, which means that the generate command is evaluated as:
    . generate y_lt3 = y<3 if y<.
This continues once more, assigning cutpt to 4. When the foreach loop ends, three variables have been generated.
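The same loop can be run on data that ships with Stata. In this sketch the bundled auto dataset and its variable rep78 (values 1 to 5, with some missing values) are assumptions standing in for the slide's y.

```stata
sysuse auto, clear                  // bundled data (assumption)
foreach cutpt in 2 3 4 {
    * 1 if rep78 is below the cutpoint, 0 otherwise, missing if rep78 is missing
    generate rep_lt`cutpt' = rep78 < `cutpt' if rep78 < .
    label variable rep_lt`cutpt' "rep78 is less than `cutpt'"
}
tabulate rep78 rep_lt3, missing
```

The if rep78 < . condition matters because Stata treats missing values as larger than any number; without it, comparisons involving missing values would be coded 0 rather than left missing.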
27
foreach and forvalues loops
28
Suppose that I need variables that are interactions between the binary variable male and a set of independent variables. I can do this quickly with a loop:
    foreach varname of varlist yr89 white age ed prst {
        generate maleX`varname' = male*`varname'
        label var maleX`varname' "male*`varname'"
    }
To examine the new variables and their labels, I use codebook:
    . codebook maleX*, compact

    Variable      Obs  Unique      Mean  Min  Max  Label
    ---------------------------------------------------------------------------
    maleXyr89    2293       2  .1766245    0    1  male*yr89
    maleXwhite   2293       2  .4147405    0    1  male*white
    maleXage     2293      71  20.50807    0   89  male*age
    maleXed      2293      21  5.735717    0   20  male*ed
    maleXprst    2293      59  18.76625    0   82  male*prst
    ---------------------------------------------------------------------------
How can we use what we learned about extended macros to improve upon this?
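One possible answer, offered as a sketch rather than as the talk's own solution, is to pull each variable's label with the extended macro function variable label and use it in the new label. The bundled auto dataset and the variable foreign (in place of male) are assumptions.

```stata
sysuse auto, clear                  // bundled data (assumption)
foreach varname of varlist mpg weight length {
    * extended macro function: copy the variable's label into a local
    local varlabel : variable label `varname'
    generate foreignX`varname' = foreign*`varname'
    label var foreignX`varname' "foreign * `varlabel'"
}
codebook foreignX*, compact
```

The interaction labels now describe the variables in words instead of repeating their names.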
29
foreach and forvalues loops
30
Suppose I want to estimate discrete change for a sd (using -prvalue, save- & -dif-) for multiple continuous variables. Earlier, we used the following commands:
    . local c "age"
    . sum `c'
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
To expand this to multiple continuous variables, we'll use a -foreach- loop:
    foreach var in age lwg {
        qui sum `var'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `var' from `sddn' to `sdup'"
        qui prvalue, x(`var'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`var'=`sdup') rest(mean) dif label(SD+)
    }
31
Output:

**Change in age from 38.50156159842585 to 46.57413561272952

logit: Change in Predictions for lfp
Confidence intervals by delta method

                        SD+       SD-
                    Current     Saved    Change    95% CI for Change
  Pr(y=inLF|x):      0.5150    0.6382   -0.1232   [-0.1717, -0.0747]
  Pr(y=NotInLF|x):   0.4850    0.3618    0.1232   [ 0.0747,  0.1717]

                 k5       k618        age        wc         hc        lwg        inc
  Current=  .2377158  1.3532537  46.574136  .2815405  .39176627  1.0971148  20.128965
    Saved=  .2377158  1.3532537  38.501562  .2815405  .39176627  1.0971148  20.128965
     Diff=         0          0   8.072574         0          0          0          0

**Change in lwg from .8033366225286708 to 1.390893047643295

logit: Change in Predictions for lfp
Confidence intervals by delta method

                        SD+       SD-
                    Current     Saved    Change    95% CI for Change
  Pr(y=inLF|x):      0.6204    0.5340    0.0865   [ 0.0445,  0.1285]
  Pr(y=NotInLF|x):   0.3796    0.4660   -0.0865   [-0.1285, -0.0445]

                 k5       k618        age        wc         hc        lwg        inc
  Current=  .2377158  1.3532537  42.537849  .2815405  .39176627   1.390893  20.128965
    Saved=  .2377158  1.3532537  42.537849  .2815405  .39176627  .80333662  20.128965
     Diff=         0          0          0         0          0  .58755643          0
32
Question: If I also wanted to compute the discrete change for a sd of income, what would I need to change?
    foreach v in age lwg {
        qui sum `v'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `v' from `sddn' to `sdup'"
        qui prvalue, x(`v'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`v'=`sdup') rest(mean) dif label(SD+)
    }
33
foreach and forvalues loops
34
As mentioned earlier, when we run a command in Stata, it stores its results in memory. We can access them from there & use them in our program. This includes not only scalars [as seen earlier with -sum-] but also matrices:
    . qui logit lfp k5 k618 age wc hc lwg inc
    . ereturn list
    scalars:
        e(N) = 753
        [:::]
    macros:
        e(title) : "Logistic regression"
        [:::]
    matrices:
        e(b) : 1 x 8
        e(V) : 8 x 8
        e(rules) : 1 x 4
    . mat list e(b)
    e(b)[1,8]
                k5        k618         age         wc         hc        lwg  [:::]
    y1   -1.462913  -.06457068  -.06287055  .80727378  .11173357  .60469312
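A runnable sketch of the same idea, assuming Stata's bundled auto dataset and an arbitrary model (foreign on mpg and weight) in place of the labor force example:

```stata
sysuse auto, clear                  // bundled data (assumption)
quietly logit foreign mpg weight
ereturn list                        // scalars, macros, and matrices left by logit

matrix b = e(b)                     // copy the coefficient vector into my own matrix
matrix list b
display "N = " e(N) ", coefficient on mpg = " b[1,1]
```

Copying e(b) into a named matrix keeps the coefficients available after the next estimation command replaces the e() results.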
35
Many commands create matrices that we can use to, e.g., build cumulative matrices. For example, running -prvalue, save- & -dif- generates the following matrices:
    . prvalue, x(age=20) dif
    [:::]
    . matrix dir
        _PEtemp[3,7]
        pedifsep[2,1]
        pelower[7,2]      // Matrix for lower CI bound
        peupper[7,2]      // Matrix for upper CI bound
        pepred[7,2]       // Matrix that includes discrete change
        peinfo[3,12]
        pebase[3,7]
        PE_in[1,7]
        PE_base[1,7]
        PRVinfo[1,12]
        PRVlower[2,2]
        PRVupper[2,2]
        PRVmisc[1,2]
        PRVprob[1,2]
        PRVbase[1,7]
        _PRVsav[1,6]
        pegrad_pr[2,8]
36
    . matrix list pepred
    pepred[7,2]
                     c1          c2
    1values           0           1
      2prob   .15049911   .84950089
      3misc   1.7306918           .
     saved=   .06454021   .93545979
     saved=   2.6737502           .
     saved=    .0859589   -.0859589    // Discrete change [6,2]
     saved=  -.94305837           .
37
We can make use of these stored matrices to generate our own matrix of discrete change coefficients & confidence intervals:
    matrix dc = J(9,4,.)                  // create empty matrix with 9 rows & 4 columns
    matrix colnames dc = x dc dcLB dcUB   // label columns
    local irow1 = 0                       // initialize a counter that will indicate row where I want to put info
    forvalues n = 30(5)70 {
        local ++irow1                     // this adds 1 to the counter
        prvalue, x(wc=1 age=`n') save rest(mean) lab(WC)
        prvalue, x(wc=0 age=`n') diff rest(mean) lab(noWC)
        matrix dc[`irow1',1] = `n'
        matrix dc[`irow1',2] = pepred[6,2]
        matrix dc[`irow1',3] = pelower[6,2]
        matrix dc[`irow1',4] = peupper[6,2]
        mat list dc
    }
Final output:
    dc[9,4]
          x         dc       dcLB       dcUB
    r1   30  .13744253  .06500561  .20987945
    r2   35  .16046226  .07829062   .2426339
    r3   40  .18021131  .08726909  .27315353
    r4   45  .19384143  .09151363  .29616923
    r5   50  .19910133  .09109842  .30710424
    r6   55  .19506124  .08578821  .30433427
    r7   60   .1824384  .07516111   .2897157
    r8   65  .16334447  .05965045  .26703849
    r9   70  .14059515  .04131751   .2398728
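The counter-and-matrix pattern does not depend on -prvalue-. The sketch below applies it to official commands only, assuming the bundled auto dataset; the matrix name stats and its column names are made up for this illustration.

```stata
sysuse auto, clear                  // bundled data (assumption)
matrix stats = J(5,3,.)             // 5 rows, 3 columns, filled with missing
matrix colnames stats = rep78 mean n
local irow = 0                      // row counter
forvalues r = 1/5 {
    local ++irow
    quietly summarize price if rep78 == `r'
    matrix stats[`irow',1] = `r'
    matrix stats[`irow',2] = r(mean)
    matrix stats[`irow',3] = r(N)
}
matrix list stats
```

Each pass through the loop fills one row, so the finished matrix holds one line of results per value of rep78.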
38
And for my final trick:
    // change matrix to variables
    svmat dc, names(col)
    label var x "value of x"
    label var dc "discrete change"
    label var dcLB "95% CI"
    label var dcUB "95% CI"
    twoway ///
        (connected dcLB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        (connected dc x, msymbol(i) clpat(solid) clwidth(medthick)) ///
        (connected dcUB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        , ytitle("Pr(Wife no college) - Pr(Wife college)") ylabel(0(.2)1) ///
        xtitle(age) xlabel(30(5)70) ///
        legend(pos(11) order(2 1) ring(0) cols(1) region(ls(none))) ///
        title("Labor Force Participation by" "Wife's College Attendance")
39
40
Ado-files
Ado-files are like do-files, except that they are automatically run.
Indeed, ado stands for automatically loaded do-file.
Stata 10 has nearly 2,000 ado-files.
When you run a command, you cannot tell whether it is part of the executable or is an ado-file.
This means that Stata users like you can write new commands and use them just like official Stata commands.
41
Ado-files: An Example
List variable names and labels
nmlabel.ado
42
My first version of nmlabel lists the names and labels with no options. It looks like this:
    *! version 1.0.0 \ trm 2008-03-29
    program define nmlabelV1
        version 10
        syntax varlist
        foreach varname in `varlist' {
            local varlabel : variable label `varname'
            display in yellow "`varname'" _col(10) "`varlabel'"
        }
    end
Here is how the command works:
    . nmlabelV1 lfp-inc
    lfp      Paid Labor Force: 1=yes 0=no
    k5       # kids < 6
    k618     # kids 6-18
    age      Wife's age in years
    wc       Wife College: 1=yes 0=no
    hc       Husband College: 1=yes 0=no
    lwg      Log of wife's estimated wages
    inc      Family income excluding wife's
43
The new version of the program looks like this:
     1> *! version 2.0.0 \ trm 2008-03-29
     2> program define nmlabelV2
     3>     version 10
     4>     syntax varlist [, skip]
     5>     if "`skip'"=="skip" {
     6>         display
     7>     }
     8>     foreach varname in `varlist' {
     9>         local varlabel : variable label `varname'
    10>         display in yellow "`varname'" _col(10) "`varlabel'"
    11>     }
    12> end
If I enter the command with the skip option, the syntax command in line 4 creates a local named skip that contains the string skip:
    local skip "skip"
If I do not specify the skip option, syntax creates the local skip as a null string:
    local skip ""
44
The third version looks like this:
    *! version 3.0.0 \ trm 2008-03-29
    program define nmlabelV3
        version 10
        syntax varlist [, skip NUMber ]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" {    // do not number lines
                display in yellow "`varname'" _col(10) "`varlabel'"
            }
            else {                 // number lines
                display in green "#`varnumber': " ///
                    in yellow "`varname'" _col(13) "`varlabel'"
            }
        }
    end
45
Here is the new ado-file:
    *! version 4.0.0 \ trm 2008-03-29
    program define nmlabelV4
        version 10
        syntax varlist [, skip NUMber COLnum(integer 16)]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" {    // do not number lines
                display in yellow "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
            else {                 // number lines
                display in green "#`varnumber': " ///
                    in yellow _col(6) "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
        }
    end
46
Extra slides
47
Counters are so useful that Stata has a simpler way to increment them. The command local ++counter is equivalent to local counter = `counter' + 1. So instead of this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local counter = `counter' + 1
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }
We can use this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local ++counter
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }
48
Next, I use a matrix command to create a matrix named stats:
    matrix stats = J(`nvars',2,.)
The J function creates a matrix based on three arguments. The first is the number of rows, the second the number of columns, and the third is the value used to fill the matrix. In this case, I want the matrix to be initialized with missing values, which are indicated by a period. The matrix looks like this:
    . matrix list stats
    stats[6,2]
        c1  c2
    r1   .   .
    r2   .   .
    r3   .   .
    r4   .   .
    r5   .   .
    r6   .   .
49
Nested Loops
You can nest loops by placing one loop inside of another loop. Consider the earlier example of creating binary variables indicating if y was less than a given value:
    foreach cutpt in 2 3 4 {
        generate y_lt`cutpt' = y<`cutpt' if y<.
    }
Suppose that I need to do this for variables ya, yb, yc, and yd.
    foreach y of varlist ya yb yc yd {    // loop 1 begins
        foreach cutpt in 2 3 4 {          // loop 2 begins
            * create binary variable
            generate `y'_lt`cutpt' = `y'<`cutpt' if `y'<.
        }                                 // loop 2 ends
    }                                     // loop 1 ends
What is the first variable created? The last?
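One way to check the answer is to run the nested loops on toy data and print each name as it is created. The eight-observation dataset below is entirely hypothetical and exists only so that ya through yd are available.

```stata
clear
set obs 8
* hypothetical toy data: four variables taking the values 1 to 4
foreach y in ya yb yc yd {
    generate `y' = 1 + mod(_n, 4)
}

* outer loop over variables, inner loop over cutpoints
foreach y of varlist ya yb yc yd {
    foreach cutpt in 2 3 4 {
        generate `y'_lt`cutpt' = `y' < `cutpt' if `y' < .
        display "created `y'_lt`cutpt'"
    }
}
describe, simple
```

The inner loop runs through all of its cutpoints before the outer loop moves to the next variable, which is what determines the order of the displayed names.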