Automating Your Work: An Introduction to Programming in Stata
Shawna N. Smith
29 July 2009



…but why?
GSS Mental Health Replication Study
– Respondents received one of four different vignettes: depression, schizophrenia, alcohol abuse, normal troubles
– 38 outcomes [binary]
– Two waves of data: 1996 & 2006
First question: Is there a survey year difference?
– 4 vignettes x 38 outcomes = 152 potential differences

Roadmap
– Writing effective do-files [Review]
– Automation
  – Macros
  – Using stored info
  – foreach and forvalues loops
  – Ado-files {brief preview}

The Workflow of Data Analysis: Principles and Practices, by J. Scott Long
– Much of this talk is from Chapter 4: Automating your work
– For example files: type findit workflow and follow the instructions

[aside] Writing effective do-files
– Robust: To be robust, a do-file must produce exactly the same result when run at a later time or on another computer
– Legible: To be legible, a do-file must be documented and formatted so that it is easy to understand what is being done

Robust
– Self-contained
– Include version control
– Exclude directory information
  – Never hardcode your directory! Rather, set your working directory before you start your work

Legible
– Use comments
– Use alignment and indentation
– Use short lines [<80 characters]
– Limit the use of abbreviations

Automating Your Work

Macros
A macro assigns a string of text or a number to an abbreviation.
Two types of macros: local & global
Global
– Persists until you delete it or exit Stata
– Can lead to do-files that unintentionally depend on a global macro created by another do-file
– Such do-files are not robust and can lead to unpredictable results
Local
– Can only be used within the do-file or ado-file in which it is defined
– When that program ends, the local macro disappears
Macros are the simplest tool for automating your work.
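The difference in scope can be seen with a small sketch. The program name setmacros and both macro names are hypothetical, made up only for this illustration; a program is used here because it gives the local a scope that ends when the program ends.

```stata
* Hypothetical helper program, used only to show macro scope
capture program drop setmacros
program define setmacros
    local  inside  "set inside the program"
    global OUTSIDE "set inside the program"
end

setmacros
display "local  inside:  [`inside']"    // empty: the local vanished when setmacros ended
display "global OUTSIDE: [$OUTSIDE]"    // still defined after the program ended

macro drop OUTSIDE                      // clean up so nothing later depends on it
```

The leftover global is exactly the kind of hidden dependency that makes a do-file less robust.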

Syntax
local local-name "string"
– local rhs "var1 var2 var3"
– display "The local rhs contains: `rhs'"
local local-name = expression
– local ncases = 198
– display "The local ncases equals: `ncases'"
With the equals sign, expression is limited to 80 characters; without, "string" is limited to 67,784 characters. It is usually better to use "string".
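A short sketch of the two forms side by side. The locals copied and evaluated are hypothetical names added here to show that, without the equals sign, the text is stored as typed, while with it the expression is evaluated first.

```stata
local rhs "var1 var2 var3"
display "The local rhs contains: `rhs'"

local ncases = 198
display "The local ncases equals: `ncases'"

* Without the equals sign the text is copied; with it, the expression is evaluated
local copied    2 + 2
local evaluated = 2 + 2
display "copied: `copied'   evaluated: `evaluated'"
```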

Here is a simple example. I want to estimate the model:
    . logit y var1 var2 var3
I can create the macro rhs with the names of the independent or right-hand-side variables:
    . local rhs "var1 var2 var3"
Then, I can write the logit command as:
    . logit y `rhs'
where the ` and ' indicate that I want to insert the contents of the macro rhs. That is, the command
    . logit y `rhs'
works exactly the same as
    . logit y var1 var2 var3
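To try this without the talk's data, here is a minimal sketch that assumes the auto dataset shipped with Stata; the variables foreign, weight, and mpg are stand-ins for y and var1 to var3.

```stata
sysuse auto, clear            // example data shipped with Stata, not the talk's data
local rhs "weight mpg"
logit foreign `rhs'           // runs exactly as: logit foreign weight mpg
```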


Macros can be combined to specify a sequence of nested models. First, I create macros for four groups of independent variables:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
Next, I specify four nested models. The first model includes only the first set of variables and is specified as:
    . local model_1 "`set1_age'"
The macro model_2 combines the content of the local model_1 with the variables in local set2_educ:
    . local model_2 "`model_1' `set2_educ'"
The next two models are specified the same way:
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"

Next, I check the variables in each model:
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
Using these locals, I estimate a series of logits:
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'

The whole thing:
    . local set1_age "age agesquared"
    . local set2_educ "wc hc"
    . local set3_kids "k5 k618"
    . local set4_money "lwg inc"
    . local model_1 "`set1_age'"
    . local model_2 "`model_1' `set2_educ'"
    . local model_3 "`model_2' `set3_kids'"
    . local model_4 "`model_3' `set4_money'"
    . display "model_1: `model_1'"
    model_1: age agesquared
    . display "model_2: `model_2'"
    model_2: age agesquared wc hc
    . display "model_3: `model_3'"
    model_3: age agesquared wc hc k5 k618
    . display "model_4: `model_4'"
    model_4: age agesquared wc hc k5 k618 lwg inc
    . logit lfp `model_1'
    . logit lfp `model_2'
    . logit lfp `model_3'
    . logit lfp `model_4'
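The same pattern in a form that runs on data shipped with Stata. This is a sketch under assumptions: the auto dataset and the set names set1_size and set2_fuel are illustrative, and the forvalues loop (covered later in the talk) is only a shortcut for typing each logit command.

```stata
sysuse auto, clear
local set1_size "weight"
local set2_fuel "mpg"
local model_1 "`set1_size'"
local model_2 "`model_1' `set2_fuel'"

* `model_`i'' is resolved from the inside out: first `i', then model_1 or model_2
forvalues i = 1/2 {
    display _newline "model_`i': `model_`i''"
    logit foreign `model_`i''
}
```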


Saved results
– Stata commands send results to your log file but also save those results to memory
– Drukker's Dictum: Never type anything that you can obtain from a saved result
– This information can be moved into macros and matrices, and used in many ways

Consider a simple example using -prvalue-. Use -prvalue- to calculate the discrete change for a 1-SD change in age centered on the mean (from mean - SD/2 to mean + SD/2). [The old way…]
    . sum age
    Variable | Obs Mean Std. Dev. Min Max
    age |
    . di ( /2)
    . di ( /2)
    . qui prvalue, x(age= ) rest(mean) save label(SD-)
    . prvalue, x(age= ) rest(mean) dif label(SD+)
    :::

A simpler [& more robust] way:
    . local c "age"
    . sum `c'
    Variable | Obs Mean Std. Dev. Min Max
    age |
    . return list
    scalars:
        r(N) = 753
        r(sum_w) = 753
        r(mean) =        // scalar for mean of age
        r(Var) =
        r(sd) =          // scalar for sd of age
        r(min) = 30
        r(max) = 60
        r(sum) =
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
    :::
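The saved-results step can be tried on its own with a minimal sketch. It assumes the auto dataset shipped with Stata and uses mpg in place of age; the -prvalue- calls are left out because -prvalue- is a user-written command that may not be installed.

```stata
sysuse auto, clear
local c "mpg"
quietly summarize `c'
return list                              // shows r(mean), r(sd), and the rest
local sdup = r(mean) + (r(sd)/2)
local sddn = r(mean) - (r(sd)/2)
display "`c': from `sddn' to `sdup'"
```

Changing the variable means editing only the first local; nothing else is typed by hand.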

Question: I discover a problem with my age variable & decide to change my local c to income. Which parts of the above code do I need to change if: [1] I 'hardcoded' my numbers; & [2] I used the locals & scalars?


foreach and forvalues loops
– Loops let you execute a group of commands multiple times
– By combining macros with loops, you can speed up tasks ranging from creating variables to estimating models
– Loops can be used in many ways that make your workflow faster and more accurate. For example:
  – Creating interaction variables
  – Using the same command for multiple variables
  – Using information returned by Stata for other purposes

Syntax: foreach
    foreach local-name in | of list-type list {
        commands referring to `local-name'
    }
– foreach name in var1 var2 var3 {
– foreach var of varlist var1-var10 {
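A small sketch of the two forms, assuming the auto dataset shipped with Stata. With "in" the list is taken word by word as typed; with "of varlist" Stata first expands and checks the variable list, so a range such as price-rep78 works.

```stata
sysuse auto, clear

* "in": the list is taken literally, word by word
foreach name in price mpg rep78 {
    display "in list:    `name'"
}

* "of varlist": Stata expands and checks the variable list first
foreach var of varlist price-rep78 {
    display "of varlist: `var'"
}
```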

Syntax: forvalues
    forvalues lname = range {
        commands referring to `lname'
    }
– forvalues nage = 40(5)80 {
– forvalues n = 1(.1)100 {

Syntax        Meaning                              Example     Generates
#1(#d)#2      From #1 to #2 in steps of #d         1(2)10      1, 3, 5, 7, 9
#1/#2         From #1 to #2 in steps of 1          1/10        1, 2, 3, ..., 10
#1 #t to #2   From #1 to #2 in steps of (#t-#1)    1 4 to 15   1, 4, 7, 10, 13
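Two of the range forms from the table can be checked with a quick sketch; the local steps is a hypothetical name used only to collect the generated values.

```stata
* #1(#d)#2: from 1 to 10 in steps of 2
forvalues i = 1(2)10 {
    display "1(2)10 -> `i'"
}

* #1 #t to #2: the step is the difference between the first two numbers
local steps ""
forvalues i = 1 4 to 15 {
    local steps "`steps' `i'"
}
display "1 4 to 15 ->`steps'"
```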

Here is a simple example that illustrates the key features of loops. I have a four-category ordinal variable y with values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that equal 1 if y is less than the indicated value, else 0. I can create the variables with three generate commands:
    . generate y_lt2 = y<2 if y<.
    . generate y_lt3 = y<3 if y<.
    . generate y_lt4 = y<4 if y<.

I can do the same thing with a foreach loop:
    1> foreach cutpt in 2 3 4 {
    2>     generate y_lt`cutpt' = y<`cutpt' if y<.
    3> }
The first time through, the local cutpt is assigned the first value in the list. Next, the generate command is run, where `cutpt' is replaced by the value assigned to cutpt. The first time through the loop, line 2 is evaluated as:
    . generate y_lt2 = y<2 if y<.
Next, the closing brace } is encountered, which sends us back to the foreach command in line 1. In the second pass, foreach assigns cutpt to the second value in the list, which means that the generate command is evaluated as:
    . generate y_lt3 = y<3 if y<.
This continues once more, assigning cutpt to 4. When the foreach loop ends, three variables have been generated.
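A runnable version of the same loop, as a sketch that assumes the auto dataset shipped with Stata. The ordinal variable rep78 stands in for y, and the prefix rep_lt is a hypothetical name; the "if rep78 < ." condition keeps missing values missing.

```stata
sysuse auto, clear
foreach cutpt in 2 3 4 {
    generate rep_lt`cutpt' = rep78 < `cutpt' if rep78 < .
}
tabulate rep78 rep_lt3, missing
```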


Suppose that I need variables that are interactions between the binary variable male and a set of independent variables. I can do this quickly with a loop:
    foreach varname of varlist yr89 white age ed prst {
        generate maleX`varname' = male*`varname'
        label var maleX`varname' "male*`varname'"
    }
To examine the new variables and their labels, I use codebook:
    . codebook maleX*, compact
    Variable     Obs Unique Mean Min Max   Label
    maleXyr89                              male*yr89
    maleXwhite                             male*white
    maleXage                               male*age
    maleXed                                male*ed
    maleXprst                              male*prst
How can we use what we learned about extended macros to improve upon this?
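One possible answer to that question, as a sketch rather than the talk's own solution: the extended macro function "variable label" retrieves each variable's label, so the interaction's label can describe the variable instead of repeating its name. The auto dataset and the prefix forX are assumptions made for illustration, with foreign standing in for male.

```stata
sysuse auto, clear
foreach varname of varlist mpg weight length {
    local varlabel : variable label `varname'      // extended macro function
    generate forX`varname' = foreign * `varname'
    label var forX`varname' "foreign * `varlabel'"
}
codebook forX*, compact
```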


Suppose I want to estimate the discrete change for a 1-SD change centered on the mean (using -prvalue, save- & -dif-) for multiple continuous variables. Earlier, we used the following commands:
    . local c "age"
    . sum `c'
    . local sdup = r(mean) + (r(sd)/2)
    . local sddn = r(mean) - (r(sd)/2)
    . qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)
    . prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)
To expand this to multiple continuous variables, we'll use a -foreach- loop:
    foreach var in age lwg {
        qui sum `var'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `var' from `sddn' to `sdup'"
        qui prvalue, x(`var'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`var'=`sdup') rest(mean) dif label(SD+)
    }

Output:
    **Change in age from to
    logit: Change in Predictions for lfp
    Confidence intervals by delta method
                          SD+ Current   SD- Saved   Change   95% CI for Change
    Pr(y=inLF|x):                                             [ , ]
    Pr(y=NotInLF|x):                                          [ , ]
                 k5  k618  age  wc  hc  lwg  inc
    Current=
    Saved=
    Diff=

    **Change in lwg from to
    logit: Change in Predictions for lfp
    Confidence intervals by delta method
                          SD+ Current   SD- Saved   Change   95% CI for Change
    Pr(y=inLF|x):                                             [ , ]
    Pr(y=NotInLF|x):                                          [ , ]
                 k5  k618  age  wc  hc  lwg  inc
    Current=
    Saved=
    Diff=
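For readers without -prvalue- installed, the loop can be sketched with built-in commands only. This is not the talk's method: it assumes the auto dataset, a two-predictor logit, and computes the change in the predicted probability by hand with invlogit(), holding the other predictor at its mean. The locals other, xb_rest, and change are hypothetical names.

```stata
sysuse auto, clear
quietly logit foreign weight mpg
foreach var in weight mpg {
    * the other predictor is held at its mean
    local other = cond("`var'" == "weight", "mpg", "weight")
    quietly summarize `other'
    local xb_rest = _b[_cons] + _b[`other'] * r(mean)

    quietly summarize `var'
    local sdup = r(mean) + (r(sd)/2)
    local sddn = r(mean) - (r(sd)/2)
    local change = invlogit(`xb_rest' + _b[`var'] * `sdup') ///
                 - invlogit(`xb_rest' + _b[`var'] * `sddn')
    display "**Change in `var' from `sddn' to `sdup': `change'"
}
```

Unlike -prvalue-, this sketch reports no confidence interval for the change.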

Question: If I wanted to additionally compute the discrete change for a 1-SD change centered on the mean for income, what would I need to change?
    foreach v in age lwg {
        qui sum `v'
        local sdup = r(mean) + (r(sd)/2)
        local sddn = r(mean) - (r(sd)/2)
        di ""
        di "**Change in `v' from `sddn' to `sdup'"
        qui prvalue, x(`v'=`sddn') rest(mean) save label(SD-)
        prvalue, x(`v'=`sdup') rest(mean) dif label(SD+)
    }


As mentioned earlier, when we run a command in Stata, it stores the information in memory. We can access it from there & use it in our program. This includes not only scalars [as seen earlier with -sum-] but also matrices:
    . qui logit lfp k5 k618 age wc hc lwg inc
    . ereturn list
    scalars:
        e(N) = 753
        [:::]
    macros:
        e(title) : "Logistic regression"
        [:::]
    matrices:
        e(b) : 1 x 8
        e(V) : 8 x 8
        e(rules) : 1 x 4
    . mat list e(b)
    e(b)[1,8]
        k5 k618 age wc hc lwg [:::]
    y
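A minimal sketch of pulling estimation results out of e(), assuming the auto dataset shipped with Stata rather than the talk's data. The matrix name b is arbitrary; the coefficient vector is copied so that its elements can be used by position.

```stata
sysuse auto, clear
quietly logit foreign weight mpg
ereturn list
matrix b = e(b)                  // 1 x 3 row vector: weight, mpg, _cons
matrix list b
display "N = " e(N) ", coefficient on mpg = " b[1,2]
```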

Many commands create matrices that we can use to, e.g., build cumulative matrices. For example, running -prvalue, save- & -dif- generates the following matrices:
    . prvalue, x(age=20) dif
    [:::]
    . matrix dir
        _PEtemp[3,7]
        pedifsep[2,1]
        pelower[7,2]     // Matrix for lower CI bound
        peupper[7,2]     // Matrix for upper CI bound
        pepred[7,2]      // Matrix that includes discrete change
        peinfo[3,12]
        pebase[3,7]
        PE_in[1,7]
        PE_base[1,7]
        PRVinfo[1,12]
        PRVlower[2,2]
        PRVupper[2,2]
        PRVmisc[1,2]
        PRVprob[1,2]
        PRVbase[1,7]
        _PRVsav[1,6]
        pegrad_pr[2,8]

    . matrix list pepred
    pepred[7,2]
                 c1    c2
    1values       0     1
    2prob
    misc
    saved=
    saved=
    saved=                 // Discrete change [6,2]
    saved=

We can make use of these stored matrices to generate our own matrix of discrete change coefficients & confidence intervals:
    matrix dc = J(9,4,.)                 // create empty matrix with 9 rows & 4 columns
    matrix colnames dc = x dc dcLB dcUB  // label columns
    local irow1 = 0                      // initialize a counter that will indicate row where I want to put info
    forvalues n = 30(5)70 {
        local ++irow1                    // this adds 1 to the counter
        prvalue, x(wc=1 age=`n') save rest(mean) lab(WC)
        prvalue, x(wc=0 age=`n') diff rest(mean) lab(noWC)
        matrix dc[`irow1',1] = `n'
        matrix dc[`irow1',2] = pepred[6,2]
        matrix dc[`irow1',3] = pelower[6,2]
        matrix dc[`irow1',4] = peupper[6,2]
        mat list dc
    }
Final output:
    dc[9,4]
         x   dc   dcLB   dcUB
    r1
    r2
    r3
    r4
    r5
    r6
    r7
    r8
    r9
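The fill-a-matrix-in-a-loop pattern can be tried without -prvalue-. The sketch below is an assumption-laden stand-in: it uses the auto dataset and a linear regression, and stores a predicted price for each value of mpg instead of a discrete change. The matrix name phat and its column names are hypothetical.

```stata
sysuse auto, clear
quietly regress price mpg
matrix phat = J(5, 2, .)                 // 5 rows: mpg = 15, 20, ..., 35
matrix colnames phat = mpg price_hat
local irow = 0                           // row counter
forvalues n = 15(5)35 {
    local ++irow
    matrix phat[`irow', 1] = `n'
    matrix phat[`irow', 2] = _b[_cons] + _b[mpg] * `n'
}
matrix list phat
```

The number of rows passed to J() has to match the number of values the forvalues range generates, just as J(9,4,.) matches 30(5)70 above.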

And for my final trick:
    // change matrix to variables
    svmat dc, names(col)
    label var x "value of x"
    label var dc "discrete change"
    label var dcLB "95% CI"
    label var dcUB "95% CI"
    twoway ///
        (connected dcLB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        (connected dc x, msymbol(i) clpat(solid) clwidth(medthick)) ///
        (connected dcUB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
        , ytitle("Pr(Wife no college) - Pr(Wife college)") ylabel(0(.2)1) ///
        xtitle(age) xlabel(30(5)70) ///
        legend(pos(11) order(2 1) ring(0) cols(1) region(ls(none))) ///
        title("Labor Force Participation by" "Wife's College Attendance")
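The svmat step can be tested in isolation with a toy matrix. Everything in this sketch is made up for illustration (the matrix m and its columns x and y); it only shows that names(col) turns the column names into variable names that a graph command can then use.

```stata
clear
matrix m = (1, 2 \ 2, 4 \ 3, 6)
matrix colnames m = x y
svmat m, names(col)                 // creates variables x and y with 3 observations
label var x "value of x"
label var y "value of y"
list
twoway connected y x, msymbol(i) clpat(solid)
```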


Ado-files
– Ado-files are like do-files, except that they are automatically run
– Indeed, ado stands for automatically loaded do-file
– Stata 10 has nearly 2,000 ado-files
– When you run a command, you cannot tell whether it is part of the executable or is an ado-file
– This means that Stata users like you can write new commands and use them just like official Stata commands

Ado-files: An Example
– List variable names and labels
– nmlabel.ado

My first version of nmlabel lists the names and labels with no options. It looks like this:
    *! version \ trm
    program define nmlabelV1
        version 10
        syntax varlist
        foreach varname in `varlist' {
            local varlabel : variable label `varname'
            display in yellow "`varname'" _col(10) "`varlabel'"
        }
    end
Here is how the command works:
    . nmlabelV1 lfp-inc
    lfp      Paid Labor Force: 1=yes 0=no
    k5       # kids < 6
    k618     # kids 6-18
    age      Wife's age in years
    wc       Wife College: 1=yes 0=no
    hc       Husband College: 1=yes 0=no
    lwg      Log of wife's estimated wages
    inc      Family income excluding wife's

The new version of the program looks like this:
     1> *! version \ trm
     2> program define nmlabelV2
     3>     version 10
     4>     syntax varlist [, skip]
     5>     if "`skip'"=="skip" {
     6>         display
     7>     }
     8>     foreach varname in `varlist' {
     9>         local varlabel : variable label `varname'
    10>         display in yellow "`varname'" _col(10) "`varlabel'"
    11>     }
    12> end
If I enter the command with the skip option, the syntax command in line 4 creates a local named skip that contains the string skip:
    local skip "skip"
If I do not specify the skip option, syntax creates the local skip as a null string:
    local skip ""

The third version looks like this:
    *! version \ trm
    program define nmlabelV3
        version 10
        syntax varlist [, skip NUMber ]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" {    // do not number lines
                display in yellow "`varname'" _col(10) "`varlabel'"
            }
            else {                 // number lines
                display in green "#`varnumber': " ///
                    in yellow "`varname'" _col(13) "`varlabel'"
            }
        }
    end

Here is the new ado-file:
    *! version \ trm
    program define nmlabelV4
        version 10
        syntax varlist [, skip NUMber COLnum(integer 16)]
        if "`skip'"=="skip" {
            display
        }
        local varnumber = 0
        foreach varname in `varlist' {
            local ++varnumber
            local varlabel : variable label `varname'
            if "`number'"=="" {    // do not number lines
                display in yellow "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
            else {                 // number lines
                display in green "#`varnumber': " ///
                    in yellow _col(6) "`varname'" ///
                    _col(`colnum') "`varlabel'"
            }
        }
    end
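To see what the syntax command hands to the program, a small sketch helps. The program showopt is hypothetical and not part of the talk; it parses the same option list as the version above and simply reports the locals that syntax creates. It is declared rclass only so the parsed values can also be returned in r().

```stata
* Hypothetical program, used only to show what -syntax- puts into locals
capture program drop showopt
program define showopt, rclass
    version 10
    syntax varlist [, skip NUMber COLnum(integer 16)]
    display "varlist = `varlist'"
    display "skip    = [`skip']"      // "skip" if the option was given, else empty
    display "number  = [`number']"    // capitals mark the minimal abbreviation: num
    display "colnum  = `colnum'"      // 16 unless colnum() is specified
    return local number "`number'"
    return scalar colnum = `colnum'
end

sysuse auto, clear
showopt mpg weight
showopt mpg weight, num col(20)
```

An option that is not declared in the syntax statement is rejected with an error before any of the program's own commands run.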

Extra slides

Counters are so useful that Stata has a simpler way to increment them. The command local ++counter is equivalent to local counter = `counter' + 1. So instead of this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local counter = `counter' + 1
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }
We can use this:
    local counter = 0
    foreach varname of varlist warm yr89 male white age ed prst {
        local ++counter
        local varlabel : variable label `varname'
        display "`counter'. `varname'" _col(12) "`varlabel'"
    }

Next, I use a matrix command to create a matrix named stats:
    matrix stats = J(`nvars',2,.)
The J function creates a matrix based on three arguments. The first is the number of rows, the second the number of columns, and the third is the value used to fill the matrix. In this case, I want the matrix to be initialized with missing values, which are indicated by a period. The matrix looks like this:
    . matrix list stats
    stats[6,2]
        c1  c2
    r1   .   .
    r2   .   .
    r3   .   .
    r4   .   .
    r5   .   .
    r6   .   .
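The slide does not show where nvars comes from. One way to obtain it, offered here as an assumption rather than the author's code, is to count the words in a list of variable names with the extended macro function "word count"; the local vars is a hypothetical list.

```stata
local vars "mpg weight length"          // hypothetical list of variables
local nvars : word count `vars'         // number of names in the list
matrix stats = J(`nvars', 2, .)         // one row per variable, filled with missing
matrix list stats
```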

Nested Loops
You can nest loops by placing one loop inside of another loop. Consider the earlier example of creating binary variables indicating if y was less than a given value:
    foreach cutpt in 2 3 4 {
        generate y_lt`cutpt' = y<`cutpt' if y<.
    }
Suppose that I need to do this for variables ya, yb, yc, and yd.
    foreach y of varlist ya yb yc yd {    // loop 1 begins
        foreach cutpt in 2 3 4 {          // loop 2 begins
            * create binary variable
            generate `y'_lt`cutpt' = `y'<`cutpt' if `y'<.
        }                                 // loop 2 ends
    }                                     // loop 1 ends
What is the first variable created? The last?
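The nested loop can be run on made-up data to check the order in which variables are created. In this sketch the two ordinal variables ya and yb are generated artificially, which is an assumption for illustration only.

```stata
clear
set obs 8
generate ya = mod(_n, 4) + 1            // artificial ordinal variables, values 1 to 4
generate yb = mod(_n + 1, 4) + 1

foreach y of varlist ya yb {            // outer loop: one pass per variable
    foreach cutpt in 2 3 4 {            // inner loop: runs fully for each variable
        generate `y'_lt`cutpt' = `y' < `cutpt' if `y' < .
    }
}
describe, simple                        // new variables appear in creation order
```

The inner loop finishes all of its cutpoints before the outer loop moves to the next variable.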