CCPR Computing Services More Efficient Programming Courtney Engel October 12, 2007.


1 CCPR Computing Services More Efficient Programming Courtney Engel October 12, 2007

2 Outline
- Overview of programming
- Thinking through a programming task
- Ways of efficiently documenting and organizing your project
  - Naming variables, programs, files
  - Commenting code
  - Including file header
  - Implementing directory structure
- Programming constructs
  - Examples
- Raw data -> finished product: Replicable?

3 Overview
- “Recipe” to complete a given task
  - Commands that tell your computer what to do
  - Language standards determine correct commands
- Basic programming allows you to:
  - Read, write, and reformat data files
  - Perform data calculations
  - Have the computer complete mundane tasks and minimize human error

4 Before you start coding…
- Conceptualize
- Clearly define the problem in writing
- Write down the solution/algorithm in English
  - Modularity
  - Create test (if reasonable)
- Translate one section to code
- Test the section thoroughly
- Translate/Test next section, etc.
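The translate-then-test step can be sketched in Stata as follows. This is a minimal illustration, not part of the original slides: the variable `expensive` and the 10,000 threshold are hypothetical, and the built-in `auto` dataset stands in for real project data.

```stata
* Section 1: flag cars priced above 10,000 (hypothetical rule for illustration)
sysuse auto, clear
generate byte expensive = (price > 10000) if !missing(price)

* Test for section 1: the flag must agree with the rule for every record.
* assert stops the do-file with an error if any observation violates the claim.
assert expensive == 1 if price > 10000 & !missing(price)
assert expensive == 0 if price <= 10000
assert missing(expensive) if missing(price)
```

Only once these assertions pass would the next section be translated and tested in the same way.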

5 Documentation - File Header

    *Josie Bruin (jbruin@ucla.edu)
    *HRS project
    */u/socio/jbruin/HRS/
    *October 5, 2007
    *Stata version 8
    *Purpose: Create and merge two datasets in Stata,
    *         then convert data to SAS
    *Input programs:
    *HRS/staprog/H2002.do,
    *HRS/staprog/x2002.do,
    *HRS/staprog/mergeFiles.do
    *Output:
    *HRS/stalog/H2002.log,
    *HRS/stalog/x2002.log,
    *HRS/stalog/mergeFiles.log
    *HRS/stadata/Hx2002.dta
    *HRS/sasdata/Hx2002.sas
    *Special instructions: Check log files for errors
    *check for duplicates upon new data release

File header includes:
- Name (email)
- Project
- Project location
- Date
- Software version
- Purpose of program
- Inputs
- Outputs
- Special instructions

6 Naming Files, Variables, and Functions
- Use language standard (if it exists)
- Be aware of language-specific rules
  - Max length, underscore, case, reserved words
- Meaningful variable names:
  - LogWt vs. var1
  - AgeLt30 vs. x
- Procedure that cleans missing values of Age:
  - fixMissingAge
- Matrix multiplication, X transpose times X:
  - matXX
- Differentiating log files:
  - Programs: MergeHH.sas, MergeHH.do
  - Log files: MergeHHsas.log, MergeHHsta.log

7 Commenting Code
- Good code is self-commenting
  - Naming conventions, structure/formatting, and header should explain 95%
- Comments should explain
  - Purpose of code, not every detail
  - Tricks used
  - Reasons for unusual coding
- Comments do not
  - fix sloppy code
  - translate syntax
- If it takes longer to read the comment than to read the code, don’t add a comment!

8 Commenting Code - Stata example

SAMPLE 1

    program def function1
    foreach v of varlist _all {
    local x = lower("`v'")
    if `"`v'"' != `"`x'"' {
    rename `v' `=lower("`v'")'
    }
    }
    end

SAMPLE 2

    *Convert names in dataset to lowercase.
    program def lowerVarNames
        foreach v of varlist _all {
            local LowName = lower("`v'")
            if `"`v'"' != `"`LowName'"' {
                rename `v' `=lower("`v'")'
            }
        }
    end

Compare formatting, comments, variable names, and function names.

9 Directory Structure
- A project consists of many different types of files
- Use folders to separate files in a logical way
- Be consistent across projects if possible
- ATTIC folder for older versions

    HOME
        PROJECT NAME
            DATA
            RESULTS
            LOG
            PROGRAMS
            ATTIC
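One way to set up such a layout from within Stata is sketched below. The project name `myproject` is a hypothetical placeholder, and the folders are created relative to the current working directory; the subfolder names follow the slide.

```stata
* Hypothetical project root; replace with your own project name or path
local project "myproject"

* capture suppresses the error mkdir raises when a folder already exists,
* so the do-file can be re-run safely
capture mkdir "`project'"
foreach sub in data results log programs attic {
    capture mkdir "`project'/`sub'"
}
```

Forward slashes are used in the paths because Stata accepts them on Windows as well as on Unix and Mac.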

10 Stata example: using directory structure

    ** Paths:
    global parentpath "C:\Documents and Settings\jbruin\Fall07\prog\progtips"
    global pgmsloc "$parentpath\pgms"
    global logsloc "$parentpath\logs"
    global cleandataloc "$parentpath\data\clean"
    global rawdataloc "$parentpath\data\raw"

    log using "$logsloc\test200710", text replace
    *********************************************************************
    *INSERT FILE HEADER HERE...then it's included in log file.
    *********************************************************************
    macro list

    webuse union, clear
    save "$rawdataloc\union.dta", replace

    keep idcode year age grade
    save "$cleandataloc\unionLJP.dta", replace

    log close

11 Programming Constructs
- Tools to simplify and clarify your coding
- Available in virtually all languages
- Constructs
  - Loops: for, foreach, do, while
  - If/elseif/else: if, then, else, case
  - continue
  - exit
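Of the loop constructs listed, `while` is the only one not demonstrated elsewhere in these slides. A minimal Stata sketch, with a hypothetical counter macro `i`, might look like this:

```stata
* Repeat a block as long as a condition holds
local i = 1
while `i' <= 3 {
    display "Iteration `i'"
    local i = `i' + 1    // without this update the loop would never end
}
```

When the number of iterations is known in advance, `forvalues` or `foreach` is usually the safer choice because the counter is managed for you.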

12 Loop Construct

The syntax for foreach is

    foreach lname {in | of listtype} list {
        Stata commands referring to `lname'
    }

where lname is the name of the new local macro and listtype is the type of list on which you want to operate.
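The difference between `in` and `of listtype` may be easier to see side by side. The sketch below uses the built-in `auto` dataset as a stand-in; the word list and variable choices are illustrative only.

```stata
sysuse auto, clear

* "in": loop over an arbitrary list of words, taken exactly as typed
foreach word in alpha beta gamma {
    display "`word'"
}

* "of varlist": loop over existing variables; Stata checks that they exist
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v' has mean " %9.2f r(mean)
}

* "of numlist": loop over a numeric list, which Stata expands (1/3 becomes 1 2 3)
foreach k of numlist 1/3 10 {
    display `k'
}
```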

13 Loop Example 1 – pulling from 2 lists

From the Stata FAQ website.

Code:

    local animalgrp "cat dog cow pig"
    local noisegrp "meow woof moo oinkoink"
    local n : word count `animalgrp'
    forvalues i = 1/`n' {
        local animal : word `i' of `animalgrp'
        local noise : word `i' of `noisegrp'
        display "`animal' says `noise'"
    }

Resulting output:

    cat says meow
    dog says woof
    cow says moo
    pig says oinkoink

14 Loop Example 2

Given indicator variables white, black, other, and continuous variable EducYrs, create interaction variables.

Solution using loop:

    local allraces "white black other"
    foreach race of varlist `allraces' {
        generate `race'_educ = `race'*EducYrs
    }

    Obs #  White  Black  Other  EducYrs  White_educ  Black_educ  Other_educ
    1      1      0      0      10       10          0           0
    2      0      1      0      15       0           15          0
    3      0      0      1      20       0           0           20

15 Loop Example 3

Problem:
- Dataset contains variables over multiple years (1970-1990)
- Need to perform a number of commands separately for 1970, 1975, 1980, 1985

Solution without loop:

    bysort year: command1 if year==70 | year==75 | year==80 | year==85
    bysort year: command2 if year==70 | year==75 | year==80 | year==85

Solution with loop:

    foreach year in 70 75 80 85 {
        display as result "***Regression for year = `year':"
        regress ln_wage grade tenure ttl_exp if year==`year'
        display as result "***Summarize for year = `year':"
        summarize ln_wage if year==`year'
    }

16 Constructs - If/then/else

Execute section of code if condition is true:

    if condition then
        {execute this code if condition true}
    end

Execute one of two sections of code:

    if condition then
        {execute this code if condition true}
    else
        {execute this code if condition false}
    end

17 If/Else Example

Problem: need to execute commands on an operating system, but only if the OS is Unix. The commands will fail if the OS is anything else.

Solution:

    if "`c(os)'"~="Unix" {
        display as err "Sorry; this section requires Unix OS."
    }
    else {
        ** continue with unix commands…
    }

18 Constructs - Elseif/case

Elseif: execute one of many sections of code:

    if condition1 then
        {execute this code if condition1 true}
    elseif condition2 then
        {execute this code if condition2 true}
    else
        {execute this code if condition1 and condition2 are both false}
    end

Case: same idea, different name:

    case condition1 then
        {execute this code if condition1 true}
    case condition2 then
        {execute this code if condition2 true}
    etc.

19 Elseif Example

Problem: Continue the example from if…else, but execute a different section of code for Unix, Windows, and Mac.

Solution:

    if "`c(os)'"=="Unix" {
        display "This is a Unix environment."
    }
    else if "`c(os)'"=="Windows" {
        display "This is a Windows environment."
    }
    else if "`c(os)'"=="MacOSX" {
        display "This is a MacOS environment."
    }
    else {
        display as err "`c(os)' not recognized."
    }

20 Example

Problem: Given 4 indicator variables (south, union, black, not_smsa) and 2 discrete variables (age, grade), generate 8 new indicator variables:
- south_age21 = south and age > 21
- south_gr12 = south and grade > 12
- Similarly for union, black, not_smsa

Solution without loop: 8 lines of code similar to

    generate south_age21 = (south==1 & age>21 & age<.)
    generate south_gr12 = (south==1 & grade>12 & grade<.)

Solution with loop:

    foreach j in south union black not_smsa {
        generate `j'_age21 = (age>21 & age<. & `j'==1)
        generate `j'_gr12 = (grade>12 & grade<. & `j'==1)
    }

21 Example, cont.

    *CHECK GENERATED VARIABLES AGAINST ORIGINAL VARIABLES
    foreach j in south union black not_smsa {
        quietly count if `j'==1 & age>21 & age<.
        local origCount = r(N)
        quietly count if `j'_age21==1
        if `origCount' ~= `r(N)' {
            display "Counts do not match for `j'_age21!"
        }
        else display "Counts match for `j'_age21."

        quietly count if `j'==1 & grade>12 & grade<.
        local origCount = r(N)
        quietly count if `j'_gr12==1
        if `origCount' ~= `r(N)' {
            display "Counts do not match for `j'_gr12!"
        }
        else display "Counts match for `j'_gr12."
    }

    Obs #  South  Age  Grade  South_age21  South_gr12
    1      1      10   5      0            0
    2      1      35   16     1            1
    3      0      14   9      0            0
    4      0      39   20     0            0
    5      1      56   n/a    1            0
    6      1      20   13     0            1
    7      0      38   11     0            0
    total  4                  2            2

22 Stata - If qualifier vs If command

ifcmd was designed to be used with a single expression.

Example:
- Given variable x with 5 observations: 1, 1, 2, 1, 3
- Compare the following three pieces of Stata code:

    if x==2 {
        replace x=99
    }

    if x==1 {
        replace x=99
    }

    replace x=99 if x==2

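23 Stata - If qualifier vs If command

Starting data:

    list x

         +---+
         | x |
         |---|
      1. | 1 |
      2. | 1 |
      3. | 2 |
      4. | 1 |
      5. | 3 |
         +---+

If command with x==2, run on the starting data (no changes):

    if x==2 {
        replace x=99
    }
    list x

         +---+
         | x |
         |---|
      1. | 1 |
      2. | 1 |
      3. | 2 |
      4. | 1 |
      5. | 3 |
         +---+

If command with x==1, run on the starting data (every observation changed):

    if x==1 {
        replace x=99
    }
    (5 real changes made)
    list x

         +----+
         |  x |
         |----|
      1. | 99 |
      2. | 99 |
      3. | 99 |
      4. | 99 |
      5. | 99 |
         +----+

If qualifier with x==1, run on the starting data (only matching observations changed):

    replace x=99 if x==1
    (3 real changes made)
    list x

         +----+
         |  x |
         |----|
      1. | 99 |
      2. | 99 |
      3. |  2 |
      4. | 99 |
      5. |  3 |
         +----+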
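The results above follow from how Stata evaluates a variable name in the if command: it is read as the value in the first observation, so `if x==1` behaves like `if x[1]==1` and the enclosed command then runs on all observations. The sketch below rebuilds the five-observation example to show this; the display messages are illustrative.

```stata
* Rebuild the example data: x = 1, 1, 2, 1, 3
clear
input x
1
1
2
1
3
end

* The if command evaluates x once, using the first observation,
* so these two tests are equivalent
if x == 1 {
    display "if command: first observation equals 1"
}
if x[1] == 1 {
    display "explicit subscript: first observation equals 1"
}

* The if qualifier is evaluated separately for every observation
count if x == 1
```

Here `count if x == 1` reports 3, matching the "(3 real changes made)" result of the if qualifier.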

24 Constructs -- Continue

Example from Stata online help.

Continue is used to exit the current iteration of a loop and continue with the next iteration.

The following two loops produce the same result:

    forvalues x = 1/10 {
        if mod(`x',2)==1 {
            display "`x' is odd"
            continue
        }
        display "`x' is even"
    }

    forvalues x = 1/10 {
        if mod(`x',2)==1 {
            display "`x' is odd"
        }
        else {
            display "`x' is even"
        }
    }

Note on mod(): 10 divided by 3 is 3 remainder 1, so mod(10,3)=1.

25 Constructs – Exit

Stop execution of program.

Examples:
- Do-file contains a number of data checks followed by analysis commands. If data checks reveal something unacceptable, you can exit out of the do-file before running analysis.
- Program requires user input. If user enters “bad” information, need to quit program.
- Debugging. If a particular error occurs, then break.
- Check denominator prior to dividing. If it equals zero, exit.

In the following code, only “hello” is displayed:

    display "hello"
    exit
    display "goodbye"
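The denominator check mentioned above might be sketched as follows. The built-in `auto` dataset and the variable `price_per_lb` are assumptions for illustration; return code 459 is the one Stata uses for "something that should be true of your data is not".

```stata
sysuse auto, clear

* Data check: stop the do-file if the denominator is ever zero or missing
quietly count if weight == 0 | missing(weight)
if r(N) > 0 {
    display as error "weight is zero or missing in " r(N) " observations; stopping."
    exit 459
}

* Reached only when the check passes
generate price_per_lb = price / weight
```

Exiting with a nonzero return code also stops any do-file that called this one, which is usually what you want when a data check fails.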

26 Raw data to finished product

Raw data -> Analysis data -> Runs/results -> Finished product

27 Raw Data -> Analysis Data
- Always have two distinct data files: the raw data and the analysis data
- A program should completely re-create analysis data from raw data
- NO interactive changes!! Final changes must go in a program!!

28 Raw Data -> Analysis Data
- Document all of the following:
  - Outliers?
  - Errors?
  - Missing data?
  - Changes to the data?
- Remember to check:
  - Consistency across variables
  - Duplicates
  - Individual records, not just summary stats
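A few of these checks can be sketched in Stata. The example assumes the `union` dataset used on the directory-structure slide is reachable through `webuse` and contains `idcode`, `year`, `age`, and `grade`; the flag variable `ageDrops` is a hypothetical name.

```stata
* Assumes an internet connection so that webuse can fetch the example data
webuse union, clear

* Duplicates: is each person-year recorded only once?
duplicates report idcode year

* Missing data: how many records lack age or grade?
count if missing(age)
count if missing(grade)

* Consistency across variables: within a person, age should not fall as year rises
bysort idcode (year): generate byte ageDrops = (age < age[_n-1] & !missing(age[_n-1])) if _n > 1
count if ageDrops == 1

* Individual records: read actual rows, not just the summary counts
list idcode year age ageDrops in 1/20, sepby(idcode)
```

Counts are used here instead of `assert` so the whole block runs even when a check finds problems; in a production do-file the checks that must hold would typically be turned into assertions.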

29 Analysis Data -> Results
- All results should be produced by a program
- Program should use analysis data (not raw)
- Have a “translation” of raw variable names -> analysis variable names -> publication variable names
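One possible way to keep the name translation inside the data itself is to rename in a program and record the raw and publication names in the variable label. The sketch below uses the built-in `auto` dataset; the analysis name `repairRecord` and the label text are hypothetical.

```stata
sysuse auto, clear

* Raw name -> analysis name, done in a program so the step is replicable
rename rep78 repairRecord

* Keep the raw name and the publication name with the variable
label variable repairRecord "Repair record 1978 (raw: rep78)"

* The label appears in describe output and travels with the saved dataset
describe repairRecord
```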

30 Analysis Data -> Results

Document:
- How were variances estimated? Why?
- What algorithms were used and why? Were results robust?
- What starting values were used? Was convergence sensitive?
- Did you perform diagnostics? Include in programs/documentation.

31 Log files
- Your log file should tell a story to the reader.
- As you print results to the log file, include words explaining the results.
- Include not only what your code is doing, but your reasoning and thought process.
- Don’t output everything to the log file; use quietly and noisily in a meaningful way.
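A minimal sketch of using `quietly` and `noisily` to keep a log readable is shown below. The model and the wording of the messages are illustrative, with the built-in `auto` dataset standing in for project data.

```stata
sysuse auto, clear

* Run the regression silently, then report only the numbers the reader needs
quietly regress price mpg weight
display "Model: price on mpg and weight, N = " e(N) ", R-squared = " %5.3f e(r2)

* Inside a quietly block, noisily re-enables output for selected commands
quietly {
    summarize price
    noisily display "Mean price = " %9.2f r(mean)
}
```

The full regression table is suppressed here; whether to show it depends on what story the log is meant to tell.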

32 Project Clean-up
- Create a zip file that contains everything necessary for complete replication
- Use a readme.txt file to describe zip contents
- Delete/archive unused or old files
- Include any referenced files in zip
- When you have a final zip archive containing everything:
  - Open it in its own directory and run the script
  - Check that all the results match

33 CCPR’s Cluster and helping your research
- Software and Data
  - Stata, SAS, R, compilers, text editors, etc.
  - HRS, CPS (Unicon version), AddHealth, IFLS, etc.
- Efficiency
  - Your PC is available for other work when you submit a job to the cluster
  - Faster processors
  - More RAM
  - Easy to share data, programs, etc. with colleagues via the cluster
- Obtain access by requesting an account
  - http://lexis.ccpr.ucla.edu/account/request/

34 Questions/Feedback
- Please email me if you need help in the future
  - cengel@ccpr.ucla.edu

