Generating new variables and manipulating data with STATA Biostatistics 212 Session 2.


1 Generating new variables and manipulating data with STATA Biostatistics 212 Session 2

2 Today... What we did last time, and why it was unrealistic What does “data cleaning” mean? Do and Log files How to generate a variable How to manipulate the data in your new variable How to label variables and otherwise document your work Examples

3 Last time… The dataset was “clean”

4 Last time… The dataset was “clean” Your data will not come pre-entered into STATA!

5 How your data will arrive On paper forms In a text file (comma or tab delimited) In Excel In Access In another data format (SAS, etc)

6 Your first task Import into STATA Options: –Cut and Paste –insheet, infile, or other flexible Stata commands –A convenience program like “Stat/Transfer”
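As a rough illustration of the import step, the sketch below writes a tiny comma-delimited file and reads it back in. The file name demo_raw.csv and its three columns are invented for this example; import delimited assumes Stata 13 or later, and the older insheet form is noted in a comment.

```stata
* Sketch only: demo_raw.csv and its contents are made up for illustration.
clear
file open fh using "demo_raw.csv", write text replace
file write fh "id,sex,gestage" _n
file write fh "1,m,38" _n
file write fh "2,f,40" _n
file close fh

* Stata 13 and later. In older versions the equivalent is:
*   insheet using "demo_raw.csv", comma clear
import delimited using "demo_raw.csv", varnames(1) clear

describe
list
```

The first row of the file supplies the variable names, so a quick describe and list confirms that each column arrived with the expected name and type.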

7 Your second task Figure out what all those variables mean Options –browse, describe, summarize, list in STATA –Refer to a data dictionary –Refer to a data collection form –Guess, or ask the person who gave it to you
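A first look at an unfamiliar dataset might go something like the sketch below. It uses the auto dataset that ships with Stata as a stand-in for real data; codebook and tabulate are not on the slide and are added here only as further suggestions.

```stata
* Stand-in data: the auto dataset shipped with Stata.
sysuse auto, clear

describe                        // variable names, storage types, labels
summarize price mpg rep78       // N, mean, min, max; N reveals missing values
list make price rep78 in 1/5    // look at the first few observations
codebook rep78, compact         // one-line overview of a single variable
tabulate foreign, missing       // frequencies, including any missing values

* browse opens the spreadsheet-style Data Browser (interactive sessions only)
* browse
```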

8 Your second task Example: Neonatal opiate withdrawal data

9 Your second task Example: Neonatal opiate withdrawal data Problems arise… –Sex is m/f, not 1/0 –Gestational age has nonsense values (0, 60) –Breastfeeding has a bunch of weird text values –Drug variables coded y or blank –Many variable names are obscure

10 Your third task You must “clean” your data so it is ready to analyze.

11 Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc) –Derive new variables –Rename variables With common sense, or with a consistent scheme –Label variables –Label the VALUES of coded variables
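Several of these tasks can be sketched on a toy dataset modeled loosely on the problems listed for the neonatal opiate withdrawal data. Everything here is hypothetical: the variable names (sex, gestage, methadone), the four made-up rows, and the 20 to 45 week plausibility range for gestational age are assumptions, not taken from the real data.

```stata
* Toy data; names, values and cutoffs are invented for illustration.
clear
input str1 sex gestage str1 methadone
"m" 38 "y"
"f" 0  ""
"f" 40 ""
"m" 60 "y"
end

* Code the dichotomous variable 1/0, leaving blank sex as missing
generate male = (sex == "m") if sex != ""

* Set nonsense gestational ages to missing (assumed plausible range)
replace gestage = . if gestage < 20 | gestage > 45

* Drug coded "y" or blank: treat blank as 0 (an assumption about the coding)
generate methadone01 = (methadone == "y")

* Rename with a consistent scheme
rename gestage gest_age_wk

list
```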

12 Cleaning your data The importance of documentation –Retracing your steps Document every step using a “do” file

13 What is a “do” file? A text file containing a list of Stata commands Create and edit using Stata’s do-file editor –Open with button or menu –Save file in the normal way – filename.do Run the do file by –The Menus: File/Do… –The button on the editor –A Stata command – do "pathname/filename.do"

14 What is a “do” file? Example using Lab 1 data

15 What is a “do” file? But the results are not documented anywhere. For this, we need a “log” file

16 What is a “log” file? A text file that captures everything that occurs in Stata’s Results window Two formats –Special Stata formatted log files (.smcl) or regular text (.log) Open using: –The log button (4th from the right) –log using "pathname/filename.log", replace Close using: –The log button –log close

17 What is a “log” file? Example using Lab 1 data

18 Using “do” and “log” files together The first command in any do file should open a log file The last command should close it

19 Using “do” and “log” files together Open the dataset you are analyzing WITHIN the log file (that is, after the log using command) –use "pathname/filename.dta", clear –And clear at the end, if you want Make the do file run quickly and automatically –set more off at the beginning –set more on at the end

20 Using “do” and “log” files together For data cleaning do/log files, save the clean data set within the do file –save "pathname/file.dta", replace
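Putting the last few slides together, a data cleaning do file might be laid out roughly as follows. This is only a sketch: the file names cleaning_demo.log and auto_clean.dta are hypothetical, the auto dataset stands in for real raw data, and the capture log close line is a common precaution added here rather than something from the slides.

```stata
* cleaning_demo.do  (hypothetical file name)
capture log close                  // added precaution: close any log left open
log using "cleaning_demo.log", replace text
set more off

* Open the data AFTER the log is open so the log records it.
* Stand-in for:  use "pathname/filename.dta", clear
sysuse auto, clear

* ... cleaning commands go here; comments document each step ...
generate highprice = (price > 6000) if !missing(price)
label variable highprice "Price above 6000"

* Save the clean dataset under a new name, never over the raw file
save "auto_clean.dta", replace

set more on
log close
```

Because every step lives in the do file, rerunning it regenerates both the clean dataset and the log from the untouched raw data.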

21 Data cleaning Basic skill 1 – make a new variable Creating new variables generate newvar = expression An expression can be: –A number (constant) - generate allzeros = 0 –A variable - generate ageclone = age –A function - generate agesqrt = sqrt(age)
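The three kinds of expression can be tried on the auto dataset that ships with Stata. The new variable names below are made up for the example, and the last line adds an arithmetic expression, which is not on the slide.

```stata
sysuse auto, clear

generate allzeros   = 0               // a number (constant)
generate priceclone = price           // a variable
generate pricesqrt  = sqrt(price)     // a function

* Expressions can also combine variables arithmetically
generate price_per_lb = price / weight

summarize allzeros priceclone pricesqrt price_per_lb
```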

22 Data cleaning Basic skill 1 – make a new variable Getting rid of a variable drop var Getting rid of observations drop if boolean exp
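A short sketch of both forms of drop, again on the auto dataset as a stand-in. The keep command at the end is not on the slide; it is the mirror image of drop and is included only for comparison.

```stata
sysuse auto, clear
count                        // observations before dropping anything

drop headroom trunk          // get rid of variables
drop if missing(rep78)       // get rid of observations with no repair record
count

* keep is the mirror image: retain only what is listed or matched
keep if foreign == 1
count
```

Dropping is permanent for the data in memory, which is one more reason to work from a do file that reloads the raw data each time.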

23 Data cleaning Basic skill 2 – manipulating the values Changing the values of a variable replace var = exp [if boolean exp] A boolean expression evaluates to true or false for each observation

24 Data cleaning Basic skill 2 – manipulating the values Examples
generate male = 0
replace male = 1 if sex=="male"
generate ageover50 = 0
replace ageover50 = 1 if age>50
generate complexvar = age
replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
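One caution with the generate-then-replace pattern: Stata treats a numeric missing value (.) as larger than any number, so a condition like age>50 is true when age is missing. The toy data below (four invented ages, one missing) shows the effect and one possible way around it; the variable name ageover50_safe is hypothetical.

```stata
clear
input age
25
51
70
.
end

* Pattern from the examples: the missing age ends up coded 1
generate ageover50 = 0
replace ageover50 = 1 if age > 50

* Alternative sketch: stays missing when age is missing
generate ageover50_safe = (age > 50) if !missing(age)

list
```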

25 Data cleaning Basic skill 2 – manipulating the values Logical operators for boolean expressions:
English                    Stata
Equal to                   ==
Not equal to               !=, ~=
Greater than               >
Greater than/equal to      >=
Less than                  <
Less than/equal to         <=
And                        &
Or                         |
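These operators can be tried out with count if, which reports how many observations satisfy a condition. The sketch below uses the auto dataset as a stand-in; the cutoffs are arbitrary. It also illustrates that & is evaluated before |, so parentheses can change which observations qualify.

```stata
sysuse auto, clear

count if foreign == 1
count if foreign != 1
count if foreign ~= 1                      // same test, alternative spelling
count if price >= 5000 & price <= 10000

* Without parentheses this reads as: foreign==1 | (mpg>25 & price<5000)
count if foreign == 1 | mpg > 25 & price < 5000

* Parentheses force the "or" to be evaluated first
count if (foreign == 1 | mpg > 25) & price < 5000
```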

26 Data cleaning Basic skill 2 – manipulating the values Mathematical operators:
English                    Stata
Add                        +
Subtract                   -
Multiply                   *
Divide                     /
To the power of            ^
Natural log of             ln(expression)
Base 10 log of             log10(expression)
Etcetera…
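A few derived variables built with these operators, using the auto dataset as a stand-in (its weight variable is recorded in pounds). The new variable names are invented for the example. Note that in Stata log() is a synonym for ln(), not a base 10 log.

```stata
sysuse auto, clear

generate weight_kg     = weight * 0.4536    // multiply: pounds to kilograms
generate price_per_mpg = price / mpg        // divide
generate mpg_sq        = mpg^2              // to the power of
generate lnprice       = ln(price)          // natural log
generate log10price    = log10(price)       // base 10 log

summarize weight_kg price_per_mpg mpg_sq lnprice log10price
```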

27 Data cleaning Basic skill 2 – manipulating the values Another way to manipulate data recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression] A more complicated but more flexible command than replace

28 Data cleaning Basic skill 2 – manipulating the values Examples
generate male = 0
recode male 0=1 if sex=="male"
generate raceethnic = race
recode raceethnic 1=6 if ethnic=="hispanic"
(same as: replace raceethnic = 6 if ethnic=="hispanic" & race==1)
generate tertilescac = cac
recode tertilescac min/54=1 55/82=2 83/max=3
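For comparison, here is a sketch of the same idea written with parentheses around each rule and the generate() option, which creates the new variable in one step and leaves the original untouched. It uses rep78 from the auto dataset as a stand-in; the three-level grouping and the name rep3 are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Collapse the 1-5 repair record into three groups in a new variable.
* Values not matched by any rule, including missing, are carried over as-is.
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), generate(rep3)

* Cross-tabulate old against new to check the recode
tabulate rep78 rep3, missing
```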

29 Data cleaning Basic skill 3 – labeling variables You can label:
–A dataset: label data "label"
–A variable: label var varname "label"
–Values of a variable (2-step process):
label define labelname value1 "label1" [value2 "label2"…]
label values varname labelname
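All three kinds of label in one short sketch, using the auto dataset as a stand-in. The variable expensive, the 6000 cutoff, and the label name yesno are made up for the example.

```stata
sysuse auto, clear

* Dataset label
label data "1978 automobile data, labeling demo"

* Variable label
generate expensive = (price > 6000) if !missing(price)
label variable expensive "Price above 6000"

* Value labels: step 1 defines the mapping, step 2 attaches it
label define yesno 0 "No" 1 "Yes"
label values expensive yesno

describe expensive
tabulate expensive
```

Once defined, a value label such as yesno can be attached to any number of 1/0 variables with further label values commands.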

30 Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc) –Derive new variables –Rename variables With common sense, or with a consistent scheme –Label variables –Label the VALUES of coded variables

31 Data cleaning Example: Neonatal opiate withdrawal data

32 Data cleaning At the end of the day you have: –1 raw data file, original format –1 raw data file, Stata format –1 do file that cleans it up –1 log file that documents the cleaning –1 clean data file, Stata format

33 Summary Do files and Log files –Always pair them –Do everything inside the do file so you can run it repeatedly and easily while you do an analysis –Insert comments to document what you did

34 Summary, cont Generating and manipulating variables –Absolutely necessary skill for using Stata –Always check your work –Watch out for missing values
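One possible way to check a newly derived variable is to cross-tabulate it against its source with the missing option and then state the expectations as assert commands, which stop a do file if they fail. The sketch uses rep78 from the auto dataset as a stand-in; the variable lowrep and its cutoff are invented for the example.

```stata
sysuse auto, clear

* Derived variable: 1 if repair record is 1 or 2, missing if rep78 is missing
generate byte lowrep = (rep78 <= 2) if !missing(rep78)

* Eyeball check: every old value against every new value, missings included
tabulate rep78 lowrep, missing

* Automated checks: the do file halts with an error if either is false
assert inlist(lowrep, 0, 1) | missing(lowrep)
assert missing(lowrep) == missing(rep78)

count if missing(lowrep)
```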

35 Summary, cont Labeling –Label as much as you can

36 Lab this week It’s long It’s important It’s hard I compromised – the do file “template” The next ones will be shorter and easier

37 Preview of next week… Using Excel –What is it good for? –Formulas –Designing a good spreadsheet –Formatting

38 See you on Thursday! Lab 2 due 10/19 Bring a floppy disc to all labs!

