1
Generating new variables and manipulating data with Stata. Biostatistics 212, Lecture 3
2
Housekeeping –Lab 1 cleanup –Computer and software issues –Change final session from 11/29 to 12/1 (Thursday instead of Tuesday) –Change schedule: Excel NEXT session
3
Today... –What we did last week, and why it was unrealistic –What does “data cleaning” mean? –How to generate a variable –How to manipulate the data in your new variable –How to label variables and otherwise document your work –Examples
7
Last time… What was unrealistic? –The dataset came as a Stata.dta file –The variables were ready to analyze –Most variables were labeled
8
Last time… I.e. – The data was “clean”
9
How your data will arrive –On paper forms –In a text file (comma or tab delimited) –In Excel –In Access –In another data format (SAS, etc.)
10
Importing into Stata Options: –Cut and Paste –insheet, infile, fdause, other flexible Stata commands –A convenience program like “Stat/Transfer”
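A minimal sketch of a text-file import, assuming a comma-delimited file. The file name example_raw.csv and its columns are hypothetical; the file is written by the example itself only so that it runs on its own. It uses import delimited, the newer counterpart to insheet (Stata 13 and later).

```stata
* Write a tiny comma-delimited file (hypothetical contents) so the example is self-contained
tempname fh
file open `fh' using "example_raw.csv", write text replace
file write `fh' "id,sex,gestage" _n
file write `fh' "1,m,38" _n
file write `fh' "2,f,40" _n
file close `fh'

* Read the file into Stata, taking variable names from row 1
import delimited using "example_raw.csv", varnames(1) clear

* Check that the import worked by looking at the result
describe
list
```

In real use the file already exists, so only the import delimited line and the checks after it are needed.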
11
Importing into Stata Make sure it worked –Look at the data
12
Importing into Stata Example – neonatal opiate withdrawal data
13
Exploring your data Figure out what all those variables mean Options –browse, describe, summarize, list in Stata –Refer to a data dictionary –Refer to a data collection form –Guess, or ask the person who gave it to you
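The exploration commands can be tried on a small made-up dataset. This is only a sketch: the variables (id, sex, gestage) and the plausibility bounds of 20 and 45 weeks are assumptions chosen for illustration, not values from the lecture data.

```stata
* Hypothetical four-row dataset entered inline
clear
input id str1 sex gestage
1 "m" 38
2 "f" 0
3 "f" 60
4 "m" 39
end

* Names, storage types, and labels of every variable
describe

* Min and max expose implausible values such as 0 and 60
summarize gestage

* Every distinct value of a coded variable, including missing
tabulate sex, missing

* Show the suspicious rows (missing values would also satisfy gestage > 45)
list if gestage < 20 | gestage > 45
```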
14
Exploring your data Example: Neonatal opiate withdrawal data
15
Exploring your data Example: Neonatal opiate withdrawal data Problems arise… –Sex is m/f, not 1/0 –Gestational age has nonsense values (0, 60) –Breastfeeding has a bunch of weird text values –Drug variables coded y or blank –Many variable names are obscure
16
Cleaning your data You must “clean” your data so it is ready to analyze.
17
Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data and outliers –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc.) –Derive new variables –Rename variables (with common sense, or with a consistent scheme) –Label variables –Label the VALUES of coded variables
18
Cleaning your data The importance of documentation –Retracing your steps Document every step using a “do” file
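A do file that documents every step might be laid out roughly as follows. This is a sketch under stated assumptions: all file names are hypothetical, and the inline input block stands in for a raw dataset that would normally already exist on disk.

```stata
* clean_example.do (hypothetical name): every cleaning step lives in this file

* Close any log left open by an earlier run, then start a fresh text log
capture log close
log using "clean_example.log", replace text

* Stand-in for the raw Stata-format file, created here so the example runs alone
clear
input id str1 sex
1 "m"
2 "f"
end
save "example_raw.dta", replace

* Always start from the raw file and never overwrite it
use "example_raw.dta", clear

* Cleaning steps go here, one documented command at a time
generate byte male = (sex == "m")
label variable male "Male sex (1 = yes, 0 = no)"

* Save the cleaned data under a different name
save "example_clean.dta", replace

log close
```

Rerunning the do file from the top reproduces the clean file exactly, which is what makes the steps retraceable.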
19
Data cleaning Basic skill 1 – make a new variable
Creating new variables:
    generate newvar = expression
An expression can be:
–A number (constant): generate allzeros = 0
–A variable: generate ageclone = age
–A function: generate agesqrt = sqrt(age)
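The three kinds of expression can be run together on a small made-up dataset. The ages below and the extra variable age_decades are assumptions added for illustration.

```stata
* Hypothetical three-row dataset
clear
input id age
1 25
2 49
3 64
end

* A constant: every observation gets the same value
generate allzeros = 0

* A variable: an exact copy of age
generate ageclone = age

* A function of a variable
generate agesqrt = sqrt(age)

* Any arithmetic expression also works (illustrative extra)
generate age_decades = age / 10

list
```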
20
Data cleaning Basic skill 1 – make a new variable
Getting rid of a variable:
    drop var
Getting rid of observations:
    drop if boolean exp
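A short sketch of both forms of drop on made-up data. The keep command, which is not on the slide, is shown as the mirror image of drop; the variable names and values are hypothetical.

```stata
* Hypothetical three-row dataset with one missing age
clear
input id age str1 sex
1 25 "m"
2  . "f"
3 64 "f"
end

* Remove a whole variable (column)
drop sex

* Remove observations (rows) that meet a condition
drop if missing(age)

* keep retains only the observations that meet the condition
keep if age > 30

list
```

These commands change only the data in memory; the file on disk is untouched until it is saved.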
21
Data cleaning Basic skill 2 – manipulating the values
Changing the values of a variable:
    replace var = exp [if boolean exp]
A boolean expression evaluates to true or false for each observation
22
Data cleaning Basic skill 2 – manipulating the values
Examples:
    generate male = 0
    replace male = 1 if sex=="male"

    generate ageover50 = 0
    replace ageover50 = 1 if age>50

    generate complexvar = age
    replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
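The generate/replace pattern has one trap worth seeing in action: Stata treats a numeric missing value (.) as larger than any number, so age>50 is true when age is missing. The sketch below uses made-up data; the variable ageover50_b is a hypothetical name for a one-line alternative.

```stata
* Hypothetical dataset in which one age is missing
clear
input id age str6 sex
1 25 "male"
2 64 "female"
3  . "male"
end

generate male = 0
replace male = 1 if sex == "male"

* Missing age counts as greater than 50, so id 3 is flagged here
generate ageover50 = 0
replace ageover50 = 1 if age > 50

* Restore missing where age itself is unknown
replace ageover50 = . if missing(age)

* One-line alternative: a true/false expression is 1/0, left missing when age is missing
generate ageover50_b = (age > 50) if !missing(age)

list
```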
23
Data cleaning Basic skill 2 – manipulating the values
Logical operators for boolean expressions:
English                   Stata
Equal to                  ==
Not equal to              !=, ~=
Greater than              >
Greater than/equal to     >=
Less than                 <
Less than/equal to        <=
And                       &
Or                        |
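The operators can be tried with count, which reports how many observations satisfy a boolean expression. A sketch on made-up data follows; the variables and values are hypothetical.

```stata
* Hypothetical dataset with one missing age
clear
input id age male
1 25 1
2 64 0
3 40 1
4  . 0
end

* And: both conditions must hold
count if age >= 40 & male == 1

* Or: either condition is enough
count if age < 30 | male == 0

* Not equal, guarded against missing values
count if age != 64 & !missing(age)

* & is evaluated before |, so parentheses make the intended grouping explicit
count if (age > 30 | male == 1) & !missing(age)
```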
24
Data cleaning Basic skill 2 – manipulating the values
Mathematical operators:
English                   Stata
Add                       +
Subtract                  -
Multiply                  *
Divide                    /
To the power of           ^
Natural log of            ln(expression)
Base 10 log of            log10(expression)
Etcetera…
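A brief sketch of the mathematical operators in use. The variables weight_kg and height_m, and the body mass index calculation, are hypothetical illustrations and not part of the lecture data.

```stata
* Hypothetical dataset; the weight of 0 is deliberate nonsense
clear
input id weight_kg height_m
1  70 1.75
2 100 1.80
3   0 1.60
end

* Division and exponentiation: ^ is evaluated before /
generate bmi = weight_kg / height_m^2

* Natural and base 10 logs; the log of 0 is returned as missing
generate lnweight = ln(weight_kg)
generate log10weight = log10(weight_kg)

list
```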
25
Data cleaning Basic skill 2 – manipulating the values
Another way to manipulate data:
    recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression]
More complicated, but more flexible command than replace
26
Data cleaning Basic skill 2 – manipulating the values
Examples:
    generate male = 0
    recode male 0=1 if sex=="male"

    generate raceethnic = race
    recode raceethnic 1=6 if ethnic=="hispanic"
    (equivalent to: replace raceethnic = 6 if ethnic=="hispanic" & race==1)

    generate tertilescac = cac
    recode tertilescac min/54=1 55/82=2 83/max=3
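The tertile example can also be written with parenthesized rules and the generate() option, which creates the new variable in one step and leaves the original untouched. This is a sketch on made-up values; only the cut points 54 and 82 come from the slide.

```stata
* Hypothetical cac values, including one missing
clear
input id cac
1   0
2  54
3  60
4  82
5 300
6   .
end

* min and max refer to the smallest and largest nonmissing values,
* so the missing observation stays missing
recode cac (min/54 = 1) (55/82 = 2) (83/max = 3), generate(tertilescac)

tabulate tertilescac, missing
```

With these integer ranges, a non-integer value such as 54.5 would match no rule and be left unchanged, so the ranges should suit the data being recoded.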
27
Data cleaning Basic skill 3 – labeling variables
You can label:
–A dataset
    label data "label"
–A variable
    label var varname "label"
–Values of a variable (2-step process)
    label define labelname value1 "label1" [value2 "label2"…]
    label values varname labelname
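All three kinds of label can be applied to a small made-up dataset. The label text and the value label name yesno below are hypothetical choices.

```stata
* Hypothetical two-row dataset
clear
input id male
1 1
2 0
end

* Label the dataset as a whole
label data "Example cleaned dataset (hypothetical)"

* Label a variable
label variable male "Male sex"

* Label the values: define the mapping, then attach it to the variable
label define yesno 0 "No" 1 "Yes"
label values male yesno

* Output now shows No/Yes in place of 0/1
tabulate male
describe
```

A value label such as yesno can be attached to any number of 1/0 variables once it has been defined.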
28
Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc.) –Derive new variables –Rename variables (with common sense, or with a consistent scheme) –Label variables –Label the VALUES of coded variables
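Several of these tasks can be sketched in one pass. Everything here is an assumption made for illustration: the variable names, the 20 to 45 week plausibility range, and above all the decision to read a blank drug field as "no" when it could equally mean "unknown".

```stata
* Hypothetical dataset with nonsense gestational ages and a y/blank drug variable
clear
input id gestage str1 methadone
1 38 "y"
2  0 ""
3 60 "y"
4 39 ""
end

* Nonsense values: set gestational ages outside the assumed range to missing
replace gestage = . if gestage < 20 | gestage > 45

* Dichotomous 1/0 coding: "y" becomes 1, blank becomes 0 (assumes blank means no)
generate byte methadone01 = (methadone == "y")

* Rename with a consistent scheme, then label
rename gestage gest_age_wk
label variable gest_age_wk "Gestational age (weeks)"
label variable methadone01 "Methadone exposure (1 = yes, 0 = no)"

list
```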
29
Data cleaning Example: Neonatal opiate withdrawal data
30
Data cleaning At the end of the day you have: –1 raw data file, original format –1 raw data file, Stata format –1 do file that cleans it up –1 log file that documents the cleaning –1 clean data file, Stata format
31
Summary Data cleaning –ALWAYS necessary to some extent –ALWAYS use a do file, don’t overwrite original data –Check your work –Watch out for missing values –Label as much as you can
32
Lab this week It’s long It’s important It’s hard But this year, we have 2 sessions for it! Email lab to bio212ucsf@yahoo.com Due 10/11 at midnight
33
Preview of next week… Using Excel –What is it good for? –Formulas –Designing a good spreadsheet –Formatting