Generating new variables and manipulating data with STATA Biostatistics 212 Session 2.


Today...
What we did last time, and why it was unrealistic
What does “data cleaning” mean?
Do and Log files
How to generate a variable
How to manipulate the data in your new variable
How to label variables and otherwise document your work
Examples

Last time…
The dataset was “clean”
Your data will not come pre-entered into STATA!

How your data will arrive
On paper forms
In a text file (comma or tab delimited)
In Excel
In Access
In another data format (SAS, etc.)

Your first task
Import into STATA
Options:
–Cut and paste
–insheet, infile, and other flexible Stata commands
–A convenience program like “Stat/Transfer”
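A minimal sketch of the insheet route, assuming a comma-delimited text file. The file name rawdata.csv and its two variables are hypothetical; the first few lines only build that file so the example runs on its own. In newer Stata releases, import delimited does the same job as insheet.

```stata
* Build a tiny comma-delimited file (hypothetical stand-in for your raw data)
clear
input id age
1 34
2 51
end
outsheet using "rawdata.csv", comma replace

* Import the text file into Stata, replacing whatever is in memory
insheet using "rawdata.csv", comma clear
describe
```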

Your second task
Figure out what all those variables mean
Options:
–browse, describe, summarize, list in STATA
–Refer to a data dictionary
–Refer to a data collection form
–Guess, or ask the person who gave it to you
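A quick sketch of the Stata options listed above. It assumes the auto dataset that ships with Stata as a stand-in for your own data; the variables named in the list command belong to that dataset, not to the course data.

```stata
* Load a dataset that ships with Stata (stand-in for your own file)
sysuse auto, clear

* Variable names, storage types, and any labels already attached
describe

* Number of observations, mean, SD, min, and max for every numeric variable
summarize

* Print the first five observations for a few variables
list make price mpg in 1/5

* browse opens the spreadsheet-style Data Browser (interactive sessions only)
* browse
```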

Your second task
Example: Neonatal opiate withdrawal data
Problems arise…
–Sex is m/f, not 1/0
–Gestational age has nonsense values (0, 60)
–Breastfeeding has a bunch of weird text values
–Drug variables coded y or blank
–Many variable names are obscure

Your third task You must “clean” your data so it is ready to analyze.

Cleaning your data
Cleaning tasks
–Check for consistency and clean up nonsense data
–Deal with missing values
–Code all dichotomous variables 1/0
–Categorize variables meaningfully (for Table 1, etc.)
–Derive new variables
–Rename variables (with common sense, or with a consistent scheme)
–Label variables
–Label the VALUES of coded variables
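Several of these tasks can be sketched on toy data that mimics the problems described for the neonatal example. Everything here is an assumption made for illustration: the variable names, the four made-up observations, and the plausible range of 20 to 45 weeks used to flag nonsense gestational ages.

```stata
* Toy data with m/f sex, nonsense gestational ages, and a y/blank drug variable
clear
input str1 sex gestage str1 methadone
"m" 38 "y"
"f" 0  ""
"f" 60 "y"
"m" 40 ""
end

* Code sex as 1/0 (left missing if sex is blank)
generate male = (sex == "m") if sex != ""

* Set implausible gestational ages to missing (assumed range: 20 to 45 weeks)
replace gestage = . if gestage < 20 | gestage > 45

* Code the y/blank drug variable as 1/0
generate methadone01 = (methadone == "y")

* Give an obscure variable a clearer name
rename gestage gest_age_wk

list
```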

Cleaning your data
The importance of documentation
–Retracing your steps
Document every step using a “do” file

What is a “do” file?
A text file containing a list of Stata commands
Create and edit using Stata’s do-file editor
–Open with button or menu
–Save file in the normal way: filename.do
Run the do file by
–The menus: File/Do…
–The button on the editor
–A Stata command: do "pathname/filename.do"

What is a “do” file? Example using Lab 1 data

What is a “do” file? But the results are not documented anywhere. For this, we need a “log” file

What is a “log” file?
A text file that captures everything that occurs in Stata’s Results window
Two formats
–Special Stata formatted log files (.smcl) or regular text (.log)
Open using:
–The log button (4th from the right)
–log using "pathname/filename.log", replace
Close using:
–The log button
–log close
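A short sketch of opening and closing a log by command. The file name session2.log is hypothetical, and the auto dataset that ships with Stata stands in for real data. The capture log close line is an addition not on the slide: it avoids an error if a log is still open from an earlier run.

```stata
* Close any log that is still open; capture suppresses the error if none is
capture log close

* Start a plain-text log, overwriting an existing file of the same name
log using "session2.log", replace

sysuse auto, clear
summarize price mpg

* Stop logging
log close
```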

What is a “log” file? Example using Lab 1 data

Using “do” and “log” files together The first command in any do file should open a log file The last command should close it

Using “do” and “log” files together
Open the dataset you are analyzing WITHIN the log file
–use "pathname/filename.dta", clear
–And clear at the end, if you want
Make the do file run quickly and automatically
–set more off at the beginning
–set more on at the end

Using “do” and “log” files together
For data cleaning do/log files, save the clean data set within the do file
–save "pathname/file.dta", replace
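Putting these pieces together, a cleaning do file might be laid out roughly as follows. This is a sketch, not the course template: the file names are hypothetical, the auto dataset that ships with Stata stands in for the raw data, and the single cleaning step is only a placeholder.

```stata
* cleaning.do: sketch of a paired do/log file (file names are hypothetical)

* Close a log left open by an earlier run, then open the log for this run
capture log close
log using "cleaning.log", replace

* Keep output from pausing at each screenful
set more off

* Open the data inside the log; stand-in for: use "pathname/filename.dta", clear
sysuse auto, clear

* Cleaning commands go here; one placeholder step
generate expensive = (price > 6000) if !missing(price)
label variable expensive "Price above 6000"

* Save the clean dataset from within the do file
save "auto_clean.dta", replace

set more on
log close
```

Because every step lives in the file, rerunning it reproduces both the clean dataset and the log that documents how it was made.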

Data cleaning
Basic skill 1 – make a new variable
Creating new variables
generate newvar = expression
An expression can be:
–A number (constant): generate allzeros = 0
–A variable: generate ageclone = age
–A function: generate agesqrt = sqrt(age)
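The three kinds of expression can be tried on a toy dataset. The three ages below are made up for illustration.

```stata
* Toy data: one variable, three observations
clear
input age
25
49
64
end

* A constant, a copy of an existing variable, and a function of a variable
generate allzeros = 0
generate ageclone = age
generate agesqrt = sqrt(age)

list
```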

Data cleaning
Basic skill 1 – make a new variable
Getting rid of a variable
drop var
Getting rid of observations
drop if boolean exp
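Both forms of drop, sketched on the auto dataset that ships with Stata (an assumption; any dataset in memory would do). The price cutoff is arbitrary.

```stata
sysuse auto, clear

* Remove a variable (a column)
drop headroom

* Remove observations (rows) that satisfy a boolean expression
drop if price > 10000

* How many observations remain
count
```

Dropped data are gone from memory but not from the file on disk, which is one more reason to do this inside a do file that starts from the raw data.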

Data cleaning
Basic skill 2 – manipulating the values
Changing the values of a variable
replace var = exp [if boolean exp]
A boolean expression evaluates to true or false for each observation

Data cleaning
Basic skill 2 – manipulating the values
Examples
generate male = 0
replace male = 1 if sex=="male"
generate ageover50 = 0
replace ageover50 = 1 if age>50
generate complexvar = age
replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
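A runnable version of the first two examples, on made-up data. One caution it illustrates: Stata treats a missing numeric value as larger than any number, so age>50 is true when age is missing. The last two commands guard against that; they are an addition to the slide's example.

```stata
* Toy data; the third observation has a missing age
clear
input age str6 sex
25 "male"
62 "female"
.  "male"
end

generate male = 0
replace male = 1 if sex == "male"

* Exclude missing ages explicitly, then mark them missing in the new variable
generate ageover50 = 0
replace ageover50 = 1 if age > 50 & !missing(age)
replace ageover50 = . if missing(age)

list
```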

Data cleaning
Basic skill 2 – manipulating the values
Logical operators for boolean expressions:
English                    Stata
Equal to                   ==
Not equal to               !=, ~=
Greater than               >
Greater than/equal to      >=
Less than                  <
Less than/equal to         <=
And                        &
Or                         |
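A few boolean expressions in use, sketched with count on the auto dataset that ships with Stata (an assumption for illustration). Note the doubled equals sign for comparison, and that "not equal" is also true for missing values unless they are excluded.

```stata
sysuse auto, clear

* And: both conditions must hold
count if price >= 5000 & foreign == 1

* Or: either condition may hold
count if mpg < 20 | rep78 <= 2

* Not equal; this also counts observations where rep78 is missing
count if rep78 != 3

* Same, excluding missing values
count if rep78 != 3 & !missing(rep78)
```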

Data cleaning
Basic skill 2 – manipulating the values
Mathematical operators:
English                    Stata
Add                        +
Subtract                   -
Multiply                   *
Divide                     /
To the power of            ^
Natural log of             ln(expression)
Base 10 log of             log10(expression)
Etcetera…
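The operators can be checked interactively with display, which evaluates an expression and prints the result. The numbers are arbitrary. One thing to be aware of: in Stata, log() is a synonym for ln(), not a base 10 log.

```stata
* Arithmetic follows the usual precedence rules; parentheses override them
display 7 + 2 * 3
display (7 + 2) * 3

* Powers and logs
display 2^10
display ln(100)
display log10(100)
```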

Data cleaning
Basic skill 2 – manipulating the values
Another way to manipulate data
recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression]
More complicated, but more flexible command than replace

Data cleaning
Basic skill 2 – manipulating the values
Examples
generate male = 0
recode male 0=1 if sex=="male"
generate raceethnic = race
recode raceethnic 1=6 if ethnic=="hispanic"
(replace raceethnic = 6 if ethnic=="hispanic" & race==1)
generate tertilescac = cac
recode tertilescac min/54=1 55/82=2 83/max=3
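A runnable sketch of the tertile example, using made-up cac values and the cutpoints from the slide. The rules are written with parentheses, the form shown in current Stata documentation. With these rules a non-integer value between two ranges, such as 54.5, would match no rule and be left unchanged, so the approach assumes integer scores.

```stata
* Toy data: made-up scores that straddle the cutpoints
clear
input cac
0
40
54
55
82
83
400
end

* Copy first, so the original variable is preserved
generate tertilescac = cac
recode tertilescac (min/54 = 1) (55/82 = 2) (83/max = 3)

* Check the result against the original values
tabulate cac tertilescac
```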

Data cleaning
Basic skill 3 – labeling variables
You can label:
–A dataset
label data "label"
–A variable
label var varname "label"
–Values of a variable (2-step process)
label define labelname value1 "label1" [value2 "label2" …]
label values varname labelname
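All three kinds of label, sketched on a two-observation toy dataset. The label text and the label name malelbl are made up for illustration.

```stata
clear
input male age
0 30
1 45
end

* Label the dataset and a variable
label data "Toy example dataset"
label variable male "Sex of participant"

* Value labels, step 1: define a named set of value-to-text mappings
label define malelbl 0 "Female" 1 "Male"

* Value labels, step 2: attach that set to a variable
label values male malelbl

describe
tabulate male
```

Because the mapping is defined separately from the variable, one set of value labels (a yes/no set, for example) can be attached to many variables.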

Data cleaning Example: Neonatal opiate withdrawal data

Data cleaning
At the end of the day you have:
–1 raw data file, original format
–1 raw data file, Stata format
–1 do file that cleans it up
–1 log file that documents the cleaning
–1 clean data file, Stata format

Summary
Do files and Log files
–Always pair them
–Do everything inside the do file so you can run it repeatedly and easily while you do an analysis
–Insert comments to document what you did

Summary, cont.
Generating and manipulating variables
–Absolutely necessary skill for using Stata
–Always check your work
–Watch out for missing values
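One way to check a newly generated variable is sketched below, assuming the auto dataset that ships with Stata and an arbitrary price cutoff. The idea is to tabulate the new variable with missing values shown, compare it against the variable it came from, and state the expected relationship with assert, which stops the do file if the claim is false.

```stata
sysuse auto, clear

* New 1/0 variable, left missing wherever price is missing
generate expensive = (price > 6000) if !missing(price)

* Show missing values in the table instead of silently dropping them
tabulate expensive, missing

* Compare against the source variable within each category
bysort expensive: summarize price

* State the rule explicitly; assert halts with an error if it does not hold
assert expensive == 1 if price > 6000 & !missing(price)
assert expensive == 0 if price <= 6000
```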

Summary, cont Labeling –Label as much as you can

Lab this week
It’s long
It’s important
It’s hard
I compromised – the do file “template”
The next ones will be shorter and easier

Preview of next week…
Using Excel
–What is it good for?
–Formulas
–Designing a good spreadsheet
–Formatting

See you on Thursday!
Lab 2 due 10/19
Bring a floppy disc to all labs!