Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3.

Housekeeping
Lab 1 handed back today
–Think of red ink as teaching points, not penalties…
Do and log files
–Understand each command!
–Order them appropriately

The .do file template
–Tell STATA where to look for things and where to put things
–Run correct version of Stata
–Stop STATA from prompting you to push a button to continue
–Tell STATA to clear any datasets in memory and increase its mem capacity
–Since your do file may not be perfect, tell STATA to close any logs that are open when you try to run your do file
–Tell STATA to create a log of your output for you and what you’re going to call that log. Tell it to overwrite it each time
–Stick in some comments to remind you what this do file is for
–Tell STATA what dataset to work on
–Leave some SPACE for putting in analysis commands you want to keep
–Lastly, tell STATA to close the log and go back to its usual “more” mode

    cd "C:\data\biostat212\"
    version 11
    set more off
    clear
    set memory 10m
    capture log close
    log using "name of your log.log", replace
    /* here are my comments */
    use "name of your dataset", clear

    summarize this
    browse that
    tabulate this

    log close
    set more on

New issues
“ ” (curly quotes) vs. " " (straight quotes)
Other?

Today...
What we did in Lab 1, and why it was unrealistic
What does “data cleaning” mean?
Importing data into Stata
How to generate a variable
How to manipulate the data in your new variable
How to label variables and otherwise document your work
Examples

Last time…
What was unrealistic?
–The dataset came as a Stata .dta file
–The variables were ready to analyze
–Most variables were labeled

Last time… i.e. – The data was “clean”

How your data will arrive
On paper forms
In a text file (comma or tab delimited)
In Excel
In Access
In another data format (SAS, etc.)

Importing into Stata
Options:
–Copy and Paste
–insheet, infile, fdause, other flexible Stata commands
–A convenience program like “Stat/Transfer”

Importing into Stata Make sure it worked –Look at the data

Importing into Stata
Demo – neonatal opiate withdrawal data
–Import with cut and paste from Excel
–Import with insheet (save as .csv file first)
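The insheet route can be sketched as follows. This is a minimal sketch, not the lecture's demo: the neonatal file is not available here, so Stata's bundled auto dataset is written out as a .csv to stand in for a file saved from Excel, and the file name auto_raw.csv is hypothetical. Later Stata releases (13 and up) also offer import delimited for the same job.

```stata
* Create a comma-delimited file to import (stand-in for a .csv saved from Excel)
sysuse auto, clear
outsheet make price mpg rep78 using "auto_raw.csv", comma replace

* Import the .csv; the clear option drops whatever is in memory first
insheet using "auto_raw.csv", comma clear

* Make sure it worked: look at the data
describe
list in 1/5
```

Checking describe right after the import shows whether a column that should be numeric came in as a string, which is a common sign of stray text values.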

Exploring your data
Figure out what all those variables mean
Options
–browse, describe, summarize, list in STATA
–Refer to a data dictionary
–Refer to a data collection form
–Guess, or ask the person who gave it to you
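A first look at an unfamiliar dataset might run along these lines. The sketch uses Stata's bundled auto dataset as a stand-in for the lecture data; codebook is an addition that the slide does not list.

```stata
sysuse auto, clear

* Variable names, storage types, and any labels already attached
describe

* Range and number of non-missing observations for numeric variables
summarize price mpg rep78

* Frequency table that also shows missing values
tabulate rep78, missing

* Eyeball the first few rows
list make price rep78 in 1/10

* More detail on one variable, including its missing count
codebook rep78
```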

Exploring your data
Demo: Neonatal opiate withdrawal data
Problems arise…
–Sex is m/f, not 1/0
–Gestational age has nonsense values (0, 60)
–Breastfeeding has a bunch of weird text values
–Drug variables coded y or blank
–Many variable names are obscure

Cleaning your data You must “clean” your data so it is ready to analyze.

Cleaning your data
What does the variable measure?
–rename and/or label var so it’s clear
Find nonsense values and outliers
–recode as missing or track down real value?
Deal with missing values
–Too many values missing in some subjects? Coding consistent?
–drop variable or observation?
Categorize as needed
–generate a new numeric variable
–recode (dichotomous variables coded as 1/0, watch missing values)
–label define and then label values
–Check
    tab oldvar newvar, missing
    bysort catvar: sum contvar
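The checklist above can be walked through end to end on a small example. This is a sketch under assumptions: it uses Stata's bundled auto dataset rather than the lecture data, and the new names (repair, goodrepair, yesno) and the cutoff of 4 are hypothetical.

```stata
sysuse auto, clear

* What does the variable measure? Rename and label so it is clear
rename rep78 repair
label var repair "Repair record, 1978"

* Find nonsense values and outliers
summarize mpg, detail

* Deal with missing values: how many are there?
count if missing(repair)

* Categorize: new numeric 1/0 variable that stays missing when repair is missing
generate goodrepair = repair >= 4 if !missing(repair)

* Label the values (two steps)
label define yesno 0 "No" 1 "Yes"
label values goodrepair yesno

* Check the new variable against the old one
tab repair goodrepair, missing
bysort goodrepair: summarize mpg
```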

Cleaning your data
The importance of documentation
–Retracing your steps
Document every step using a “do” file

Data cleaning
Basic skill 1 – Making a new variable
Creating new variables
    generate newvar = expression
An “expression” can be:
–A number (constant)
    generate allzeros = 0
–A variable
    generate ageclone = age
–A function
    generate agesqrt = sqrt(age)

Data cleaning
Basic skill 2 – Manipulating values of a variable
Changing the values of a variable
    replace var = exp [if boolean_expression]
A boolean expression evaluates to true or false for each observation

Data cleaning
Basic skill 2 – Manipulating values of a variable
Examples
    generate bmi = weight/(height^2)
    generate male = 0
    replace male = 1 if sex=="male"
    generate ageover50 = 0
    replace ageover50 = 1 if age>50
    generate complexvar = age
    replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
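One trap in the generate/replace pattern is worth a sketch: Stata treats a numeric missing value (.) as larger than any number, so a condition such as age>50 is true when age is missing. The example below shows the effect and two ways around it, using the bundled auto dataset (rep78 has some missing values) as a stand-in; the variable names highrep_a, highrep_b, and highrep_c are hypothetical.

```stata
sysuse auto, clear

* Two-step pattern as on the slide: observations with missing rep78 end up coded 1
generate highrep_a = 0
replace highrep_a = 1 if rep78 > 3

* Safer: leave the new variable missing wherever the source is missing
generate highrep_b = 0 if !missing(rep78)
replace highrep_b = 1 if rep78 > 3 & !missing(rep78)

* One-line equivalent of the safer version
generate highrep_c = rep78 > 3 if !missing(rep78)

* Compare: the rows where highrep_b is missing are coded 1 in highrep_a
tab highrep_a highrep_b, missing
```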

Data cleaning
Basic skill 2 – Manipulating values of a variable
Logical operators for boolean expressions:

    English                   Stata
    Equal to                  ==
    Not equal to              !=, ~=
    Greater than              >
    Greater than/equal to     >=
    Less than                 <
    Less than/equal to        <=
    And                       &
    Or                        |
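The operators in the table combine as in the short sketch below, which again uses the bundled auto dataset as a stand-in. The thresholds are arbitrary; count simply reports how many observations satisfy each boolean expression.

```stata
sysuse auto, clear

* Equal to
count if foreign == 1

* Not equal to (missing values also satisfy != 3, so exclude them explicitly)
count if rep78 != 3 & !missing(rep78)

* And: both conditions must hold
count if mpg >= 25 & foreign == 1

* Or, grouped with parentheses, then combined with And
count if (mpg > 30 | price < 4000) & !missing(rep78)
```

Parentheses are worth writing even when they are not strictly required, since they make the intended grouping of & and | explicit.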

Data cleaning
Basic skill 2 – Manipulating values of a variable
Mathematical operators:

    English                   Stata
    Add                       +
    Subtract                  -
    Multiply                  *
    Divide                    /
    To the power of           ^
    Natural log of            ln(expression)
    Base 10 log of            log10(expression)
    Etcetera…

Data cleaning
Basic skill 2 – Manipulating values of a variable
Another way to manipulate data
    recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean_expression]
More complicated, but more flexible command than replace

Data cleaning
Basic skill 2 – Manipulating values of a variable
Examples
    generate male = 0
    recode male 0=1 if sex=="male"
    generate female = male
    recode female 1=0 0=1

Data cleaning
Basic skill 2 – Manipulating values of a variable
Examples
    generate raceethnic = race
    recode raceethnic 1=6 if ethnic=="hispanic"
    (replace raceethnic = 6 if ethnic=="hispanic" & race==1)
    generate tertilescac = cac
    recode tertilescac min/54=1 55/82=2 83/max=3
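recode can also write its result to a new variable through its generate() option, which avoids the separate generate step and leaves the original untouched. The sketch below assumes the bundled auto dataset as a stand-in; the cutoffs are arbitrary illustrations, not true tertiles, and mpgcat is a hypothetical name.

```stata
sysuse auto, clear

* Recode a continuous variable into three categories in a new variable
recode mpg (min/19=1) (20/24=2) (25/max=3), generate(mpgcat)

* Check: each category should cover the intended range of the original
bysort mpgcat: summarize mpg
tab mpgcat, missing
```

Ranges written this way are matched against the original values, and min and max refer to the smallest and largest non-missing values, so missing values are left as missing.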

Data cleaning
Basic skill 3 – Getting rid of variables/observations
Getting rid of a variable
    drop var
Getting rid of observations
    drop if boolean_expression
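Both forms of drop are shown in the sketch below, with count before and after to confirm how many observations were removed. It assumes the bundled auto dataset as a stand-in for real data.

```stata
sysuse auto, clear

* Get rid of variables that will not be analyzed
drop trunk turn

* Get rid of observations with a missing repair record
count
drop if missing(rep78)
count
```

drop changes only the copy of the data in memory; the file on disk is unaffected until a save command is run, which is one reason to save cleaned data under a new name.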

Data cleaning
Basic skill 4 – Labeling things
You can label:
–A dataset
    label data "label"
–A variable
    label var varname "label"
–Values of a variable (2-step process)
    label define labelname value1 "label1" [value2 "label2"…]
    label values varname labelname

    label define caccatlabel 0 "0" 1 "1-100" 2 " " 3 ">400"
    label values caccat caccatlabel
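All three kinds of label can be tried on a small example. This is a sketch using the bundled auto dataset; the variable expensive, the label name expensivelbl, and the 6000 cutoff are hypothetical.

```stata
sysuse auto, clear

* Label the dataset
label data "Auto data, labeled for illustration"

* Make a 1/0 variable and label the variable itself
generate expensive = price > 6000 if !missing(price)
label var expensive "Price above 6000"

* Label its values: define the mapping, then attach it
label define expensivelbl 0 "No" 1 "Yes"
label values expensive expensivelbl

* Check: tab now shows the text labels, label list shows the mapping
tab expensive
label list expensivelbl
```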

Data cleaning
Basic skill 5 – Dealing with missing values
Missing values are important, easy to forget
–. for numbers
–"" for text
–tab var1 var2, missing
–Watch the total “n” for tab, summarize commands, regression analyses, etc.
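A few ways to keep missing values in view are sketched below on the bundled auto dataset. The rule that treats mpg above 40 as a data-entry error is purely hypothetical, included only to show how a nonsense value is set to missing.

```stata
sysuse auto, clear

* Count missing values: numeric (.) and text ("")
count if missing(rep78)
count if make == ""

* Show missing values in a cross-tabulation
tab rep78 foreign, missing

* Watch the n: the Obs column differs between variables when one has missing values
summarize rep78 price

* Set a nonsense value to missing (hypothetical rule)
replace mpg = . if mpg > 40 & !missing(mpg)
count if missing(mpg)
```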

Data cleaning Demo: Neonatal opiate withdrawal data

Cleaning your data
What does the variable measure?
–rename or label var so it’s clear
Find nonsense values and outliers
–recode as missing or track down real value?
Deal with missing values
–Too many? Coding consistent?
–drop variable or observation?
Categorize as needed
–generate a new numeric variable
–recode (dichotomous variables coded as 1/0, watch missing values)
–label define and then label values
–Check
    tab oldvar newvar, missing
    bysort catvar: sum contvar

Data cleaning
At the end of the day you have:
–1 raw data file, original format
–1 raw data file, Stata format
–1 do file that cleans it up
–1 log file that documents the cleaning
–1 clean data file, Stata format
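A cleaning do file that produces the log and the clean data file, without touching the raw file, might look like the sketch below. The file names clean_auto.log and auto_clean.dta are hypothetical, and the bundled auto dataset stands in for the raw Stata-format file.

```stata
version 11
set more off
capture log close
log using "clean_auto.log", replace

* Start from the raw Stata-format file (bundled auto data as a stand-in)
sysuse auto, clear

* Cleaning commands go here
generate highrep = rep78 > 3 if !missing(rep78)
label var highrep "Repair record above 3"

* Save the clean copy under a NEW name so the raw file is never overwritten
save "auto_clean.dta", replace

log close
set more on
```

Because the do file starts from the raw data every time, rerunning it reproduces the clean file exactly, and the log records each step.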

Summary
Data cleaning
–ALWAYS necessary to some extent
–ALWAYS use a do file
–NEVER overwrite original data
–Check your work
–Watch out for missing values
–Label as much as you can

Lab this week
It’s long
It’s hard
It’s important
lab to your section leader’s
Due at the beginning of lecture next week

Preview of next week…
Using Excel
–What is it good for?
–Formulas
–Designing a good spreadsheet
–Formatting