Computing for Research I Spring 2014 Primary Instructor: Elizabeth Garrett-Mayer Introduction to Stata February 19.

Stata
Stata is a powerful statistical package with
– smart data-management facilities
– a wide array of up-to-date statistical techniques
– an excellent system for producing publication-quality graphs
Stata is fast and easy to use.
Current version is Stata 12.
Stata vs. Stata SE
– “standard” Stata can handle up to 2047 variables
– SE can handle many more variables
– Number of observations is limited by your computer (up to 2 billion!)

Stata Interface
Multiple windows
– Results
– Review
– Variables
– Command
Other windows
– Data editor
– Data viewer
– Log
– ‘do’
– Graph

Stata Interface
Customizable windows
Can be resized
Edits to preferences are ‘remembered’
You can save (then load) different preferences.
Command line driven
But more recently, drop-down menus

Important Details
Stata is case sensitive!
Return means ‘run’: there is no little running man to click.
You cannot run commands while your data editor is open.
You need to ‘clear’ data before you bring in more data.
You can only have one dataset active at a time.
Save yourself some typing (and errors)
– Use the Variables window
– Use the Review window
Abbreviations work for commands and variable names!
– d instead of describe
– case instead of caseid
– Not always, but an abbreviation should work if it uniquely identifies the variable name or command
– Also true for some options
– See the Stata help files for how short you can go on abbreviations

Help!!
The most important part
Two interactive options:
– help ‘command’
– help ‘search’
Also LARGE pdfs that link from help files
Plus:
– advice
– link to Stata
– command line help
– findit
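A few ways of asking for help from the command line, as a minimal sketch. The topics looked up here (summarize, logistic regression, reshape) are arbitrary examples rather than ones named on the slide.

```stata
* Open the help file for a command you already know the name of
help summarize

* Keyword search when you do not know the command name
search logistic regression

* Broader search that also covers user-written additions
findit reshape

* Help file listing the immediate commands
help immediate
```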

No data?
There are lots of things you can do without data in Stata!
“Immediate” commands
– An immediate command is a command that obtains data not from the data stored in memory but from numbers typed as arguments.
– Immediate commands, in effect, turn Stata into a glorified hand calculator.

Some immediate commands
bitesti                Binomial probability test
cci, csi, iri, mcci    Tables for epidemiologists; see [ST] epitab
cii                    Confidence intervals for means, proportions, counts
prtesti                One- and two-sample tests of proportions
sampsi                 Sample size and power determination
sdtesti                Variance comparison tests
symmi                  Symmetry and marginal homogeneity tests
tabi                   One- and two-way tables of frequencies
ttesti                 Mean comparison tests
display                Displays simple calculations
See ‘help immediate’ for more information

Some examples
display 4.1-1.96*0.3
tabi … \ …
tabi … \ … , col
tabi … \ … , col row cell chi
cci …
cci … , exact
sampsi …

Some examples
bitesti …
ttesti …
ttesti … , uneq
ttesti …
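Immediate commands take their numbers as arguments, so a worked version needs values. The sketch below shows the argument order for each command; every number in it is made up for illustration and is not taken from the slides.

```stata
* Lower 95% confidence limit: estimate minus 1.96 standard errors
display 4.1 - 1.96*0.3

* 2x2 table typed row by row (rows separated by \), with column percentages
* and a chi-squared test; the counts are hypothetical
tabi 30 18 \ 38 14, col chi2

* Case-control table: exposed cases, unexposed cases, exposed controls,
* unexposed controls (hypothetical counts)
cci 10 20 30 40, exact

* Binomial test: 10 trials, 3 successes, null probability 0.5
bitesti 10 3 0.5

* One-sample t test: n, mean, sd, hypothesized mean
ttesti 24 62.6 15.8 75

* Two-sample t test with unequal variances: n1 mean1 sd1 n2 mean2 sd2
ttesti 20 20 5 32 15 4, unequal

* Sample size to compare proportions 0.25 and 0.40 with 80% power
sampsi 0.25 0.40, power(0.8)
```

None of these commands needs a dataset in memory, so they can be run in a freshly started Stata session.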

But most of the time, we have datasets
*.dta files are Stata datasets
To open:
– Option 1: use the “use” command:
  use "I:\MUSC Oncology\Cunningham, Joan\June2007\SCbcdata.dta"
– Option 2: menu-driven open: File > Open…
If you use Option 2, the associated command will appear in your Results window AND in your Review window.
If you use Option 2, consider cutting and pasting the command into your ‘do’ file for next time.

Other types of data?
Stata can import
– ASCII files
– SAS export files
– and a few others (that I have never heard of)
Two options:
– menu-driven: File > Import…
– the insheet command can be used for ASCII files
  insheet using sampledata.csv, comma
  insheet using sampledata.csv, tab
– for insheet, you can use any separator (use the delimiter("char") option)

Two notes on opening files
If you use the command line, you will have to either add clear at the end of the line to clear the current dataset, or type clear as a command before opening the new dataset
– insheet using sampledata.csv, comma clear
OR
– clear
– insheet using sampledata.csv, comma
You can use the cd command to tell Stata where to look for your file(s), instead of giving long path names. This is particularly helpful if you are merging files from the same directory
– cd "I:\Classes\StatComputingI"
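The import steps can be tried without the class files. This sketch first writes a small csv from Stata's bundled auto dataset (an assumption standing in for the course data) and then reads it back, showing why clear is needed on the second read. The file name sampledata.csv follows the slides; the semicolon file is hypothetical.

```stata
* Show the current working directory; use cd "some\path" to change it
pwd

* Create a small csv to practise on, from the bundled auto data
sysuse auto, clear
outsheet make price mpg using "sampledata.csv", comma replace

* First read: memory is emptied beforehand with a separate clear
clear
insheet using "sampledata.csv", comma

* Second read: data are already in memory, so clear goes on the same line
insheet using "sampledata.csv", comma clear

* Any other single-character separator can be named with delimiter()
outsheet make price mpg using "sampledata_semicolon.csv", delimiter(";") replace
insheet using "sampledata_semicolon.csv", delimiter(";") clear

describe
```

Later Stata releases (13 onward) also offer import delimited for the same job; insheet is used here because it is the command on the slides.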

Example: SC breast cancer registry data from 2004
All diagnoses of breast cancer in SC are recorded
Small version for class: N = 2633; 55 variables
Demographic and clinical information recorded
Let’s read it in and explore it
– use cd
– use insheet
– use ‘use’

Exploring your dataset
describe (can be abbreviated ‘d’)
– a very good idea to make sure things look right
– tells you about types of variables, number of observations and number of variables
codebook
– summary per variable
– useful for seeing the number of unique and missing values
sum
– statistical summary (N, mean, SD, etc.)
– only works on numerically coded variables
– sum, detail
inspect
– similar to codebook
– provides a rough histogram and counts of negative, positive and missing values
Note:
– all of these can be used with or without a varlist (e.g. sum race age)
– to ‘quit’ a long command, type ‘q’ and it will stop sending output to the Results window

Exploring your dataset
Open the dataset in the editor or the browser
Difference? Only the editor lets you change the data
Allows you to sort
Variables manager (can be accessed from the viewer or the main toolbar)
– allows you to add labels simply
– includes coding

Exploring
Categorical variables can be summarized using table or tabulate (‘tab’)
– tab race
– table race
list can help with a small dataset, or to look at a subset of the dataset
– list race age if age<30
Can also sort at the command line
– sort age

Interactive command line driven?
Well, there is a little running man, after all!
GOOD PROGRAMMING PRACTICE:
– open a ‘do’ file
– enter all of your commands in the do file
– you can select one or more to run at a time
– SAVE your do file!!!
Window > Do File Editor
How to include comments? * or /*…*/
* this is how we can make a table of race and ER
tab race ercat
/* our table looks very nice. we should really make pretty tables all the time */

Do file of our commands so far
* slide 14: reading in data
cd "I:\Classes\StatComputingI"
insheet using "SCBC2004.csv", comma clear
* clear is needed here too, because insheet has just put data in memory
use SCBC2004.dta, clear
* slide 15: exploring our dataset
* use d or describe
d
d ercat
codebook
codebook dodyr
sum
sum ercat
codebook ercat
* slide 17: more exploration
tab race
table race
list race age if age<30
sort age

What about the output?
Sometimes you want to have a file that shows the results
Useful to share with investigators(?)
Nice to have output saved
My preference? Keep a really good ‘do’ file and rerun it.
Log file setup steps:
– File > Log > Begin
– analyze data, etc.
– File > Log > Suspend (or End)
Options for text (.log) or formatted (.smcl) files
– *.log can be opened in a text editor
– *.smcl can only be opened in Stata but looks nicer (and can be printed from Stata)
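The menu steps for logging have command equivalents that can go in a do file. A minimal sketch, assuming a hypothetical log name and using Stata's bundled auto dataset in place of the class data:

```stata
* Close any log left open by an earlier run; capture ignores the error if none is open
capture log close

* Begin a plain-text log; replace overwrites an existing file of the same name
log using "stata_intro_session.log", text replace

sysuse auto, clear
describe
summarize price mpg

* Suspend logging: output of the next command is not written to the file
log off
tab foreign

* Resume logging, then end it
log on
log close
```

Leaving out the text option produces a formatted .smcl log instead.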

Getting stuff out of Stata
Stata can be good for data management
I prefer it to R
– step 1: data management in Stata
– step 2: write ‘clean’ file from Stata to csv
– step 3: read clean file into R
Exporting:
– menu-driven: File > Export
– command line: outsheet [varlist] using "file.csv", comma
** for the command line, you may need replace as an option if a file of the same name already exists and you want to overwrite it.

Saving Stata Data
File > Save or Save as
Command line:
– save "filename", replace
– save filename
– save filename.dta
– .dta will be added if you leave it off
– replace may or may not be needed (it is required only when the file already exists)

What if you don’t want to save or export everything?
You can use the keep and drop commands to keep or drop observations or variables before exporting/saving
We want to analyze ER, PR status, stage, age and grade in African American women.
– drop if race==1
– keep ercat prcat stagen age grade
These observations and variables are GONE from Stata’s memory
If you want them back, you need to reload the original data
BE CAREFUL: do NOT drop variables or observations and then overwrite the original data!
You can also include a ‘varlist’ with the outsheet command
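One way to subset safely is to save the reduced data under a new name, so the original file is never overwritten. The sketch below uses Stata's bundled auto dataset as a stand-in for the registry data; the variables and the output file names are assumptions chosen for illustration.

```stata
sysuse auto, clear

* Drop observations: remove foreign cars, keeping domestic ones
drop if foreign == 1

* Keep only the variables needed for the analysis
keep make price mpg weight

* Save under a NEW, hypothetical name; the source dataset is left untouched
save "auto_domestic.dta", replace

* Export a csv; a varlist before using limits which columns are written
outsheet make price mpg using "auto_domestic.csv", comma replace

* Dropped rows and columns only come back by reloading the original data
sysuse auto, clear
```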

Other options for subsetting
by: performs command by categories
– by race, sort: sum age
– bysort ercat prcat: sum age
if: performs command in a category/range
– tab ercat if stagen>1
– tab ercat if graden~=.
Combine them:
– bysort ercat prcat: sum age if ercat<9 & prcat<9
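The by and if patterns can be tried on any dataset. This sketch uses Stata's bundled auto dataset, so the variable names (foreign, rep78, mpg, price) are stand-ins for the registry variables on the slide.

```stata
sysuse auto, clear

* by: repeat the command within each category (bysort sorts first)
bysort foreign: summarize mpg

* if: restrict the command to observations meeting a condition
tab foreign if price > 5000

* Exclude missing values explicitly; rep78 has some in this dataset
tab foreign if rep78 != .

* Combined: within-group summaries on a restricted set of observations
bysort foreign: summarize mpg if price < 10000 & rep78 != .
```

Because Stata treats missing values as larger than any number, a condition such as rep78 > 3 on its own would also select observations where rep78 is missing.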

Working with variables
New variables can be created with the ‘generate’ command (or just ‘gen’)
Example: grade has 4 levels.
tab graden
[Frequency table of graden with columns Freq., Percent and Cum.; the counts are not recoverable from the transcript.]
We want to create a high vs. low grade variable

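Several approaches
Approach 1:
gen highgrade = 1 if graden>2 & graden<.
replace highgrade = 0 if graden<3
Approach 2:
gen highgrade=cond(graden>2,1,0)
replace highgrade =. if graden==.
Note well: Check coding of missing values!! Stata treats missing (.) as larger than any number, so graden>2 is also true when graden is missing.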

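Extensions to generate: ‘egen’
Same example: egen has a function ‘cut’ that can cut a continuous variable at a list of breakpoints. Each category runs from one breakpoint up to (but not including) the next.
egen highgrade=cut(graden), at(-1,3,5)
egen highgrade=cut(graden), at(-1,3,5) icodes
Without icodes the new variable takes the lower breakpoint of each interval as its value (-1 and 3 here); with icodes the categories are coded 0, 1.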
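A quick way to check the coding of missing values is to build the variable two ways on a toy dataset and cross-tabulate. The five-row dataset and the names high_cond and high_cut are made up for illustration.

```stata
clear
input graden
1
2
3
4
.
end

* cond() approach; the if qualifier leaves missing grades as missing
gen high_cond = cond(graden > 2, 1, 0) if !missing(graden)

* egen cut approach: [-1,3) is coded 0 and [3,5) is coded 1
egen high_cut = cut(graden), at(-1,3,5) icodes

* The missing option shows where the missing grade ended up
tab graden high_cond, missing

* Stops with an error if the two codings ever disagree
assert high_cond == high_cut
```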

generate
Use it for transformations
– gen y = log(x)
– gen y = x^2
Generate random variables
– gen z1 = runiform()        // uniform(0,1)
– gen z2 = 2 + 2*runiform()  // uniform(2,4)
Generate an ascending observation id, overall and within county
– gen id = _n
– bysort county: gen countyid = _n

Example of using these commands together
We want to randomly select 10 women from each of 46 counties in SC
Step 1: generate random numbers
– gen z1=runiform()
Step 2: sort and number women within counties
– sort county z1
– by county: gen countyid=_n
Step 3: keep only 10 women per county
– drop if countyid>10
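The three steps can be run end to end on simulated data. In this sketch the dataset (46 counties with 50 women each) and the seed value are assumptions; only the sampling logic comes from the slide.

```stata
clear
set obs 2300

* Hypothetical data: county codes 1 to 46, 50 women in each
gen county = mod(_n - 1, 46) + 1

* Fix the seed so the same women are selected on every run
set seed 20140219

* Step 1: a random number for each woman
gen z1 = runiform()

* Step 2: sort by the random number within county and number the women
bysort county (z1): gen countyid = _n

* Step 3: keep the first 10 in each county
drop if countyid > 10

* Check: 46 counties times 10 women
count
assert r(N) == 460
```

bysort county (z1) sorts by county and then by z1, but groups by county only, which combines the slide's sort and by lines into one.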

Formatting Dates
Dates do not always maintain formatting, especially when reading data from csv files
Two steps: generate and format
Example Stata syntax
– gen newdate = date(datevar, "MDY")
– format newdate %td
Stata treats dates as integers (formatting is like labels) so they can be manipulated
Month, day and year can be extracted
Also, see clock
There are a lot of details that can be found in the help file
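A small sketch of the generate-and-format steps, followed by extracting the parts of the date. The two date strings and the names yr, mo, dy and days_between are made up for illustration.

```stata
clear
input str10 datevar
"02/19/2014"
"12/31/2004"
end

* Convert the string to Stata's integer date (days since 01jan1960)
gen newdate = date(datevar, "MDY")

* Formatting changes only how the integer is displayed
format newdate %td

* Extract the components
gen yr = year(newdate)
gen mo = month(newdate)
gen dy = day(newdate)

* Dates are integers, so subtraction gives a number of days
gen days_between = newdate - newdate[2]

list
```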

Reshaping Data
In Stata there is one command to reshape, IF your data are in the right format.
From long to wide:
– i indexes the observation (e.g., patient, hospital)
– j indexes the repeats (e.g., year, cycle, visit)
– Also need to list which variables vary by j

Example: ceramide data
Clinical trial in cancer patients
Ceramide (et al.) were measured every two cycles in patients
Of interest: do changes in ceramide correlate with outcome (e.g., response, survival)?
Data provided in long format
– i is patient_id
– j is cycle
– Ceramide, etc. vary per patient
– Some variables are constant (and Stata can figure it out!)

Reshaping ceramide data
reshape wide collecteddate-frombaselines1p, i(patient) j(cycle)
reshape long: once Stata has reshaped the data in the current session, it can reshape back again without any options

Reshaping wide to long
Much more common
Many researchers “grow” their datasets by columns instead of rows
Formatting needs to be specific
– Variable names must have a numeric suffix
– Could require a fair amount of editing
– Depends on how many repeats and variables there are

Reshaping wide to long
clear
insheet using "ceramide2.csv"
* move the cycle number to the end of each name so reshape can read it as j
rename cycle1totalceramidelevels totalceramidelevels1
rename cycle1diseasestatus diseasestatus1
rename cycle1c18ceramide c18ceramide1
rename cycle3totalceramidelevels totalceramidelevels3
rename cycle3diseasestatus diseasestatus3
rename cycle3c18ceramide c18ceramide3
rename cycle5totalceramidelevels totalceramidelevels5
rename cycle5diseasestatus diseasestatus5
rename cycle5c18ceramide c18ceramide5
rename cycle3daysfromstart daysfromstart3
rename cycle5daysfromstart daysfromstart5
reshape long daysfromstart diseasestatus totalceramidelevels c18ceramide, i(patient) j(cycle)
* drop patient-cycles with no measurement
drop if totalceramidelevels==.
* cycle 1 is the baseline, so it starts at day 0
replace daysfromstart=0 if cycle==1
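The same pattern on a toy dataset makes the role of i, j and the numeric suffix easier to see. The three patients and the variable stub level are made up for illustration.

```stata
clear
input patient level1 level3 level5
1 5.0 6.1 7.2
2 4.4 4.9 .
3 6.3 . .
end

* Wide to long: the stub is level, the suffixes 1, 3 and 5 become values of cycle
reshape long level, i(patient) j(cycle)
list, sepby(patient)

* Stata remembers the layout, so going back needs no options
reshape wide
list
```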