1
Intermediate STATA
Adrián de la Garza, Jeremy Green
27 March 2009
2
Getting Help
* STATA Help: type help in the main STATA Command window.
* STATA listserv:
* UCLA Stat Computing:
* Yale StatLab: consultants, online help, and FAQs:
* Manuals are also available at SSL and Yale StatLab.
0. Introduction
3
Today’s Workshop
1. Programming/Project Management Tips
2. Data Management
3. Analyzing Data
   - Graphs
   - Statistical Analysis
Latest version: STATA v. 10. Commands throughout this presentation refer to this version, although most are backwards-compatible.
4
Using DO files (1/2)
* DO files let you run a whole program at once or run only selected portions of it.
* AVOID changing your original data interactively in the STATA Command window. Use DO files instead.
* Use DO files to make changes to your data and to run your statistical and graphical analyses.
* DO files also keep track of your progress.
1. Programming/Project Management Tips
5
Using DO files (2/2)
* Keep your DO files organized: it helps to create a main DO file from which you run other DO files that perform smaller tasks on your data.
* Write plenty of comments in your DO file so that, months later, you can still tell what a command or a section does.
* To open a DO file, use the FILE menu or the DO-file button.
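A minimal sketch of such a main DO file. The task files 01_clean.do and 02_analysis.do and the log name master_log are hypothetical names chosen for illustration; they are not from the workshop.

```stata
* master.do: runs the whole project from one place (hypothetical file names)
version 10              /* interpret commands as STATA 10 would */
clear
set more off            /* do not pause output at --more-- */
capture log close       /* close any log left open by an earlier run */
log using master_log, replace text

do 01_clean.do          /* hypothetical: cleans the raw data */
do 02_analysis.do       /* hypothetical: produces tables and graphs */

log close
```

Running one file reproduces every step in order, and the log records what was done.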
6
Log files
Syntax
Open a log file:
    log using filename [, append replace [text|smcl] name(logname)]
Close the log, temporarily suspend logging, or resume logging:
    log {close|off|on} [logname]
Examples
    . log using mylog
    . log close
    . log using mylog, append
    . log using "filename containing spaces"
7
Managing Your Data
* Back up all master data files
  - CD, USB drive, network
* Keep a detailed codebook
  - Describes each variable and its values
  - Adding variables, cases, computing new variables
* Keep a roadmap
  - Keep a log of all analyses with what you have done
  - Save syntax files
2. Data Management
8
Inspecting Your Data (1/3)
    cd "C:\Documents and Settings\Adrian\My Documents\stata files"
    clear
    set mem 80m
    log using "C:\Documents and Settings\Adrian\My Documents\stata files\logs\mylog"
    sysuse census
    browse
    list state region pop if _n <= 3       /* shows first 3 obs */
    l state region pop if _N - _n <= 2     /* shows last 3 obs */
    l state region pop in 1/3              /* shows first 3 obs */
    l state region pop in -3/l             /* shows last 3 obs */
9
Inspecting Your Data (2/3)
    generate agesq = medage^2      /* creates variable equal to medage squared */
    sum pop                        /* shows summary stats for pop */
    scalar popmean = r(mean)       /* saves mean of pop to scalar popmean */
    /* create variable equal to 1 when pop > popmean and 0 otherwise */
    g dummy = 0
    replace dummy = 1 if pop > popmean
    /* how many states have population higher than average? */
    count if dummy == 1
    /* how many states NOT IN THE SOUTH have pop > popmean? */
    count if dummy == 1 & region != 3
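The two-step dummy above can also be built in one step, sketched here under the assumption that missing values should stay missing. STATA treats a missing value as larger than any number, so pop > popmean would be true for a missing pop; the if qualifier guards against that. The variable name bigpop is illustrative.

```stata
sysuse census, clear
quietly summarize pop
* the comparison itself evaluates to 1 or 0;
* the if qualifier leaves bigpop missing wherever pop is missing
generate byte bigpop = (pop > r(mean)) if !missing(pop)
count if bigpop == 1
```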
10
Inspecting Your Data (3/3)
    describe
    label list                 /* shows all labels attached to dataset */
    label list cenreg          /* shows label cenreg attached to variable region */
    sum pop
    browse
    /* summarize population by region */
    sum pop if region == "NE"  /* this gives an error since region is not a string */
    sum pop if region == 1     /* this does work */
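If remembering numeric codes is inconvenient, recent STATA versions let an expression refer to a value label by its text. A short sketch, assuming your version supports the "text":labelname syntax (check help label if unsure):

```stata
sysuse census, clear
* "NE":cenreg looks up the numeric code that label cenreg assigns to "NE"
summarize pop if region == "NE":cenreg
```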
11
Calculate mean population by region
Method 1
    sum pop if region == 1
    sum pop if region == 2
    sum pop if region == 3
    sum pop if region == 4
Downside: We have to type the sum command for each individual region. If the dataset contained population data by city and we had to compute means for each of the 50 states, typing the sum command 50 times would be very painful!
12
Calculate mean population by region
Method 2
    bysort region: sum pop
Downside: This method shows the population means by region, like we wanted, but it also shows a bunch of other stats we may not care about. Also, only the results for the last region remain in r() afterwards, so the means are not readily available for further calculations.
13
Calculate mean population by region
Method 3
    table region, c(m pop)
This method is great for presentation purposes: it shows exactly the information we want.
Downside: The population means by region are still not stored anywhere, so they are not readily available for further analyses.
14
Calculate mean population by region
Method 4
    sysuse census, clear
    collapse (mean) pop, by(region)
Downside: The collapse command replaces the dataset in memory with a dataset of summary statistics (means, standard deviations, and so on). In our case, the new dataset contains population means by region. All variables other than the collapsed variable (pop) and the grouping variable (region) are NOT collapsed and hence disappear from the dataset. Can we make any further analyses without the rest of the variables?
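One way to use collapse without losing the other variables is to wrap it in preserve and restore. A minimal sketch; the name meanpop is illustrative:

```stata
sysuse census, clear
preserve                                    /* snapshot the full dataset */
collapse (mean) meanpop = pop, by(region)   /* one row per region */
list
restore                                     /* bring back all 50 states */
describe
```

After restore, the dataset in memory is the original one again, so the collapsed means are only available between preserve and restore unless they are saved to a file first.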
15
Calculate mean population by region
Method 5
    sysuse census, clear
    by region, sort: egen meanpop = mean(pop)
Downside: Do we really want an additional variable in the dataset that contains the population mean by region, a number that is repeated for each observation (state) within the same region? In very large datasets, one additional variable may lead to memory constraints. Use scalars?
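The scalar idea can be sketched with a loop, assuming the regions are coded 1 to 4 as in the census dataset. The scalar names meanpop1 to meanpop4 are illustrative.

```stata
sysuse census, clear
forvalues r = 1/4 {
    quietly summarize pop if region == `r'
    scalar meanpop`r' = r(mean)     /* one scalar per region */
}
scalar list
```

This stores four numbers instead of a 50-row variable, at the cost of having to refer to each mean by name.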
16
Reshaping Data
    sysuse bplong, clear
    br
Suppose we want to take the difference in bp before and after treatment. This is difficult to calculate if the data are organized in long format, so we need to convert to wide format.
    reshape wide bp, i(patient sex agegrp) j(when)
    g bpdiff = bp2 - bp1
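A short sketch of the round trip: after computing the difference in wide format, reshape long with no arguments returns the data to long format. This assumes the same bplong example as above.

```stata
sysuse bplong, clear
reshape wide bp, i(patient sex agegrp) j(when)
generate bpdiff = bp2 - bp1
summarize bpdiff
reshape long        /* with no arguments, undoes the previous reshape wide */
```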
17
Value Labels (1/2)
    g gender = sex
    br
Why do gender and sex look different? Value labels.
Why use value labels?
* They save space (e.g., “0” instead of “male” for each obs.)
* More informative to the researcher (e.g., what region is 3?)
* Regression, lists, tables… display labels instead of values
    table sex, c(m bp1 m bp2)
    table gender, c(m bp1 m bp2)
18
Value Labels (2/2)
    label value gender sex     /* note that sex refers to label, not var */
    br patient sex gender
    label value gender         /* detaches sex label from gender variable */
    br pat sex gend
    label define genderlbl 0 "man" 1 "woman"
    label value gender genderlbl
What do the following commands do?
    label define genderlbl 2 "na", add
    label define genderlbl 0 "Man" 1 "Woman" 2 "NA", modify
19
Dummy Variables (1/3)
Suppose we want to create dummy variables for each of the 4 regions in the census dataset:
    g dum1 = 0
    replace dum1 = 1 if region == 1
    …
What problems may these commands lead to?
20
Dummy Variables (2/3)
To create four dummies, we need to type those two commands four times. More importantly, the previous method generates 0s even when we have missing values.
    tab region, g(d)
This second method tabulates the variable region, showing a list of the four regions, and correctly creates 4 separate dummies, accounting for missing values.
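A small sketch comparing a hand-made dummy that respects missing values with the one created by tab. It assumes the census dataset, where tab region, g(d) names the dummies d1 to d4.

```stata
sysuse census, clear
* hand-made dummy: 1 or 0, but missing wherever region is missing
generate byte dum1 = (region == 1) if !missing(region)
tabulate region, generate(d)
assert dum1 == d1       /* stops with an error if the two ever differ */
```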
21
Dummy Variables (3/3)
One more command that will be useful in regressions:
    xi i.region, noomit
This third alternative yields the same results as the tab method described in the previous slide.
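In a regression, xi is usually written as a prefix so the dummies are created on the fly. A minimal sketch using medage as an arbitrary outcome; the choice of model is only for illustration.

```stata
sysuse census, clear
* xi: expands i.region into dummies and omits one category as the baseline
xi: regress medage i.region
* STATA 11 and later also accept i.region directly, without the xi: prefix
```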
22
Merging Data (1/4)
    sysuse census, clear
    keep state-popurban
    sort state                         /* both master and using data must be sorted */
    save census1, replace
    sysuse census, clear               /* reload: medage-divorce were dropped above */
    keep state region medage-divorce   /* note region is kept in both */
    sort state
    save census2, replace
    use census1, clear
    merge state using census2          /* remember: both files must be sorted */
    table _merge                       /* _merge keeps track of how good merge was */
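The merge syntax on this slide is the STATA 10 form. For readers on STATA 11 or later, a sketch of the equivalent call, assuming census1 and census2 have been saved as described on this slide:

```stata
* STATA 11 and later: state the match type explicitly; no prior sorting needed
use census1, clear
merge 1:1 state using census2
tabulate _merge
```

Declaring 1:1 makes STATA stop with an error if state does not uniquely identify observations in either file, which catches many silent merge mistakes.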
23
Merging Data (2/4)
Important!!!
If a non-merging variable (e.g. region) is in both files, the data in the master file are kept, while the data in the using file are lost.
    use census1, clear
    l state region in 1
    replace region = 2 in 1
    sort state
    merge state using census2
    table _merge
    l state region in 1    /* region data in master file is kept */
24
Merging Data (3/4)
Now suppose that each of the two datasets is missing some of the 50 states, and that the missing states differ between the two. Do we lose information after merging the two datasets?
    use census2, clear
    drop in 3/6
    sort state
    save, replace
    use census1, clear
    drop in 22/23
    merge state using census2
    table _merge
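A self-contained sketch of the same experiment that also lists the unmatched states. The file names part1 and part2 are illustrative. After a merge, _merge is 1 for observations found only in the master file, 2 for observations found only in the using file, and 3 for observations found in both.

```stata
sysuse census, clear
keep state region pop
drop in 22/23                   /* master file lacks 2 states */
sort state
save part1, replace
sysuse census, clear
keep state medage
drop in 3/6                     /* using file lacks 4 other states */
sort state
save part2, replace
use part1, clear
merge state using part2         /* STATA 10 syntax, as in the slides */
tabulate _merge
list state _merge if _merge != 3
```

No observations are dropped: all 50 states remain, but the 6 unmatched ones have missing values for the variables that came from the file they were absent from.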
25
Merging Data (4/4)
Finally, it’s important to note that when a variable has value labels attached in both datasets, the labels attached in the master dataset prevail. This may cause serious trouble, for example, when we merge datasets from surveys taken in different years in which the same answer codes mean different things.
* Example 1: Change in scale (1 to 4 in 1980; 1 to 5 in 1990).
* Example 2: A country is omitted in the second survey, but all countries, sorted in alphabetical order, are assigned consecutive values.
26
Other Data Management Issues
* Use Stat/Transfer software to convert Excel, SAS, SPSS, … files into STATA format.
* Use the compress command to make your dataset as small as possible and use less memory.
* Some very large datasets won’t open in STATA due to STATA’s memory limitations. In this case, open only a subset of the variables, delete the variables/observations that don’t interest you, and try again:
    use varlist using filename
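A minimal sketch of loading a subset, using a copy of the census data saved under the illustrative name mycensus so that there is a file on disk to read from:

```stata
sysuse census, clear
save mycensus, replace          /* illustrative file name */
* load only three variables, and only the observations with region == 3
use state region pop if region == 3 using mycensus, clear
compress                        /* store each variable in its smallest type */
describe
```

The if (or in) qualifier goes before using, so unwanted observations are never loaded into memory.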
27
Analyzing Data: Make a List
* Dependent Variable(s) (response, outcome, criterion)
* Independent Variables (explanatory or predictor variables)
* Treatment Variable
* Covariates / Confounding Variables
* Categorical and Continuous Variables
  Remember: Types of variables determine the statistics we use
* Time period
* Scope and type of analysis
3. Analyzing Data
28
Analyzing Data: Graphs (1/2)
Draw a histogram:
    sysuse auto, clear
    histogram price
Create a scatter plot:
    scatter price mpg
Draw line of best fit (linear regression):
    twoway lfit price mpg
Put two graphs together:
    twoway scatter price mpg || lfit price mpg
29
Analyzing Data: Graphs (2/2)
Type help graph to:
* create other graphs (pie and bar charts, box plots, etc.);
* adjust graph settings (change labels, axes, colors…)
An easier (although less customizable) option is to use the GRAPH menu.
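A short sketch of adjusting a few graph settings from a DO file and saving the result. The titles and the file name price_mpg.png are illustrative, and PNG export is assumed to be available on your platform.

```stata
sysuse auto, clear
twoway (scatter price mpg) (lfit price mpg), ///
    title("Price vs. mileage") ///
    ytitle("Price") xtitle("Mileage (mpg)") ///
    legend(off)
graph export price_mpg.png, replace     /* illustrative file name */
```

Keeping graph commands in a DO file makes every figure reproducible, which the GRAPH menu alone does not.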
30
Analyzing Data: Statistical Analysis (1/2)
* Correlation: quantify relationships between variables
* Regression: predict dependent variable from independent variable(s)
* Group differences: t-test & ANOVA
* Chi-square for categorical and frequency data
* Significance vs. effect size
* More Complex Models
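A brief sketch of the group-difference and chi-square tests listed above, using the auto dataset. The choice of variables is only for illustration.

```stata
sysuse auto, clear
ttest price, by(foreign)            /* two-group difference in means */
oneway price rep78, tabulate        /* one-way ANOVA across repair records */
tabulate rep78 foreign, chi2        /* chi-square test for two categorical vars */
```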
31
Analyzing Data: Statistical Analysis (2/2)
cor var1 var2 gives the basic (Pearson) correlation between two variables.
    cor price mpg
regress var1 var2 fits a linear regression of var1 on var2; the slope estimates how var1 changes with a one-unit change in var2.
    reg price mpg
Useful textbook for more on stats for social sciences: Agresti, Alan, and Barbara Finlay (2008): Statistical Methods for the Social Sciences, Prentice Hall, 4th edition.
Textbook examples with STATA:
32
Thank you!!