1
Advanced Quantitative Techniques
Lab 1
2
Schedule
3
Your TAs Liz Marcello elizabeth.marcello@columbia.edu
If you have a question or are late (don’t be) with your assignments, contact both of us: Liz Marcello and Bernadette Baird-Zars.
4
Today
Recap: the importance of good theory. Logic, thinking through other possible explanatory factors, and good comparisons are the center of the course. Planning can almost never use a pure experimental design, so good design is of the essence.
Stata basics: opening, saving, commenting
Describing your data
Summarizing
Histograms
Normality
Sampling
Simpson’s index of diversity: developed by ecologists to see how diverse a watershed is; demographers have adopted it for population studies.
Unit of analysis: what are you measuring?
5
Download sample data from courseworks
Open Stata.
Download the sample data from Courseworks: ‘files’ -> ‘lab session’ -> gssy dta
Or, if you have a very clean data set at your fingertips and feel comfortable customizing the commands, use your own data.
6
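log files
One way to make it easy to reproduce your results is to write a set of programs that contain all of your Stata commands. A log file complements this by recording the commands you run and the output they produce.
Stata commands are case-sensitive, so type them in lower case:
log using filename, append (or replace)
log close
log off / log on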
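A minimal sketch of a logged session, meant to be run from a do-file. The log name lab1_session and the use of Stata's built-in auto example data (loaded with sysuse) are assumptions for illustration; they are not part of the lab materials.

```stata
capture log close                      // close any log left open by an earlier run
log using lab1_session, text replace   // start a plain-text log, overwriting an old one
sysuse auto, clear                     // built-in example data (assumption, not the lab data)
summarize price                        // recorded in the log
log off                                // pause logging
describe                               // not recorded
log on                                 // resume logging
log close                              // finish and save lab1_session.log
```

Using replace overwrites the previous log each time the do-file is run; append would add the new session to the end of the existing file instead.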
7
do files A do file is just a set of Stata commands typed in a plain text file. You can use Stata's own built-in do-file Editor, which has the great advantage that you can run your program directly from the editor by clicking on the run/do icon. You can also select just a few commands and run them.
8
Comments & Annotation
You can even insert comments into the programs to help other researchers (and yourself) follow the thought process. It is always a good idea to start every do file with comments that include at least a title, the name of the programmer who wrote the file, and the date.
// indicates that everything that follows to the end of the line is a comment.
/* */ indicates that all the text between the opening /* and the closing */, which may be a few characters or may span several lines, is a comment to be ignored by Stata.
/// tells Stata that a command continues on the next line: everything else to the end of the line is a comment, and the command itself continues on the next line.
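The three comment styles can be sketched in one short do-file. The header contents are placeholders, and the built-in auto data set (sysuse auto) is an assumption used only so the commands have something to run on.

```stata
/* Title:  Lab 1 comment demo (hypothetical do-file)
   Author: your name
   Date:   date written */

sysuse auto, clear        // everything after the two slashes is ignored

summarize price mpg ///   the rest of this line is a comment
    weight                // ...and the command finishes here
```

Note that // and /// must be preceded by a space, and that these comment styles are intended for do-files rather than for commands typed interactively.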
9
Command: “summarize”
summarize var1 (or sum var1)
This command gives you the mean, standard deviation, minimum and maximum value of var1.
summarize var1, detail
This command gives you more detail, such as the median.
In planning: for income in NYC, median or mean? Which of the variables you use are normally distributed?
Mean is best…     when the dataset is outlier free & relatively ‘normal’
Median is best…   when the dataset has any significant outliers, or is multi-modal or skewed
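As a quick illustration of the mean versus median question, the sketch below runs both forms of summarize on the price variable of Stata's built-in auto data set. That data set is an assumption chosen for convenience; substitute a variable from the lab data.

```stata
sysuse auto, clear            // built-in example data (assumption, not the lab data)
summarize price               // N, mean, SD, min, max
summarize price, detail       // adds percentiles (50% is the median), skewness, kurtosis
display "mean = " r(mean) "   median = " r(p50)
```

After summarize with the detail option, the mean and median are stored in r(mean) and r(p50), which makes it easy to compare them directly.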
10
Command: “tabulate”
You cannot summarize variables containing strings (text), i.e. Stata can’t add up a column of ‘yes’ or ‘no’ values (without more advanced code).
tabulate var1 (or tab var1)
Categorical variables
• What category each subject belongs to
• 2 forms of categorical variables
– Nominal/qualitative: unordered categories (gender, race/ethnicity, etc.)
– Ordinal: categories can be ordered in some way (questions in course evaluation, etc.)
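A short sketch of tabulate on one ordinal and one nominal variable. It assumes Stata's built-in auto data set, where rep78 is a 1 to 5 repair rating and foreign is a Domestic/Foreign indicator; neither variable comes from the lab data.

```stata
sysuse auto, clear         // built-in example data (assumption, not the lab data)
tab rep78, missing         // ordinal: repair record 1-5, with missing values shown
tabulate foreign           // nominal: Domestic vs. Foreign
```

The missing option adds a row for missing values, which tabulate otherwise leaves out of the table.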
11
Command: “histogram” Only applied to numeric variables
histogram POP2010
Default: probability density (the relative likelihood for this random variable to take on a given value)
histogram POP2010, width(500) frequency
histogram POP2010, bin(20) percent
12
Describe the graph Shape of the distribution
1. Where is the hump (called mode)? --does the histogram have a single, central hump or several separated humps (unimodal, multimodal)
2. Do any unusual features stick out? --any values that are far away from the body of the distribution (outliers)
3. Is the histogram symmetric? --does one tail stretch out farther than the other (skewed)
13
Skewness “Where is the tail?”
Tail to the right: Mean > Median; STATA: Skewness > 0
Symmetric: Mean = Median; STATA: Skewness = 0
Tail to the left: Mean < Median; STATA: Skewness < 0
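One way to check the sign of the skewness in Stata is to read it from summarize with the detail option. The sketch assumes the built-in auto data set, whose price variable has a long right tail; it is not the lab data.

```stata
sysuse auto, clear                 // built-in example data (assumption, not the lab data)
quietly summarize price, detail    // stores r(skewness), r(mean), r(p50) among others
display "skewness = " r(skewness)
display "mean - median = " r(mean) - r(p50)
```

A positive skewness together with a mean above the median points to a tail on the right.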
14
Is population normally distributed?
histogram POP2010, width(600) frequency normal
15
Command: “generate”
generate a new variable representing the percentage of females
Meaning                     Command
And                         &
Or                          |
Equal                       ==
Less than                   <
More than                   >
Less or equal (at most)     <=
More or equal (at least)    >=
Does not equal              ~= or !=
16
Command: “generate”
Try: generate female_rate = females/POP2010
Are there any problems? The observations with POP2010=0 are treated as missing values.
drop female_rate
3-step protocol:
generate female_rate = 0
replace female_rate = females/POP2010 if POP2010 != 0
replace female_rate = . if missing(females, POP2010)
Note the comma in missing(females, POP2010): the function is true when either variable is missing.
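The 3-step protocol can be traced on a tiny made-up data set. The four rows below are hypothetical and chosen to cover a normal case, a zero population, and a missing value in each variable; the last step uses missing(females, POP2010), the comma form that is true when either argument is missing.

```stata
clear
input females POP2010
50 100
0  0
.  80
30 .
end

generate female_rate = 0                                // step 1: start at zero
replace female_rate = females/POP2010 if POP2010 != 0   // step 2: divide where possible
replace female_rate = . if missing(females, POP2010)    // step 3: restore true missings
list
```

Row 2 keeps the value 0 because its population is zero, while rows 3 and 4 end up missing because one of the inputs is unknown.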
17
Value Labels
Give female_rate labels of high/low
Values must be specified:
generate female_rate_cat = 0
replace female_rate_cat = 1 if female_rate >= 0.5
replace female_rate_cat = 2 if female_rate < 0.5
replace female_rate_cat = . if missing(females, POP2010)
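A minimal sketch of how the numeric categories might then be given high/low labels with label define and label values. The three data rows, the label name ratelbl, and the mapping 1 = high, 2 = low are assumptions for illustration.

```stata
clear
input female_rate
0.62
0.41
.
end

generate female_rate_cat = 0
replace female_rate_cat = 1 if female_rate >= 0.5    // also true for missing, fixed below
replace female_rate_cat = 2 if female_rate < 0.5
replace female_rate_cat = . if missing(female_rate)

label define ratelbl 1 "high" 2 "low"                // create the value label
label values female_rate_cat ratelbl                 // attach it to the variable
tabulate female_rate_cat, missing
```

Stata treats a missing value as larger than any number, so the comparison >= 0.5 is true for missing observations; that is why the final replace line is needed.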
18
Value Labels Label variable “state” (36-New York)
Value labels can be assigned only to numeric variables.
Sometimes you may need to convert between numeric and string variables.
destring state, generate(state_destr)
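The conversion and labeling steps can be sketched on a two-row example. The string codes and the label name statelbl are hypothetical; only the pairing of 36 with New York comes from the slide.

```stata
clear
input str2 state
"36"
"06"
end

destring state, generate(state_destr)     // numeric copy of the string codes
label define statelbl 36 "New York"       // only the code named on the slide
label values state_destr statelbl
list
```

Codes without an entry in the value label, such as 6 here, are simply displayed as numbers.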
19
Command: “help”
20
Sampling How to draw a random sample from an existing dataset?
Random sample w/o replacement Random sample w/ replacement
21
Random sample w/o replacement
Any observation in the dataset can be chosen 0 or 1 times.
Specify the size of the sample by absolute number or percentage.
Command
sample # (percentage)
sample #, count (absolute number)
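Both forms of sample can be tried as follows. The sketch assumes the built-in auto data set (74 observations) and an arbitrary seed, set only so the draw can be repeated.

```stata
sysuse auto, clear     // built-in example data (assumption, not the lab data)
set seed 12345         // arbitrary seed, for a repeatable draw
sample 10              // keep about 10 percent of the observations
count

sysuse auto, clear     // reload: sample drops the unselected observations
sample 20, count       // keep exactly 20 observations
count
```

Because sample deletes the observations that were not drawn, the data have to be reloaded before a second sample is taken from the full data set.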
22
What will happen if we specify a number that is larger than the number of observations in the data set? All of the observations from the data set will be kept, but none will be sampled a second time to increase the sample size to the desired number. Notice also that Stata does not issue an error message when the sample size exceeds the number of observations in the data set.
23
Drawing random sample by category
You can also select a sample with a given percentage or number from each level of a grouping variable. Note that you need to sort the data on the grouping variable before using the by: prefix.
Using gssy dta
codebook race
sort race
by race: count
by race: sample 10 (, count)
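The same pattern, sketched on the built-in auto data set with foreign as the grouping variable. Both the data set and the seed are assumptions for illustration, standing in for the lab's race variable.

```stata
sysuse auto, clear               // built-in example data (assumption, not the lab data)
set seed 12345                   // arbitrary seed, for a repeatable draw
sort foreign                     // by: requires data sorted on the grouping variable
by foreign: count                // group sizes before sampling
by foreign: sample 5, count      // 5 observations from each group
tabulate foreign
```

Dropping the count option would instead keep 5 percent of each group.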
24
Random sample w/ replacement
Any observation in the dataset can be chosen 0, 1, 2, …, n times (n is the specified sample size).
Specify the size of the sample only by absolute number. Sample size cannot exceed the number of observations in the data set.
Command
bsample #
Using Lab_5_data
bsample 5
list
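A minimal sketch of sampling with replacement, assuming the built-in auto data set in place of the lab data and an arbitrary seed.

```stata
sysuse auto, clear     // built-in example data (assumption, not the lab data)
set seed 12345         // arbitrary seed, for a repeatable draw
bsample 5              // draw 5 observations with replacement
list make price        // the same car may appear more than once
```

After bsample the data in memory are replaced by the 5 drawn observations.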
25
Weight option
First, you need to have a "weight" variable in the data set.
Reload the data (use …, clear), then:
generate wt = 0
bsample 5, weight(wt)
list
All observations are kept. The values of “wt” are changed to reflect which observations are included in the sample. The command can be run multiple times, but it cannot give descriptive statistics for the samples.
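A sketch of the weight option, again assuming the built-in auto data set and an arbitrary seed. The variable name wt follows the slide.

```stata
sysuse auto, clear         // built-in example data (assumption, not the lab data)
set seed 12345             // arbitrary seed, for a repeatable draw
generate wt = 0            // the weight variable must exist beforehand
bsample 5, weight(wt)      // data stay intact; wt records how often each row was drawn
list make wt if wt > 0     // the observations that made it into the sample
```

The values in wt add up to the requested sample size, and an observation drawn twice shows wt equal to 2.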
26
Drawing random sample by category
bsample may not be combined with by. Instead, we use the strata option: provide Stata with the number of observations that you want from each stratum.
Reload the data (use …, clear), then:
sort Gender
bsample 3, strata(Gender)
list
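Stratified sampling with replacement can be sketched as follows, with the built-in auto data set and its foreign variable standing in for the lab data and Gender. Both are assumptions for illustration, as is the seed.

```stata
sysuse auto, clear              // built-in example data (assumption, not the lab data)
set seed 12345                  // arbitrary seed, for a repeatable draw
sort foreign
bsample 3, strata(foreign)      // 3 draws with replacement within each stratum
tabulate foreign
```

The requested number applies to every stratum, so it should not exceed the size of the smallest one.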