Stata Basic Course Lab 4.

Modifying & Managing Data: Labeling Values
Value labels put word labels on the values of a categorical variable. Example: a variable is coded no=0 and yes=1 in the dataset. With a value label we read Yes and No in the output even though the stored values are still 0 and 1.
To make value labels:
Step One: Define the label
    label define [label name] [value1] "[label for value1]" [value2] "[label for value2]"
Remember: the value always comes first, and the label text always goes in straight double quotes.
Ex: label define yn 0 "No" 1 "Yes"
Step Two: Apply the value label to the variable
    label values [variable you are labeling] [label you want to apply]
Note: the same label can be applied to multiple variables! (label def and label val are accepted abbreviations.)
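The two steps can be tried on a toy dataset. This is only a sketch: the three households and their values are made up for illustration and are not taken from the course data.

```stata
* Toy data (hypothetical values)
clear
input hhid femhead
1 0
2 1
3 1
end

* Step one: define the label (value first, text in straight quotes)
label define yn 0 "No" 1 "Yes"

* Step two: attach the label to the variable
label values femhead yn

* Output shows No/Yes, but the stored values are still 0/1
tabulate femhead
list hhid femhead, nolabel
```

The nolabel option on list is a quick way to check the underlying numeric codes.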

sort and gsort commands
sort arranges the observations in ascending order of the values of a variable. If we want to sort by the variable income, typing
    sort income
arranges the observations from the lowest income to the highest.
You can sort observations according to more than one variable. For instance, if you type
    sort femhead income
Stata sorts the observations first by femhead and then, within each value of femhead, by income.
To see the sorted data, you can look in the data window or use the list command. list shows the specified variables on the screen. If you type
    list income
you see the income values from the lowest to the highest, separately for male-headed and female-headed households.

gsort: sorting in descending order
sort can only sort the observations in ascending order. Sometimes you may want to sort observations in descending order, from the largest to the smallest. For this purpose you can use the gsort command with a minus sign in front of the variable name. Type:
    gsort -income
This arranges the observations from the largest income to the smallest. Without the minus sign, gsort sorts in ascending order, just like sort.
You can also use more than one variable:
    gsort femhead -income
This sorts the observations from the largest income to the smallest, separately for male-headed and female-headed households.
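A small sketch of the difference between sort and gsort, using four made-up households (the values are assumptions for illustration only). Note that gsort sorts a variable in descending order only when its name is prefixed with a minus sign.

```stata
* Toy data (hypothetical values)
clear
input hhid femhead income
1 0 500
2 1 300
3 1 800
4 0 200
end

* Ascending by femhead, then ascending by income within femhead
sort femhead income
list

* Descending by income: the minus sign asks for descending order
gsort -income
list

* Ascending by femhead, descending by income within each group
gsort femhead -income
list
```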

Keeping and deleting variables
drop / keep commands
Both are used in the same way but with opposite functions:
drop deletes the observations (rows) or variables (columns) that you specify.
keep deletes everything EXCEPT the observations (rows) or variables (columns) that you specify.
An important aside: if we don't want to lose information from the dataset, we should type preserve before dropping or keeping anything, and restore afterwards.
preserve: saves a copy of the data currently in memory, so that we can temporarily work with a modified dataset without changing the original.
restore: brings the data back as they were when preserve was typed.
So before cutting any dataset, using these two commands is recommended.

Preserve and restore and the use of tempfile
Before we drop some of our observations, it is better to preserve the current data set so that we can recover the dropped observations at a later stage. The preserve command saves a copy of the data set currently in memory. We can then recover this same data set by typing restore.
Suppose we are only interested in female-headed households. To make our data set smaller, we can type
    keep if femhead==1
When we drop observations, all the information associated with them is erased from our data set. If we want them back, the original data set has to be opened again (unless you have used the preserve command).
We can use the count command to see how many observations are left and therefore how many we cut. After typing restore, count should report the number of observations in the original data set.
We can also save the new dataset in a temporary file. tempfile assigns names to the specified local macros that can be used as names for temporary files. When the program or do-file concludes, any datasets created with these assigned names are erased.
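The preserve / keep / tempfile / restore sequence described above might look as follows in a do-file. This is a sketch on made-up data; the macro name females is a hypothetical choice, not something from the course files.

```stata
* Toy data (hypothetical values)
clear
input hhid femhead income
1 0 500
2 1 300
3 1 800
4 0 200
end

* Four observations before cutting
count

* Take a snapshot of the data in memory
preserve

* Keep only female-headed households; two observations remain
keep if femhead == 1
count

* Save the subset under a temporary name (hypothetical macro: females)
tempfile females
save `females'

* Bring back the original four observations
restore
count
```

The temporary file can be reopened later in the same do-file with use followed by the macro reference, and it disappears when the do-file ends.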

drop and keep: examples
Specify the variables you want to drop or keep:
    drop agehead
This drops the variable agehead from the dataset.
    keep region hhid femhead income pcexp
This keeps only these 5 variables from the original dataset.
You can also specify the observations you want to delete using the "if" clause:
    drop if region==1
This drops all observations for which the variable region is equal to one.
    keep if region==2
This keeps only the observations for which the variable region is equal to two.
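As a quick check of how these commands combine, here is a sketch on four made-up households (all values are assumptions for illustration).

```stata
* Toy data (hypothetical values)
clear
input hhid region femhead income agehead
1 1 0 500 40
2 2 1 300 35
3 2 0 800 50
4 3 1 200 60
end

* Remove one variable (column)
drop agehead

* Remove the observations (rows) from region 1
drop if region == 1

* Keep only the observations from region 2
keep if region == 2

* Keep only two variables
keep hhid income

* Two observations and two variables are left
describe
list
```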

Algebraic operators
The arithmetic operators are:
    +  addition
    -  subtraction (or create negative of value, or negation)
    *  multiplication
    /  division
    ^  raise to a power
To illustrate these operators, consider the following generate statements:
    generate income1 = income+1            (addition)
    generate negwage = -income             (negative or negation)
    generate femeduc = femhead*educave     (multiplication)
    generate income_ed = income/educave    (division)
    generate age_sq = agehead^2            (power)

Math functions and logical operators
Click on math functions. Scrolling down the list you will see many functions that are new to you. A few examples are:
    generate logincome = log(income)     (natural logarithm)
    generate elincome = exp(income)      (exponential function, the antilog of the natural log)
    generate rootpcexp = sqrt(pcexp)     (square root)
The if qualifier (we'll see more about it later) uses a logical expression to determine which observations to use:
    <    less than
    <=   less than or equal
    ==   equal
    >    greater than
    >=   greater than or equal
    !=   not equal
    &    and
    |    or
    !    not (logical negation; ~ can also be used)
    ( )  parentheses are for grouping to specify the order of evaluation

Using gen/egen and replace/recode
Examples of the gen command:
    generate a = 1                            (column of ones)
    gen land = rlandown+ilandown              (adds rlandown & ilandown)
    gen age40=1 if agehead>40 & agehead!=.
Be careful: Stata treats missing values as larger than any number, so a missing agehead satisfies agehead>40. Always exclude missing values when using > and >=!
Variable names can't start with numbers!
Examples of the replace command:
    replace agehead=. if agehead==999         (e.g. questionnaire non-response)
    replace over10=0 if age<=10
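The missing-value warning is easiest to see on a small example. The sketch below uses made-up ages and, as a variation on the gen command above, builds a 0/1 indicator from a logical expression; the variable name age40_wrong is hypothetical and only there to show the pitfall.

```stata
* Toy data (hypothetical values); 999 is a non-response code
clear
input hhid agehead
1 35
2 52
3 999
4 .
end

* Turn the non-response code into a proper missing value
replace agehead = . if agehead == 999

* Pitfall: missing counts as larger than 40, so this flags unknown ages
generate byte age40_wrong = (agehead > 40)

* Guarded version: 1 if over 40, 0 if not, missing if age is unknown
generate byte age40 = (agehead > 40) if !missing(agehead)

list
```

Households 3 and 4 end up with age40_wrong equal to 1 even though their age is unknown, while the guarded age40 leaves them missing.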

egen is often used for more sophisticated functions (statistical functions like mean, sd, etc.)
egen examples:
    egen mean_age=mean(agehead)
    egen mean_age_over40=mean(agehead) if age40==1
    bysort age40: egen groupagemean=mean(agehead)
or, equivalently (use one or the other, since both create the same variable):
    egen groupagemean=mean(agehead), by(age40)
recode
Recoding a variable means changing specific values of the variable to other values. It is better to generate a copy of the variable before you recode it. For example, before recoding the values of var1, generate a new variable called var1b (gen var1b = var1) and then work with the new variable.

Recode all 0s to be 2s:
    gen female = femhead
    recode female (0=2)
Recode all 1s to be 0s AND all 2s to be 1s:
    recode female (1=0) (2=1)
Recode all 999s to be "missing":
    recode dw_wealth (999=.)
Recode all values between 30 and 60 to be 45:
    gen age = agehead
    recode age (30/60=45)
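The egen and recode examples can be run together on a toy dataset. This is a sketch with made-up values; it follows the advice of copying a variable before recoding it. Each recode rule is applied to the value the observation had before the command, so (1=0) (2=1) does not chain.

```stata
* Toy data (hypothetical values)
clear
input hhid femhead agehead age40
1 0 25 0
2 1 45 1
3 0 70 1
4 1 30 0
end

* egen: overall mean and group means, repeated on every row
egen mean_age = mean(agehead)
bysort age40: egen groupagemean = mean(agehead)
sort hhid

* recode on a copy: 0 becomes 2, then 1 becomes 0 and 2 becomes 1
gen female = femhead
recode female (0=2)
recode female (1=0) (2=1)

* recode a range: ages 30 through 60 all become 45
gen age = agehead
recode age (30/60=45)

list
```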

Modifying Data: Merging Datasets
merge combines multiple datasets into one by matching on the variable you choose.
Step One: Identify the appropriate variable to use for merging
Note: The variable must exist in both datasets.
Step Two: Format data for merging
Make sure the variable has the same name in both datasets. If not, rename it or create a new variable.
Sort both datasets by the merging variable and save:
    sort hhid
    save xxx.dta, replace
Step Three: Open the "master" dataset (the primary dataset)

Step Four: Merge
    merge [variable used for merging] using [DATASET BEING MERGED INTO the current one]
Ex: merge hhid using [DATASET]
This is the older form of the command, which relies on both datasets being sorted by the merging variable (Step Two).
Step Five: Examine merge success
    tab _merge
The current Stata syntax for merge states the type of match explicitly:
One-to-one merge on specified key variables
    merge 1:1 varlist using filename
Many-to-one merge on specified key variables
    merge m:1 varlist using filename
One-to-many merge on specified key variables
    merge 1:m varlist using filename
Many-to-many merge on specified key variables
    merge m:m varlist using filename
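The merge steps can be sketched with two small made-up datasets. The example assumes a Stata version that supports the merge 1:1 syntax (Stata 11 or later); the temporary file name regions and all values are hypothetical.

```stata
* "Using" dataset: one row per household (hypothetical values)
clear
input hhid region
1 1
2 2
4 1
end
tempfile regions
save `regions'

* "Master" dataset
clear
input hhid income
1 500
2 300
3 800
end

* One-to-one merge on hhid
merge 1:1 hhid using `regions'

* _merge: 1 = master only, 2 = using only, 3 = matched in both
tabulate _merge
list
```

Households 1 and 2 are matched, household 3 appears only in the master data, and household 4 only in the using data.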

Merging with mmerge
There is also another way to merge different datasets: the user-written command mmerge.
mmerge is an extension of merge that makes matched merging safe. You need to specify the type of match to be performed; mmerge verifies that the requirements hold. It also makes merging easy, though that may not be obvious at a first look at the full syntax diagram.
In contrast with merge, the resulting data after mmerge do not depend on the order of observations in the master and using data. As a consequence, you are not required to sort the data yourself: the master and using data are automatically sorted.
mmerge displays the names of variables that occur in both master and using data.
If there is a _merge variable in the master or using data, it will be silently overwritten; mmerge automatically tabulates _merge.
The match variable(s) of the using data can be named differently from those of the master data.

Modifying Dataset: Append
    append using [filename]
append is a much simpler command than merge: it adds the observations of another dataset below those of the dataset in memory. Just make sure that:
The variable names are exactly the same in both files.
The variable types (string or numeric) are the same in both files.
Both files are saved as Stata files (".dta").
It is often used to create panel data datasets (we'll see later what panel data are).
    use appending_a_day2.dta, clear
    append using appending_b_day2.dta
    append using appending_c_day2.dta
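Since the course files are not reproduced here, the sketch below builds two made-up survey waves and stacks them. The temporary file name wave_b, the year variable, and all values are assumptions for illustration.

```stata
* Second wave, saved to a temporary file (hypothetical values)
clear
input hhid year income
1 2002 550
2 2002 320
end
tempfile wave_b
save `wave_b'

* First wave, in memory
clear
input hhid year income
1 2001 500
2 2001 300
end

* Stack the second wave below the first
append using `wave_b'

* Each household now has one row per year
sort hhid year
list
```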

Dataset Management
bysort runs a command separately for each value of a variable. Using just by requires the data to be sorted by the variable in question; bysort does that for you.
    bysort region: gen rage=sum(age>30)
    bysort region: reg pcexp femhead
    bysort region: su femhead agehead
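With gen, sum() is a running sum, so the first bysort example produces a running count within each region. The sketch below uses made-up data and adds hhid in parentheses so that the order within each region is fixed; the variable n_over30 is a hypothetical addition for comparison.

```stata
* Toy data (hypothetical values)
clear
input hhid region age
1 1 25
2 1 40
3 1 35
4 2 50
5 2 20
end

* Running count, within region, of people older than 30
* (hhid) fixes the order of observations inside each region
bysort region (hhid): gen rage = sum(age > 30)

* For comparison: one total per region, repeated on every row
bysort region: egen n_over30 = total(age > 30)

sort region hhid
list
```

In region 1 the running count goes 0, 1, 2, while n_over30 is 2 on every row of that region.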

Creating dummies
From the Stata manual: a dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, or sex is female) and 0 means it is false. Dummy variables are also called indicator variables.
There are three ways to create dummy variables:
use generate, which creates one dummy variable at a time;
use tabulate, which creates whole sets of dummies at once;
use xi or the i. prefix, which may allow you to avoid the issue of dummy creation altogether.
With generate:
    gen youngHH = 0
    replace youngHH = 1 if agehead<25
or
    gen youngHH1 = (agehead<25)
With tabulate:
    tabulate region
    tabulate region, gen(dummy_region)
With the i. prefix in the estimation command:
    regress logincome agehead female i.region
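The generate and tabulate approaches can be compared on a toy dataset. This is a sketch with made-up values; the guard against missing ages is an addition for illustration, not part of the slide.

```stata
* Toy data (hypothetical values); household 3 has an unknown age
clear
input hhid agehead region
1 22 1
2 40 2
3 . 3
4 24 2
end

* A logical expression returns 1 or 0; the if qualifier
* leaves the dummy missing when the age itself is missing
gen youngHH = (agehead < 25) if !missing(agehead)

* One dummy per region value: dummy_region1, dummy_region2, dummy_region3
tabulate region, gen(dummy_region)

list
```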
