1
Lecture 3: Changing Data
By: Kevin Baier
2
Lecture Summary
3
Topics Covered
Generate & Egen Commands
Variable Formats and Types
Variable and Value Labels
Preserve and Restore
Replace & Recode Commands
Percentiles
4
Generate Command
5
What is generate?
The generate command creates new variables based on some expression.
Datasets often have the parts needed to build variables of interest but do not natively contain those variables.
Generically:
    generate [type] newvar[:lblname] = exp [if] [in]
First we will discuss variable types, formats, and labels; then we'll circle back to generate.
6
Variable types
When generating a new variable, and elsewhere, you can set the variable type (i.e. how the variable is stored).
Types (type names can be used in programming):
    byte: integers from -127 to 100
    int: integers from -32,767 to 32,740
    long: integers from about -2.15 billion to 2.15 billion
    float: values from about -1.7e38 to 1.7e38, with nonzero magnitudes as small as about 1e-38 (roughly 7 digits of precision)
    double: values from about -8.99e307 to 8.99e307, with nonzero magnitudes as small as about 1e-323 (roughly 16 digits of precision)
    strX: string values with a maximum length of X characters
7
Variable types, cont.
The double type can store pretty much any number you will encounter.
A 0-1 dummy variable holds the same values whether it is stored as double or as byte; byte simply uses less memory (1 byte per value instead of 8).
Often you do not need to specify the variable type, as Stata will choose one automatically.
The type "strL" specifies a string variable of the maximum length (2 billion characters).
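The storage types above can be tried out on a small toy dataset. This is a minimal sketch, not part of the original slides: the three-observation dataset and the variable names (dummy_b, dummy_d, big, code) are hypothetical.

```stata
* Hypothetical toy data: three observations, not the CPS extract
clear
set obs 3

* The same 0/1 values stored in 1 byte and in 8 bytes
generate byte   dummy_b = mod(_n, 2)
generate double dummy_d = mod(_n, 2)

* 2 billion fits in a long but is far too large for an int
generate long   big     = 2000000000

* A string variable holding up to 5 characters
generate str5   code    = "abc"

* The values are identical; only the storage differs
assert dummy_b == dummy_d

* The storage type column lists byte, double, long, and str5
describe
```

Running describe after each change is a cheap way to confirm what type Stata actually used.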
8
Recast command
If you want to change a variable's type at any time, you can use the recast command.
Generically:
    recast type varlist [, force]
The force option should be used with caution, as it causes "the variables to be given the new storage type even if that will cause a loss of precision, introduction of missing values, or, for string variables, the truncation of strings."
Example:
    recast double mis, force
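To see why force deserves caution, it can help to compare a safe recast with one that Stata declines to perform. The sketch below is an illustration on hypothetical toy variables (x and y), not an example from the lecture.

```stata
* Hypothetical toy data
clear
set obs 3
generate double x = _n
generate double y = _n + 0.5

* Safe: the values 1, 2, 3 fit in a byte, so the type changes
recast byte x

* Unsafe: 1.5, 2.5, 3.5 cannot be stored as integers.
* Without force, Stata reports that values would be changed
* and leaves y stored as double.
recast int y

describe x y
```

Only adding ", force" to the second recast would alter the stored values, which is exactly the loss of information the option's warning describes.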
9
Variable formats
Variable formats set how a variable's values are displayed, for example when viewing data in the data browser/editor or producing tables.
For numeric variables, typically only two formats are used:
    %#.#g
    %#.#f
In the above formats:
The number to the left of the decimal point is the total display width in characters.
The number to the right of the decimal point is the number of digits displayed to the right of the decimal point.
10
Variable formats, cont.
Example: %9.2f displays the variable in a field nine characters wide with 2 digits to the right of the decimal point. The decimal point itself takes one character, so at most 6 characters remain for the sign and the digits to the left of the decimal point.
You can use display formats to make non-integers display as integers (for example, %9.0f), and much more.
For some variables you may want commas at the thousands places. Use:
    %#.#gc
    %#.#fc
11
Variable formats, cont.
Note that all these formats are right-justified.
Adding a "-" (minus sign/dash) after the "%" makes them left-justified.
Examples:
    %-#.#g
    %-#.#f
There are many other formats, but for the most part you only need to be familiar with the basic ones shown here.
12
Format command
You can change the format of variables at any time using the "format" command.
Generically, either order works:
    format varlist %fmt
    format %fmt varlist
Example:
    format mis %10.0g
This increases the total display width for mis from 8 to 10.
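The effect of a display format is easiest to see on a single value. The following is a small sketch under assumed inputs: the variable name income and the value 1234567.891 are made up for illustration.

```stata
* Hypothetical toy data: one observation
clear
set obs 1
generate double income = 1234567.891

* Width 12, 2 decimals, commas at the thousands places
format income %12.2fc
list income

* The stored value is unchanged; only the display differs
display %12.2fc income[1]
display %12.0f  income[1]
display %-12.0f income[1]
```

The first display line shows 1,234,567.89, while the two %12.0f variants show the same value rounded to an integer, right- and left-justified respectively.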
13
Labels
There are variable labels and there are value labels.
Variable labels are a textual description of the variable and its values.
Example: the variable label on "wbhaom" is "Race, inc. Asian and Mixed".
Value labels are textual descriptions attached to each unique numeric value of a given variable.
Value labels are used only on numeric variables, and typically only on categorical or ordinal variables.
Example:
    tab wbhaom
15
Labels, cont.
In the previous example, the race variable is numeric with numeric values; however, a textual description has been assigned to each unique numeric value of the variable.
Example:
    tab wbhaom, nolabel
In addition to labeling values and variables, you can label a dataset too.
When you open the CPS data, you should see "(CEPR March Extract, Version , 2015)".
17
Label commands
Label your data
Generically:
    label data ["label"]
Example:
    label data "March 2015 CPS Super-Fun-Time"
Label variables
Generically:
    label variable varname ["label"]
Example:
    label variable female "=1 if participant is female"
Variable labeling is hard to automate unless the labels differ from one another in only a few distinct parts.
18
Label commands, cont.
Value labels have to be defined and then attached to variables: a two-step process.
Defining the label, generically:
    label define lblname # "label" [# "label" ...] [, add modify replace nofix]
lblname = the name of your custom label
# = numeric value
label = "Textual Value Label"
19
Label commands, cont.
Defining the label, example:
    label define genderlab 0 "Male" 1 "Female"
Value label attribution, generically:
    label values varlist [lblname | .] [, nofix]
Example:
    label values female genderlab
The name given to label values must match the name used in label define (here, genderlab).
You can label multiple variables with the same label in one command.
Using "." in place of a label name detaches any value label from the listed variables; the label definition itself is not deleted.
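The define-then-attach sequence can be run end to end on a toy dataset. This is a minimal sketch: the four-observation dataset is hypothetical, while the label name genderlab and the variable name female follow the slides.

```stata
* Hypothetical toy data: four observations
clear
set obs 4
generate byte female = mod(_n, 2)

* Variable label: describes the variable as a whole
label variable female "=1 if participant is female"

* Step 1: define the value label
label define genderlab 0 "Male" 1 "Female"

* Step 2: attach it to the variable
label values female genderlab

* With labels, then with the underlying numeric codes
tabulate female
tabulate female, nolabel

* Detach the value label; genderlab itself stays defined
label values female .
label list genderlab
```

After the final two commands, tabulate female would show 0 and 1 again, but genderlab can still be attached to this or any other variable.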
20
Generate command
Generically:
    generate [type] newvar[:lblname] = exp [if] [in]
Example:
    gen incp_less_chldcr=incp_all-careval
Not setting "type" causes Stata to choose it automatically (which 99% of the time is perfectly fine).
Example:
    gen long incp_less_chldcr=incp_all-careval
Same as before, but now we set the variable type to long.
Suppose we create a variable on which we want to use a value label:
    label define yesno 0 "No" 1 "Yes"
    gen black_male:yesno=1 if wbhaom==2 & female==0
    tab black_male
Note that observations that do not satisfy the if condition are set to missing, not to 0.
22
Final Thoughts on Generate
The "=exp" part of the generate command can be a multitude of things:
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
If a variable that contains missing values is part of an arithmetic combination, then the newly created variable will be missing wherever any of its components is missing.
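Both points, missing-value propagation and the behavior of "if" in generate, can be checked on a few rows. The sketch below uses hypothetical toy data and hypothetical variable names (race, income, childcare, net, bm_a, bm_b); only the yesno label comes from the slides.

```stata
* Hypothetical toy data: three observations
clear
input race female income childcare
2 0 100 20
2 1 200 .
1 0 .   10
end

* Arithmetic on a missing component yields a missing result
generate long net = income - childcare

label define yesno 0 "No" 1 "Yes"

* 1 where the condition holds, missing everywhere else
generate byte bm_a:yesno = 1 if race == 2 & female == 0

* 1 where the condition holds, 0 everywhere else
generate byte bm_b:yesno = (race == 2 & female == 0)

list
```

The second form gives a true 0/1 dummy, so the "No" label actually appears in tabulations. It assigns 0 even when race or female is missing, which may or may not be what you want.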
23
Egen Command
24
What is egen?
The egen command is an extension of generate and allows for more expansive creation of new variables.
You'll want to use "help egen" often, as there are many different functions available.
Examples:
    rowtotal(varlist) [, missing]
    rowmean(varlist)
    total(exp) [, missing] (good for column totals)
25
Egen command
Generically:
    egen [type] newvar = fcn(arguments) [if] [in] [, options]
Examples:
    egen tax_credits=rowtotal(child_tc-eitc), missing
Note the "-", which tells Stata to use every variable from child_tc through eitc, in the order they appear in the dataset.
The ", missing" option tells Stata to set the new value to missing (instead of zero) if all values for that observation are missing.
    egen avg_health_spnd=rowmean(fhipval-fmedval), missing
    egen total_eitc=total(eitc)
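The difference the missing option makes shows up only on rows where every component is missing. Here is a small sketch on hypothetical data; the variable names a, b, and c and the new variable names are placeholders.

```stata
* Hypothetical toy data: the last row is entirely missing
clear
input a b c
1 2 3
. 5 .
. . .
end

* Default: missing components count as 0, so row 3 totals 0
egen tot_default = rowtotal(a-c)

* With missing: row 3 is missing because all components are
egen tot_missing = rowtotal(a-c), missing

* Row mean over the nonmissing components
egen avg = rowmean(a-c)

* Column total of a, repeated on every observation
egen col_total = total(a)

list
```

Row 2 totals 5 under both versions, because at least one component is present.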
26
Preserve and Restore
27
What is P&R?
Preserve and restore allow you to change the dataset temporarily and then bring it back to its original form.
Generically:
    preserve
    commands...
    restore
28
Preserve and Restore in Practice
Example:
    preserve
    drop year month
    restore
This first creates a dataset without year and month and then restores our original dataset with all the variables.
Preserve and restore should be run "together", in a single run of your do-file.
Do not run preserve, run some other commands, and then decide to run restore later. If a do-file ends before reaching restore, Stata restores the preserved data automatically.
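A quick way to convince yourself that restore really brings everything back is to check for the dropped variables afterwards. This is a minimal sketch on a hypothetical five-observation dataset; the id variable is made up for illustration.

```stata
* Hypothetical toy data
clear
set obs 5
generate year  = 2015
generate month = 3
generate id    = _n

preserve
    * Temporary change: work without year and month
    drop year month
    describe
restore

* Back to the original data: all three variables exist again
confirm variable year month id
describe
```

Indenting the commands between preserve and restore is purely cosmetic, but it makes the temporary section easy to spot in a do-file.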
29
Replace and Recode Commands
30
What is replace?
Replace changes the contents/values of existing variables.
Generically:
    replace oldvar = exp [if] [in] [, nopromote]
"oldvar" is an existing variable.
"exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
The "nopromote" option prevents Stata from automatically changing the variable type during a replace.
31
Replace Command
Examples:
    replace incp_all=0 if female==1
    replace incp_all=incp_ern if eitc>0
Just like with generate, you use a single equals sign, "=", for the assignment part of the command (i.e. replace oldvar=exp).
Any test for equality inside an if condition requires a double equals sign, "==".
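One detail worth checking when writing conditions such as "if eitc>0" is that Stata treats numeric missing values as larger than any number, so a "greater than" test is also true for missing. The sketch below illustrates this on hypothetical toy data; x, flag, y1, and y2 are placeholder names.

```stata
* Hypothetical toy data: flag is missing in the second row
clear
input x flag
10 1
20 .
30 0
end

generate y1 = x
generate y2 = x

* Missing counts as greater than 0, so rows 1 and 2 change
replace y1 = 0 if flag > 0

* Excluding missing explicitly changes row 1 only
replace y2 = 0 if flag > 0 & !missing(flag)

list
```

Whether rows with a missing flag should be changed is a substantive decision; the point is only to make that decision explicit in the condition.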
32
What is recode?
Recode is similar to replace in that it changes the contents/values of existing variables.
Unlike replace, recode can be used on multiple variables at once (i.e. a "varlist") with one or more rules for change.
Generically:
    recode varlist (erule) [(erule) ...] [if] [in] [, options]
For now we'll cover basic "rules" rather than "erules".
33
What is recode?, cont.
Common rules:
    # = . (changes every occurrence of the value # to missing)
34
What is recode?, cont.
Most of the options are not things you need to worry about right now.
Recode is typically used with categorical variables.
Think about our race variable, where each race is its own category with a unique numeric value.
35
Recode Command
Example:
    recode wbhaom (5 6=.)
This changes the values 5 and 6 in our race variable to missing.
    recode wbhaom citstat (4=20) (5=22)
This changes the values 4 to 20 and 5 to 22 in both the race and citizenship status variables.
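Because recode overwrites the original values, it can be useful to try the rules on a copy first. The following sketch uses hypothetical toy data with placeholder variables race and citstat; the generate() option of recode writes the result to a new variable instead of changing the original.

```stata
* Hypothetical toy data
clear
input race citstat
1 4
5 5
6 1
end

* Write the recoded values to a new variable; race is untouched
recode race (5 6 = .), generate(race2)

* Apply the same two rules to both variables in place
recode race citstat (4 = 20) (5 = 22)

list
```

In the second command, values that match no rule (such as 1 and 6) are left as they are.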
36
Percentiles
37
Percentile Commands
Percentiles are useful in social science research.
For example, discussions of income inequality are usually framed in terms of the percentiles of income earners.
Stata offers three commands for percentiles, two of which we'll cover here:
    pctile
    xtile
    _pctile (we'll disregard this command)
38
pctile Command
Generically:
    pctile [type] newvar = exp [if] [in] [weight] [, pctile_options]
"exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
", nquantiles(#)" is the main option; it specifies the number of quantiles.
If you wanted 100 quantiles you would write ", nquant(100)".
39
pctile Command, cont.
Example:
    pctile pctile=incp_all, nquantiles(100)
This generates the value of income at each percentile. With 100 quantiles there are 99 cutpoints, which are stored in the first 99 observations of the new variable.
The pctile command generates the percentile values for the number of quantiles specified, but it does not indicate in which percentile each observation falls.
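The layout of the variable that pctile creates is easier to understand on a small example. This is a sketch on hypothetical data, with income set to the observation number; the names income and pct are placeholders.

```stata
* Hypothetical toy data: incomes 1 through 1000
clear
set obs 1000
generate income = _n

* 100 quantiles means 99 cutpoints
pctile pct = income, nquantiles(100)

* The cutpoints sit in the first 99 rows; the rest are missing
count if !missing(pct)
list pct in 1/5
list pct in 98/101
```

The new variable is not aligned with the observations: row 5 of pct is the 5th percentile of income, not a statement about the person in row 5.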
40
xtile Command
The xtile command records, for each observation, the percentile in which it falls.
Generically:
    xtile newvar = exp [if] [in] [weight] [, xtile_options]
Just like with the pctile command, "exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
", nquantiles(#)" is also the main option with xtile; it specifies the number of quantiles.
If you wanted 100 quantiles you would write ", nquant(100)".
41
xtile Command, cont.
Example:
    xtile inc_pctile=incp_all, nquantiles(100)
This tells Stata to mark, for each observation, the percentile in which it falls.
If we summarize (sum) our new percentile variable (inc_pctile), we see that all observations in our dataset have a value.
Values range from 1 to 100, so for each person we know in which income percentile that person falls.
The mean (45.21) shows that not all values of income are unique.
If all values of income were unique, each percentile would hold the same number of people and the mean would be approximately 50.5.
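The link between ties and the mean of the percentile variable can be reproduced on toy data. The sketch below is an illustration under assumed inputs, not the CPS result: income and income_ties are hypothetical variables, and the second one gives 400 people an income of exactly 0.

```stata
* Hypothetical toy data: 1000 unique incomes
clear
set obs 1000
generate income = _n
xtile inc_pctile = income, nquantiles(100)

* Every percentile holds 10 people, so the mean is 50.5
summarize inc_pctile

* Now give the first 400 people the same income of 0
generate income_ties = cond(_n <= 400, 0, _n)
xtile tie_pctile = income_ties, nquantiles(100)

* Tied observations all land in the same percentile,
* which pulls the mean away from 50.5
summarize tie_pctile
tabulate tie_pctile in 1/420
```

In this toy case all 400 tied observations fall in the lowest group, so several percentile groups are empty and the mean drops below 50.5.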