Lecture 3: Changing Data

Lecture 3: Changing Data By: Kevin Baier

Lecture Summary

Topics Covered
Generate & Egen Commands
Variable Formats and Types
Variable and Value Labels
Preserve and Restore
Replace & Recode Commands
Percentiles

Generate Command

What is generate?
The generate command creates new variables based on some expression.
Datasets often have the parts needed to build variables of interest but do not natively contain those variables.
Generically: generate [type] newvar[:lblname] =exp [if] [in]
First we will discuss variable types, formats, and labels; then we'll circle back to generate.

Variable types
When generating a new variable, and elsewhere, you can set the variable type (i.e. how the variable is stored).
Types (type names can be used in programming):
byte: integers from -127 to 100
int: integers from -32,767 to 32,740
long: integers from about -2.147 billion to 2.147 billion
float: values from about -1.7e38 to 1.7e38; the closest to zero without being zero is about 1e-38
double: values from about -8.99e307 to 8.99e307; the closest to zero without being zero is about 1e-323
strX: string values with a maximum length of X

Variable types, cont.
The double type can store pretty much any number, at the cost of 8 bytes per value.
0-1 dummy variables hold the same values whether stored as double or as byte; byte simply uses less memory (1 byte per value).
Often you do not need to specify the variable type, as STATA will choose one automatically based on your work.
The type "strL" specifies a string variable of the maximum length (2 billion characters).
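A minimal sketch of how storage types can be set and inspected. The data and variable names (dummy_b, dummy_d, big, name) are made up for illustration and are not part of the CPS extract.

```stata
* Toy data with hypothetical variable names
clear
set obs 5

* The same 0/1 values stored in 1 byte and in 8 bytes
generate byte   dummy_b = mod(_n, 2)
generate double dummy_d = mod(_n, 2)

* A value too large for int, so it needs long
generate long   big     = 2000000000 + _n

* A string variable with room for 10 characters
generate str10  name    = "person" + string(_n)

* describe lists the storage type of each variable
describe

* compress demotes variables to the smallest type that loses no information
compress
describe
```

After compress, dummy_d should be stored as byte, which is the sense in which the two dummies are equivalent.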

Recast command
If you want to change a variable's type at any time, you can use the recast command.
Generically: recast type varlist [, force]
The force option should be used with caution, as it causes "the variables to be given the new storage type even if that will cause a loss of precision, introduction of missing values, or, for string variables, the truncation of strings."
Example: recast double mis, force
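The sketch below illustrates why force calls for caution, using a made-up variable x. The first recast is wrapped in capture noisily so the example runs whether STATA treats the refused recast as a note or as an error.

```stata
* Toy data: values that need double precision
clear
set obs 3
generate double x = _n + 0.123456789

* Without force, recast does not change x because precision would be lost
capture noisily recast float x
describe x

* With force, x becomes float and its values are rounded to float precision
recast float x, force
describe x
list x
```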

Variable formats
Variable formats set how a variable is displayed; for example, when viewing data in the data browser/editor or posting tables.
For numeric variables, only two formats will typically be used:
%#.#g
%#.#f
In the above formats:
The number to the left of the decimal point is the total width, in characters, to be displayed.
The number to the right of the decimal point is the number of digits displayed to the right of the decimal.

Variable formats, cont.
Example: %9.2f sets the display format to a total width of nine characters with 2 digits to the right of the decimal point. The decimal point itself takes one character, leaving at most six characters for the digits (and any minus sign) to the left.
With the g format, the number after the period is usually left at 0 (e.g. %9.0g), which lets STATA choose how many decimals to show.
You can use display formats to make non-integers display as integers, and so much more!
For some variables you may want commas at the thousands places. Use:
%#.#gc
%#.#fc

Variable formats, cont.
Note that all these formats are right-justified.
Adding a "-" (minus sign/dash) makes them left-justified.
Examples:
%-#.#g
%-#.#f
There are lots of different formats, but for the most part you only need to be familiar with the basic ones here.

Format command
You can change the format of variables at any time using the "format" command.
Generically, either order works:
format varlist %fmt
format %fmt varlist
Example: format mis %10.0g
This increases the total display width from 8 to 10 for mis.
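A short sketch of the format command on a made-up variable (income is a hypothetical name). It shows both argument orders, a fixed format with commas, and a left-justified general format; the stored values never change, only how they are displayed.

```stata
* Toy data with a hypothetical variable
clear
set obs 2
generate double income = 1234567.891 * _n

* Fixed format: width 12, 2 decimals, commas at the thousands places
format income %12.2fc
list income

* Same command with the arguments in the other order,
* now a left-justified general format
format %-12.0g income
list income
```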

Labels
There are variable labels and there are value labels.
Variable labels are a textual description of the variable and its values.
Example: the variable label on "wbhaom" is "Race, inc. Asian and Mixed"
Value labels are textual descriptions attached to each unique numeric value of a given variable.
Value labels are used only on numeric variables, and typically only on categorical or ordinal variables.
Example: tab wbhaom

Labels, cont.
In the previous example, the race variable is a numeric one with numeric values; however, textual descriptions have been assigned to each unique numeric value of the variable.
Example: tab wbhaom, nolabel
In addition to labeling values and variables, you can label a dataset too.
When you open the CPS data, you should see "(CEPR March Extract, Version 0.9.9.1, 2015)"

Label commands
Label your data
Generically: label data ["label"]
Example: label data "March 2015 CPS Super-Fun-Time"
Label variables
Generically: label variable varname ["label"]
Example: label variable female "=1 if participant is female"
It's hard to automate variable labeling unless the labels themselves differ by only a few distinct parts.

Label commands, cont.
Value labels have to be defined and then attached to variables: a two-step process.
Defining the label:
Generically: label define lblname # "label" [# "label" …] [, add modify replace nofix]
lblname = the name of your custom label
# = numeric value
"label" = the textual value label

Label commands, cont.
Defining the label:
Example: label define genderlab 0 "Male" 1 "Female"
Value label attribution:
Generically: label values varlist [lblname | .] [, nofix]
Example: label values female genderlab
The label name given to label values must match the name used in label define (here, genderlab).
You can label multiple variables with the same label in one command.
The "." option will simply delete any value labels attached to a particular variable.
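The labeling steps can be sketched end to end on a small made-up dataset. The variable married and the label yesno are hypothetical additions for illustration; female and genderlab follow the slides.

```stata
* Toy data: two 0/1 variables
clear
input byte female byte married
0 1
1 0
1 1
end

* Label the dataset and a variable
label data "Toy example data"
label variable female "=1 if participant is female"

* Step 1: define value labels. Step 2: attach them to variables
label define genderlab 0 "Male" 1 "Female"
label values female genderlab
label define yesno 0 "No" 1 "Yes"
label values married yesno

* With and without the value labels shown
tabulate female
tabulate female, nolabel

* Detach the value label from married
label values married .
describe
```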

Generate command
Generically: generate [type] newvar[:lblname] =exp [if] [in]
Example: gen incp_less_chldcr=incp_all-careval
Not setting "type" causes STATA to set it automatically (which 99% of the time is perfectly fine).
Example: gen long incp_less_chldcr=incp_all-careval
Same as before, but now we set the variable type to long. (Run only one of the two, since a variable name can be created only once.)
Suppose we create a variable on which we want to use a value label:
label define yesno 0 "No" 1 "Yes"
gen black_male:yesno=1 if wbhaom==2 & female==0
tab black_male
Note that black_male is 1 for Black men and missing for everyone else, so the "No" label goes unused unless the remaining observations are set to 0.

Final Thoughts on Generate
The "=exp" part of the generate command can be a multitude of things:
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
If a variable with missing values is part of an arithmetic combination, then the newly created variable will be missing wherever any of its components is missing.
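A small sketch of these expression types and of how missing values propagate. The variables income and childcare and their values are made up for illustration.

```stata
* Toy data with missing values in each component
clear
input income childcare
50000 2000
30000 .
. 1000
end

* Combination of two variables: missing wherever either input is missing
generate net_income = income - childcare

* Combination of a variable and a constant
generate income_k = income / 1000

* Exponentiation
generate income_sq = income^2

* Combination of constants only: the same value in every row
generate two_pi = 2 * 3.14159

list
```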

Egen Command

What is egen?
The egen command is an extension of generate and allows for more expansive creation of unique variables.
You'll want to use "help egen" often, as there are lots of different functions to use.
Examples:
rowtotal(varlist) [, missing]
rowmean(varlist)
total(exp) [, missing] (good for column totals)

Egen command
Generically: egen [type] newvar = fcn(arguments) [if] [in] [, options]
egen tax_credits=rowtotal(child_tc-eitc), missing
Note the "-" telling STATA to operate over every variable in that range (child_tc through eitc, in dataset order).
The ", missing" option tells STATA to set the new value to missing (instead of zero) if all values for that observation are missing.
egen avg_health_spnd=rowmean(fhipval-fmedval)
rowmean() has no "missing" option in its documented syntax; it already returns missing when every variable in the list is missing for an observation.
egen total_eitc=total(eitc)
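A sketch of the three egen functions on a made-up dataset; credit_a, credit_b, and credit_c are hypothetical stand-ins for a range of adjacent variables such as child_tc through eitc.

```stata
* Toy data: the second row is missing everywhere
clear
input credit_a credit_b credit_c
100 50 .
. . .
200 . 25
end

* Row total over the variable range; all-missing rows stay missing
egen credits_total = rowtotal(credit_a-credit_c), missing

* Row mean over the nonmissing values in each row
egen credits_mean = rowmean(credit_a-credit_c)

* Column total, repeated on every observation
egen credit_a_sum = total(credit_a)

list
```

Without the missing option, rowtotal() would report 0 for the second row, which is easy to mistake for a real zero.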

Preserve and Restore

What is P&R?
Preserve and restore allow you to temporarily change the dataset and then bring it back to its original form.
Generically:
preserve
commands…
restore

Preserve and Restore in Practice
Example:
preserve
drop year month
restore
This would initially create a dataset without year and month and then restore our original dataset with all the variables.
Preserve and restore should be run "together".
Do not run the preserve command and some other commands and then decide to run the restore command later.
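A minimal sketch of a preserve/restore pair run together as one unit, on made-up data (the variable income and its values are hypothetical).

```stata
* Toy data
clear
set obs 10
generate year   = 2015
generate month  = 3
generate income = 1000 * _n

* Temporary changes: drop variables and observations, then summarize
preserve
drop year month
keep if income > 5000
summarize income
restore

* The original 10 observations and all three variables are back
describe
```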

Replace and Recode Commands

What is replace?
Replace changes the contents/values of existing variables.
Generically: replace oldvar =exp [if] [in] [, nopromote]
"oldvar" is an existing variable.
"exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
The "nopromote" option prevents STATA from automatically changing the variable type during a replace.

Replace Command
Examples:
replace incp_all=0 if female==1
replace incp_all=incp_ern if eitc>0
Just like with generate, you use a single equals sign, "=", for the assignment part of the command (i.e. replace oldvar=exp).
Any equality test in an if condition requires a double equals sign, "==".
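The two replace examples can be traced on a small made-up dataset. The variable names income, earnings, and credit are hypothetical stand-ins for incp_all, incp_ern, and eitc.

```stata
* Toy data
clear
input female income earnings credit
0 40000 38000 0
1 25000 20000 500
0 15000 15000 1200
end

* Single "=" assigns; double "==" tests equality in the if condition
replace income = 0 if female == 1

* Overwrite income with earnings wherever the credit is positive
replace income = earnings if credit > 0

list
```

One caution when adapting this: STATA treats missing values as larger than any number, so a condition like credit > 0 is also true where credit is missing. Adding & !missing(credit) excludes those observations.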

What is recode?
Recode is similar to replace in that it changes the contents/values of existing variables.
Unlike replace, recode can be used on multiple variables at once (i.e. a "varlist") with at least one (but possibly more) rules for change.
Generically: recode varlist (erule) [(erule) ...] [if] [in] [, options]
For now we'll cover basic "rules" rather than "erules".

What is recode?, cont.
Common rules:
(# = .) changes every occurrence of the value # to missing

What is recode?, cont.
Most of the options are not things you need to worry about right now.
Recode is typically used with categorical variables.
Think about our race variable, where each race is its own category with a unique numeric value.

Recode Command
Example: recode wbhaom (5 6=.)
This changes the values of 5 and 6 in our race variable to missing.
recode wbhaom citstat (4=20) (5=22)
This changes the values of 4 to 20 and 5 to 22 in both the race and citizenship status variables.
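A sketch of recode on made-up data. The variables race and citstat here are toy stand-ins with hypothetical values, and the range rule, the else rule, and the generate() option in the last command go beyond the rules shown on the slides.

```stata
* Toy data
clear
input race citstat
1 1
4 4
5 5
6 2
end

* Several values to missing in one rule
recode race (5 6 = .)

* One rule applied to two variables at once
recode race citstat (4 = 20)

* A range rule plus a catch-all, written to a new variable
* so that citstat itself is left untouched
recode citstat (1/3 = 1) (else = 0), generate(citizen)

list
```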

Percentiles

Percentile Commands
Percentiles are useful in social science research.
For example, income inequality discussions are usually framed by the percentiles of income-earners.
STATA offers three commands for percentiles, two of which we'll cover here:
pctile
xtile
_pctile (we'll disregard this command)

pctile Command
Generically: pctile [type] newvar = exp [if] [in] [weight] [, pctile_options]
"exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
", nquantiles(#)" is the main option, which specifies the number of quantiles.
If you wanted 100 quantiles you would write ", nquant(100)".

pctile Command, cont.
Example: pctile pctile=incp_all, nquantiles(100)
This generates the value of income at each percentile; with 100 quantiles there are 99 cutpoints, stored in the first 99 observations of the new variable.
The pctile command generates the percentile values for the number of quantiles specified, but it does not indicate which percentile each observation falls in.
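A sketch of pctile with quartiles on made-up data, which keeps the output short. The variable names income and inc_cut are hypothetical.

```stata
* Toy data: 1,000 distinct income values
clear
set obs 1000
generate income = 100 * _n

* Four quantiles means three cutpoints, stored in observations 1 to 3
pctile inc_cut = income, nquantiles(4)
list inc_cut in 1/4

* Every other observation of inc_cut is missing
count if !missing(inc_cut)
```

The new variable holds cutpoints only; it is not aligned with the observations, so it says nothing about which quartile a given person is in.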

xtile Command
The xtile command records, for each observation, which percentile it falls in.
Generically: xtile newvar = exp [if] [in] [weight] [, xtile_options]
Just like with the pctile command, "exp" can be many things:
Another variable
The arithmetic combination (i.e. add, subtract, multiply, divide, exponentiate) of n variables
The arithmetic combination of n variables and m constants
The arithmetic combination of m constants
", nquantiles(#)" is also the main option with xtile, and it specifies the number of quantiles.
If you wanted 100 quantiles you would write ", nquant(100)".

xtile Command, cont.
Example: xtile inc_pctile=incp_all, nquantiles(100)
This tells STATA to mark, for each observation, which percentile it is in.
If we summarize (sum) our new percentile variable (inc_pctile):
We see all observations in our dataset have a value.
Values range from 1 to 100, so for each person we know which income percentile that person falls in.
The mean (45.21) shows that not all values of income are unique; tied values all land in the same percentile.
If all values of income were unique, the percentiles would be equally populated and the mean would be about 50.5.
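The effect of ties on the mean can be sketched with quartiles on made-up data; income, income_tied, and the inc_q variables are hypothetical names. With four equally populated groups the mean of the group number is 2.5, and ties at the bottom pull it below that.

```stata
* Toy data: 1,000 distinct income values
clear
set obs 1000
generate income = 100 * _n

* Each observation gets its quartile number, 1 to 4
xtile inc_q = income, nquantiles(4)
tabulate inc_q

* Introduce ties: the lowest 400 incomes all become 0
generate income_tied = cond(_n <= 400, 0, income)
xtile inc_q_tied = income_tied, nquantiles(4)
tabulate inc_q_tied

* Compare the means of the two quartile variables
summarize inc_q inc_q_tied
```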