Stata Basic Course Lab 4.

Modifying & Managing Data: Labeling Values

Value labels attach word labels to the values of a categorical variable.
Example: a variable is coded no=0 and yes=1 in the dataset. With a value label we can read "Yes" and "No" in the output even though the stored values are 0 and 1.

To make value labels:

Step One: Define the label
label def [label name] [value1] "[label for value1]" [value2] "[label for value2]"
Remember: the value always comes first and the label always goes in quotes.
Ex: label def yn 0 "No" 1 "Yes"

Step Two: Apply the value label to the variable
label val [variable you are labeling] [label you want to apply]
Note: the same label can be applied to multiple variables!
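The two steps above can be sketched as a short do-file. The small dataset and the second variable owner are made up for illustration; only the label yn and the variable femhead come from the slides.

```stata
* Toy data: two 0/1 variables (owner is a hypothetical second variable)
clear
input femhead owner
0 1
1 0
1 1
end

* Step One: define the label once
label define yn 0 "No" 1 "Yes"

* Step Two: apply the same label to as many variables as needed
label values femhead yn
label values owner yn

* Output now shows "No"/"Yes"; the stored values are still 0 and 1
tabulate femhead
label list yn
```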

sort and gsort commands

sort arranges the observations in ascending order of the values of a variable. To sort by the variable income, type:
sort income
This arranges the observations from the lowest income to the highest.

You can sort observations by more than one variable. For instance, if you type
sort femhead income
Stata sorts the observations first by femhead and then, within each value of femhead, by income.

To see the sorted data, look in the data window or use the list command. list shows the specified variables on the screen. If you type
list femhead income
you can see income values from the lowest to the highest, separately for male- and female-headed households.

sort can only sort the observations in ascending order. Sometimes you may want to sort observations in descending order, from the largest to the smallest. For this purpose use the gsort command with a minus sign in front of the variable:
gsort -income
This sorts observations from the largest income to the smallest. (Without the minus sign, gsort income sorts in ascending order, just like sort.)

You can also use more than one variable:
gsort femhead -income
This sorts income from the largest to the smallest, separately for male- and female-headed households.
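A minimal sketch of the difference between sort and gsort, using a made-up four-household dataset (the values are illustrative, the variable names follow the slides):

```stata
* Toy data for illustration only
clear
input femhead income
0 500
1 300
0 200
1 900
end

* Ascending by income
sort income
list

* Descending by income: note the minus sign
gsort -income
list

* Ascending by femhead, descending by income within each group
gsort femhead -income
list femhead income
```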

Keeping and deleting variables: drop / keep commands

Both are used in the same way but with opposite functions:
drop deletes the observations (rows) or variables (columns) that you specify.
keep deletes everything EXCEPT the observations (rows) or variables (columns) that you specify.

An important aside: if we don't want to lose information from the dataset, we should use preserve before dropping or keeping anything, and restore afterwards.
preserve: saves a copy of the data currently in memory, so that we can temporarily use another dataset or work with the variables without changing the original data.
restore: brings back the data as they were when preserve was typed.
So using these two commands around any step that cuts the dataset is recommended.

Preserve and restore and the use of tempfile

Before we drop some of our observations it is better to preserve the current dataset, so that we can recover the dropped observations at a later stage. The preserve command saves a copy of the dataset currently in memory. We can then recover that same dataset by typing restore.

Suppose we are only interested in female-headed households. To make our dataset smaller, we can type
keep if femhead==1
When we drop observations, all the information associated with them is erased from the dataset in memory. If we want them back, the original read-only dataset has to be opened again (unless you have used the preserve command).

We can use the count command before and after to see how many observations we cut. After typing restore, count should again report the number of observations in the original dataset.

We can also save the new dataset as a temporary file. tempfile assigns names to the specified local macros that may be used as names for temporary files. When the program or do-file concludes, any datasets created with these assigned names are erased.
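The sequence described above might look like this when run as a do-file. The data and the macro name female_hh are hypothetical; local macros such as the tempfile name only exist while the do-file runs.

```stata
* Toy data for illustration only
clear
input hhid femhead income
1 0 500
2 1 300
3 1 900
4 0 400
end

count                     // 4 observations before cutting

preserve                  // keep a copy of the full dataset
keep if femhead == 1
count                     // 2 observations remain

* Save the reduced data under a temporary name (erased when the do-file ends)
tempfile female_hh
save `female_hh'

restore                   // bring back the full dataset
count                     // 4 observations again

* The temporary file can still be opened later in the same do-file
use `female_hh', clear
```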

Specify the variables you want to drop or keep:
drop agehead
This drops the variable agehead from the dataset.
keep region hhid femhead income pcexp
This keeps only these 5 variables from the original dataset.

You can also specify the observations you want to delete using the "if" clause:
drop if region==1
Drops all observations for which the variable region is equal to one.
keep if region==2
Keeps only the observations for which the variable region is equal to two.
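As a sketch, the four commands above can be tried on a small made-up dataset with the same variable names:

```stata
* Toy data for illustration only
clear
input hhid region femhead income pcexp agehead
1 1 0 500 120 45
2 2 1 300  90 38
3 2 0 900 200 52
4 3 1 400 110 29
end

* Drop one variable (column)
drop agehead

* Keep a list of variables; here these are all that remain, so nothing changes
keep region hhid femhead income pcexp

* Drop observations (rows) that meet a condition
drop if region == 1

* Keep only observations that meet a condition
keep if region == 2

count
list
```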

Algebraic operators

The arithmetic operators are:
+ addition
- subtraction (or negation, creating the negative of a value)
* multiplication
/ division
^ raise to a power

To illustrate these operators, consider the following generate statements:
generate income1 = income+1 (addition)
generate negwage = -income (negative or negation)
generate femeduc = femhead*educave (multiplication)
generate income_ed = income/educave (division)
generate age_sq = agehead^2 (power)

Click on math functions. Scrolling down the list you will see many functions that are new to you. A few examples are:
generate logincome = log(income) (natural logarithm)
generate elincome = exp(income) (exponential function, the antilog of the natural log)
generate rootpcexp = sqrt(pcexp) (square root)

The if qualifier (we'll see more about it later) uses a logical expression to determine which observations to use. The relational and logical operators are:
< less than
<= less than or equal
== equal
> greater than
>= greater than or equal
!= not equal
& and
| or
! not (logical negation; ~ can also be used)
( ) parentheses are for grouping to specify order of evaluation
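A short sketch that combines arithmetic operators, math functions and logical expressions. The numbers are invented, and the variable check is a hypothetical helper added only to show that exp() reverses log().

```stata
* Toy data for illustration only
clear
input income educave agehead femhead pcexp
500  8 45 0 120
300  6 38 1  90
900 12 52 0 200
end

* Arithmetic operators
generate income1   = income + 1
generate negwage   = -income
generate femeduc   = femhead * educave
generate income_ed = income / educave
generate age_sq    = agehead^2

* Math functions
generate logincome = log(income)
generate rootpcexp = sqrt(pcexp)
generate check     = exp(logincome)   // approximately equal to income
list income logincome check

* Logical expressions in an if qualifier
count if income >= 300 & femhead == 0
count if !(agehead < 40) | femhead == 1
```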

Using gen/egen and replace/recode

Examples of the gen command:
generate a = 1 (column of ones)
gen land = rlandown+ilandown (adds rlandown & ilandown)
gen age40=1 if agehead>40 & agehead!=.
Be careful: Stata treats missing values as larger than any number, so be careful when using > and >=! Without the condition agehead!=. , observations with a missing agehead would also get age40=1.
Variable names can't start with numbers!

Examples of the replace command:
replace agehead=. if agehead==999 (e.g. questionnaire non-response)
replace over10=0 if age<=10
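The missing-value pitfall is easy to see on a toy example. The variable age40_wrong is hypothetical and exists only to show what happens when the missing-value condition is left out.

```stata
* Toy data: 999 is a non-response code, . is a genuine missing value
clear
input agehead
35
52
999
.
end

* Turn the non-response code into missing
replace agehead = . if agehead == 999

* Without the extra condition, missing ages are flagged as over 40
gen age40_wrong = 1 if agehead > 40

* With the condition, only observed ages above 40 are flagged
gen age40 = 1 if agehead > 40 & agehead != .
replace age40 = 0 if agehead <= 40

list
```

In the listing, age40_wrong equals 1 for the two households with missing age, while age40 stays missing for them.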

egen is often used for more sophisticated functions (statistical functions like mean, sd, etc.)

egen examples:
egen mean_age=mean(agehead)
egen mean_age_over40=mean(agehead) if over40==1
bysort age40: egen groupagemean=mean(agehead)
or, equivalently,
egen groupagemean=mean(agehead), by(age40)

recode
Recoding a variable means changing specific values of the variable to other values. It is better to generate a new variable before you recode. For example, before recoding the values of var1, generate a new variable called var1b (gen var1b = var1) and then work with the new variable.
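A minimal sketch showing that the bysort prefix and the by() option give the same group means. The data are made up, and groupagemean2 is a hypothetical second name used so that both versions can exist side by side.

```stata
* Toy data for illustration only
clear
input agehead age40
30 0
38 0
45 1
55 1
end

* Overall mean, repeated on every row
egen mean_age = mean(agehead)

* Group means: prefix form and option form
bysort age40: egen groupagemean = mean(agehead)
egen groupagemean2 = mean(agehead), by(age40)

* The two versions should agree on every observation
assert groupagemean == groupagemean2
list
```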

Recode all 0s to be 2s:
gen female = femhead
recode female (0=2)

Recode all 1s to be 0s AND 2s to be 1s:
recode female (1=0) (2=1)

Recode all 999s to be "missing":
recode dw_wealth (999=.)

Recode all values between 30 and 60 to be 45:
gen age = agehead
recode age (30/60=45)
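The recodes above can be tried on a small invented dataset. The last command uses recode's generate() option, which creates the new variable in one step; this is an alternative to the gen-then-recode pattern in the slides, not something the slides themselves use.

```stata
* Toy data for illustration only
clear
input femhead agehead dw_wealth
0 25   3
1 45 999
1 61   2
end

* Work on a copy so femhead itself is untouched
gen female = femhead
recode female (0=2)
recode female (1=0) (2=1)

* Non-response code to missing
recode dw_wealth (999=.)

* Collapse ages 30 to 60 into 45, leaving agehead unchanged
recode agehead (30/60=45), generate(age)

list
```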

Modifying Data: Merging Datasets

merge combines multiple datasets into one by matching on the variable you choose.

Step One: Identify the appropriate variable to use for merging
Note: The variable must exist in both datasets.

Step Two: Format data for merging
Make sure the variable has the same name in both datasets. If not, rename it or create a new variable.
Sort both datasets by the merging variable and save:
sort hhid
save xxx.dta, replace

Step Three: Open the "master" dataset (the primary dataset)

Step Four: Merge
merge [variable used for merging] using [DATASET BEING MERGED INTO the current one]
Ex: merge hhid using [DATASET]
This is the older form of the command; Stata 11 and later use the forms listed below.

Step Five: Examine merge success
tab _merge

Stata syntax for merge is:
One-to-one merge on specified key variables
merge 1:1 varlist using filename
Many-to-one merge on specified key variables
merge m:1 varlist using filename
One-to-many merge on specified key variables
merge 1:m varlist using filename
Many-to-many merge on specified key variables
merge m:m varlist using filename
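A minimal sketch of a one-to-one merge on hhid, assuming Stata 11 or later. Both datasets and the macro name hh_income are invented; a temporary file stands in for the saved using dataset.

```stata
* Using dataset: household incomes (toy data)
clear
input hhid income
1 500
2 300
3 900
end
tempfile hh_income
save `hh_income'

* Master dataset: household regions (toy data)
clear
input hhid region
1 1
2 2
4 1
end

* One-to-one merge on the key variable
merge 1:1 hhid using `hh_income'

* _merge: 1 = master only, 2 = using only, 3 = matched
tab _merge
list
```

Here households 1 and 2 appear in both files, household 4 only in the master and household 3 only in the using data, so all three _merge codes show up.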

There is also another way to merge datasets: the command mmerge.

mmerge is an extension of merge that makes matched merging safe. You need to specify the type of match to be performed; mmerge verifies that the requirements hold. It also makes merging easy, though that may not be obvious at first look at the full syntax diagram.

In contrast with merge, the resulting data after mmerge are independent of the order of observations in the master and using data. As a consequence, you are not required to sort the data yourself: the master and using data are automatically sorted.

mmerge displays the names of variables that occur in both the master and using data.
If there is a _merge variable in the master or using data, it will be silently overwritten; mmerge automatically tabulates _merge.
The match variable(s) of the using data can be named differently from those of the master data.

Modifying Dataset: Append

append using [filename]

append is a much simpler command than merge. Just make sure that:
The variable names are exactly the same in both files.
The variable types (string or numeric) are the same in both files.
Both files are saved as Stata files (".dta").

It is often used to create panel datasets (we'll see later what panel data are).

use appending_a_day2.dta, clear
append using appending_b_day2.dta
append using appending_c_day2.dta
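As an illustration of how append can stack two waves into a panel, here is a sketch with invented data. The variable year and the macro name wave1 are assumptions made for the example and do not come from the course files.

```stata
* First wave (toy data), saved to a temporary file
clear
input hhid income
1 500
2 300
end
gen year = 1
tempfile wave1
save `wave1'

* Second wave with the same variable names and types
clear
input hhid income
1 550
2 320
end
gen year = 2

* Stack the first wave underneath the second
append using `wave1'

* One row per household and year
sort hhid year
list
```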

Dataset Management

bysort runs a command separately for each value of a variable. Using just by requires the data to be sorted by the variable in question; bysort does that for you.
bysort region: gen rage=sum(age>30)
bysort region: reg pcexp femhead
bysort region: su femhead agehead
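A small sketch of bysort on made-up data. With generate, sum() is a running sum within each region; the egen line with total() is added here only as a contrast, and n_over30 is a hypothetical name.

```stata
* Toy data for illustration only
clear
input region age pcexp
1 25 100
1 35 150
1 40 120
2 50 200
2 28 180
end

* Running count of people over 30 within each region, in age order
bysort region (age): gen rage = sum(age > 30)

* Group total of people over 30, the same on every row of a region
bysort region: egen n_over30 = total(age > 30)

* Any command can follow the prefix
bysort region: summarize pcexp

list
```

Putting age in parentheses after region fixes the order within each region, so the running sum is reproducible.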

Creating dummies

From the Stata Manual: A dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, sex is female). Dummy variables are also called indicator variables.

There are three ways to create dummy variables:
use generate, which creates one dummy variable at a time;
use tabulate, which creates whole sets of dummies at once;
use xi, which may allow you to avoid the issue of dummy creation altogether.

With generate:
gen youngHH = 0
replace youngHH = 1 if agehead<25
or
gen youngHH1 = (agehead<25)

With tabulate:
tabulate region
tabulate region, gen(dummy_region)

Without creating dummies, using the i. prefix in the estimation command:
regress logincome agehead female i.region
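The three approaches can be compared on a small invented dataset. The variable youngHH2 is a hypothetical addition showing one way to keep the dummy missing when agehead is missing; the regression is only there to show the syntax, since eight made-up households say nothing substantive.

```stata
* Toy data for illustration only; one household has missing agehead
clear
input agehead region income
22 1 300
40 2 500
 . 3 700
30 1 400
24 2 350
51 3 800
36 1 450
28 3 600
end

* One dummy at a time
gen youngHH = 0
replace youngHH = 1 if agehead < 25
gen youngHH1 = (agehead < 25)

* Both versions code a missing agehead as 0; this one leaves it missing
gen youngHH2 = (agehead < 25) if !missing(agehead)

* A whole set at once: dummy_region1, dummy_region2, dummy_region3
tabulate region, gen(dummy_region)

* No dummies created by hand: i.region expands inside the command
gen logincome = log(income)
regress logincome agehead i.region
```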