CMGPD-LN Methodological Lecture Day 3


CMGPD-LN Methodological Lecture Day 3: Descriptive statistics using TABLE and COLLAPSE

Descriptive statistics
There are a number of ways in Stata to transform the dataset to produce descriptive statistics to be plotted or put into a figure.
Slow, manual way (not recommended):
    TABULATE
    Copy results to Excel, parse, and plot
Transformation to produce counts, averages, etc. according to the values of specified variables, to use as the basis of plots:
    TABLE, REPLACE
    COLLAPSE
    BYSORT combined with EGEN (to be discussed later)

Collapsing the data
TABLE, REPLACE and COLLAPSE transform the data.
For each value of a specified variable, or each combination of values of specified variables, they produce a single observation with summary statistics of other specified variables.
These summary statistics can be counts, sums, means, etc.

COLLAPSE
Start with a hypothetical dataset:

         +----------------+
         | x1   x2      y |
         |----------------|
      1. |  1    3     12 |
      2. |  2    3    100 |
      3. |  1    3     45 |
      4. |  2    3    -18 |
      5. |  1    3     73 |
      6. |  2    4     22 |
      7. |  1    4   -129 |
      8. |  2    4   -100 |
      9. |  1    4     -9 |
     10. |  2    4    112 |
         +----------------+

Replace the dataset with one that, for each combination of x1 and x2, contains the mean of y:

    . collapse y, by(x1 x2)
    . list

         +--------------------+
         | x1   x2          y |
         |--------------------|
      1. |  1    3   43.33333 |
      2. |  1    4        -69 |
      3. |  2    3         41 |
      4. |  2    4   11.33333 |
         +--------------------+

Or, starting again from the original dataset, count the number of records for each unique combination of x1 and x2:

    . collapse (count) y, by(x1 x2)
    . list

         +-------------+
         | x1   x2   y |
         |-------------|
      1. |  1    3   3 |
      2. |  1    4   2 |
      3. |  2    3   2 |
      4. |  2    4   3 |
         +-------------+

Or do both at the same time, creating the count and the average simultaneously. 'avgy=' tells collapse to store the mean in a new variable named avgy:

    . collapse (count) y (mean) avgy=y, by(x1 x2)
    . list

         +------------------------+
         | x1   x2   y       avgy |
         |------------------------|
      1. |  1    3   3   43.33333 |
      2. |  1    4   2        -69 |
      3. |  2    3   2         41 |
      4. |  2    4   3   11.33333 |
         +------------------------+
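The collapse examples above can be reproduced with a short do-file. This is a minimal sketch that types in the hypothetical dataset from the slide. The variable name n and the use of preserve/restore are illustrative choices, not part of the lecture.

```stata
* Enter the hypothetical dataset from the COLLAPSE slide
clear
input x1 x2 y
1 3 12
2 3 100
1 3 45
2 3 -18
1 3 73
2 4 22
1 4 -129
2 4 -100
1 4 -9
2 4 112
end

* collapse replaces the data in memory, so keep a copy first
preserve

* One observation per (x1, x2): n = count of y (hypothetical name), avgy = mean of y
collapse (count) n=y (mean) avgy=y, by(x1 x2)
list

* Bring back the original ten observations
restore
```

Wrapping the collapse in preserve/restore makes it possible to try several different collapses in one session without re-entering or reloading the data each time.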

TABLE, REPLACE
We can achieve the same thing with TABLE, REPLACE, though the resulting variable names are a bit cryptic:

    . table x1 x2, contents(count y mean y) replace

    ------------------------------
              |         x2
           x1 |        3         4
    ----------+-------------------
            1 |        3         2
              | 43.33333       -69
              |
            2 |        2         3
              |       41  11.33333
    ------------------------------

    . list

         +-----------------------------+
         | x1   x2   table1     table2 |
         |-----------------------------|
      1. |  1    3        3   43.33333 |
      2. |  1    4        2        -69 |
      3. |  2    3        2         41 |
      4. |  2    4        3   11.33333 |
         +-----------------------------+

Observations by year
The easy way to get a figure showing the number of observations by register year is to use histogram:

    histogram YEAR, discrete frequency ytitle("Observations") xtitle("Year") ///
        xlabel(1750(25)1900)

To force a monochromatic color scheme, we can add scheme(s1mono).
To override the default numeric format of the vertical axis labels, we can add ylabel(, format(%5.0f)):

    histogram YEAR, discrete frequency ytitle("Observations") xtitle("Year") ///
        xlabel(1750(25)1900) ylabel(, format(%5.0f)) scheme(s1mono)

Observations by year
We could do the same thing by using table to prepare the dataset and then twoway bar to plot it:

    table YEAR, contents(freq) replace
    twoway bar table1 YEAR, scheme(s1mono) xlabel(1750(25)1900) ///
        ytitle("Number of observations")

Or, if we want to do it as a scatter plot:

    twoway scatter table1 YEAR, scheme(s1mono) xlabel(1750(25)1900) ///
        ytitle("Number of observations")

Registers by year
The number of available registers varies from year to year.
This accounts for some of the year-to-year fluctuation in the number of observations.
In some cases, it may also account for some of the year-to-year fluctuation in other summary values.
We can do a year-by-year count of the number of available registers easily enough.

Registers by year

    * First pass: one observation per combination of YEAR and DATASET
    table YEAR DATASET, replace
    * Second pass: count those observations, giving the number of registers per YEAR
    table YEAR, replace
    twoway bar table1 YEAR, scheme(s1mono) ytitle("Registers")

Let's use angle and labsize on xlabel to label each register year individually:

    twoway bar table1 YEAR, scheme(s1mono) ytitle("Registers") ///
        xlabel(1750(3)1909, angle(vertical) labsize(vsmall))

Note that coverage is much sparser before 1789.
Some years (1810) are missing an especially large number of registers.
There are no registers at all from 1888 to 1903.

Population by age group
Let's use TABLE to look at the distribution of the population by age group. (The clear option belongs to use, not keep, so it is dropped from the keep command here.)

    keep if PRESENT & AGE >= 1 & AGE <= 75
    recode AGE_IN_SUI 1/15=1 16/55=16 56/75=56, generate(AGE_GROUP)
    tab AGE_GROUP SEX if SEX >= 1, col row
    table AGE_GROUP SEX if SEX >= 1, col row

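Output of tab. Each cell shows the frequency, then the row percentage, then the column percentage:

     RECODE of |
    AGE_IN_SUI |
       (Age in |          Sex
          Sui) |    Female       Male |     Total
    -----------+----------------------+----------
             1 |    36,300    234,332 |   270,632
               |     13.41      86.59 |    100.00
               |      6.88      28.12 |     19.88
    -----------+----------------------+----------
            16 |   393,977    500,381 |   894,358
               |     44.05      55.95 |    100.00
               |     74.67      60.04 |     65.71
    -----------+----------------------+----------
            56 |    97,333     98,716 |   196,049
               |     49.65      50.35 |    100.00
               |     18.45      11.84 |     14.40
    -----------+----------------------+----------
         Total |   527,610    833,429 | 1,361,039
               |     38.77      61.23 |    100.00
               |    100.00     100.00 |    100.00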

Output of table:

    -------------------------------------
    RECODE of |
    AGE_IN_SU |
    I (Age in |           Sex
    Sui)      |  Female     Male    Total
    ----------+--------------------------
            1 |  36,300  234,332  270,632
           16 | 393,977  500,381  894,358
           56 |  97,333   98,716  196,049
              |
        Total | 527,610  833,429  1361039
    -------------------------------------

Counts, averages, proportions by age and time
There are a variety of options for collapsing observations to produce counts, proportions, averages, etc. by year, age, etc.
One simple approach is the table command, combined with the replace option.
This replaces the dataset in memory with a 'collapsed' version.
Values in the 'collapsed' version can be plotted with twoway bar etc.
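For readers whose Stata does not accept table with the replace option (the table command was rewritten in Stata 17, and this sketch assumes the older contents() and replace syntax may then be unavailable outside version control), the same collapsed counts can be obtained with contract. The toy data and the lowercase variable names below are invented for illustration and are not drawn from the CMGPD-LN.

```stata
* Toy person-year records: year, age group, sex (1 = female, 2 = male)
clear
input year age_group sex
1789  1 1
1789  1 2
1789  1 2
1789 16 1
1789 16 2
1792  1 2
1792 16 1
1792 16 1
end

* Replace the data with one observation per combination of the three
* variables; n (a hypothetical name) holds the number of records in each cell
contract year age_group sex, freq(n)
list, sepby(year)
```

Unlike table with replace, contract adds no total rows, so there are no missing-value observations to drop afterwards. collapse (count) with by() would give the same counts, provided the counted variable has no missing values.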

    table AGE_GROUP SEX if SEX >= 1, by(YEAR) replace
    * Entries created for totals have missing values for AGE_GROUP
    drop if AGE_GROUP == .
    reshape wide table1, i(YEAR SEX) j(AGE_GROUP)
    * Also need to remove newly created totals with missing values for SEX
    drop if SEX == .
    reshape wide table11 table116 table156, i(YEAR) j(SEX)

    generate male_proportion_16_55 = table1162/(table112+table1162+table1562)
    twoway bar male_proportion_16_55 YEAR, ///
        ytitle("Proportion of males who are 16 to 55 sui") xtitle("Year") ///
        ylabel(0(0.1)1) scheme(s1mono)

    generate male_dependency_ratio = (table112+table1562)/(table1162)
    twoway bar male_dependency_ratio YEAR, ///
        ytitle("Male dependency ratio ((1-15 + 56-75)/(16-55))") xtitle("Year") ///
        ylabel(0(0.1)1) scheme(s1mono)

    generate child_sex_ratio = table112/table111
    twoway bar child_sex_ratio YEAR, ///
        ytitle("Ratio of males to females aged 1-15 sui") xtitle("Year") ///
        scheme(s1mono) yscale(log) ylabel(1 2 5 10 20 50 100 200)

Reshape
Notice that TABLE (and COLLAPSE) will produce one observation for each combination of YEAR, AGE_GROUP, and SEX.
    50*3*2 = 300 observations (approximately)
    299 in reality, because one cell is empty
We would like one observation per year in order to carry out calculations.
Use reshape to convert to one observation per combination of YEAR and SEX, with three variables, one for each of the age groups.
Use reshape again to convert to one observation per YEAR, with six variables per observation, one for each combination of SEX and AGE_GROUP.
We can then calculate dependency ratios, sex ratios, etc. from these numbers.
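The two-step reshape can be tried on a small made-up dataset to see how the variable names are built. This is a sketch only: the counts are invented, and the names year, sex, age_group, and n are hypothetical stand-ins for YEAR, SEX, AGE_GROUP, and table1.

```stata
* Toy collapsed data: one row per year, sex (1 = female, 2 = male), age group
clear
input year sex age_group n
1789 1  1 10
1789 1 16 40
1789 1 56  8
1789 2  1 30
1789 2 16 50
1789 2 56  9
1792 1  1 12
1792 1 16 42
1792 1 56  7
1792 2  1 33
1792 2 16 52
1792 2 56 10
end

* Step 1: one row per year and sex; the age group is appended to the stub,
* giving n1, n16, and n56
reshape wide n, i(year sex) j(age_group)

* Step 2: one row per year; sex is appended in turn,
* giving n11 n12 n161 n162 n561 n562
reshape wide n1 n16 n56, i(year) j(sex)

* Ratios can now be computed within each row
generate male_dependency_ratio = (n12 + n562) / n162
generate child_sex_ratio = n12 / n11
list year male_dependency_ratio child_sex_ratio
```

Reading the generated names from right to left helps: in n162, the final 2 is the sex (male) and the 16 before it is the age group, which is how table1162 in the lecture code comes to mean males aged 16 to 55 sui.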

Proportions/means
Proportion ever married by year
We can also calculate means of specified variables by YEAR, AGE_IN_SUI, or other variables of interest.

    use "C:\Users\Cameron Campbe\Documents\Baqi\CMGPD-LN from ICPSR\ICPSR_27063\DS0001\27063-0001-Data.dta" ///
        if PRESENT & AGE >= 16 & AGE <= 50 & SEX == 2 & MARITAL_STATUS >= 0, clear
    recode AGE_IN_SUI 16/30=16 31/40=31 41/50=41, generate(age_group)
    generate ever_married = MARITAL_STATUS != 2
    table YEAR age_group, contents(mean ever_married) replace

    twoway bar table1 YEAR if age_group == 16, ylabel(0(0.1)1) ///
        ytitle("Proportion of men 16-30 ever married") xtitle("Year") scheme(s1mono)
    twoway bar table1 YEAR if age_group == 31, ylabel(0(0.1)1) ///
        ytitle("Proportion of men 31-40 ever married") xtitle("Year") scheme(s1mono)

Proportion married by age

    use "C:\Users\Cameron Campbe\Documents\Baqi\CMGPD-LN from ICPSR\ICPSR_27063\DS0001\27063-0001-Data.dta" ///
        if PRESENT & AGE >= 1 & AGE <= 50 & SEX == 2 & MARITAL_STATUS >= 0, clear
    generate ever_married = MARITAL_STATUS != 2
    table AGE_IN_SUI, contents(mean ever_married) replace
    twoway bar table1 AGE_IN_SUI, ylabel(0(0.10)1) ///
        ytitle("Proportion of males ever married") xtitle("Age in sui") scheme(s1mono)

Multiple trends in the same graph

    keep if SEX == 2 & PRESENT & BIRTHYEAR >= 1750 & BIRTHYEAR <= 1900
    keep if MARITAL_STATUS > 0
    keep if AGE_IN_SUI >= 11 & AGE_IN_SUI <= 40
    recode AGE_IN_SUI 11/15=11 16/20=16 21/25=21 26/30=26 31/35=31 36/40=36, ///
        generate(age_group)
    generate ever_married = MARITAL_STATUS != 2
    table BIRTHYEAR age_group, contents(mean ever_married) replace

    twoway line table1 BIRTHYEAR if age_group == 11 ///
        || line table1 BIRTHYEAR if age_group == 16 ///
        || line table1 BIRTHYEAR if age_group == 21 ///
        || line table1 BIRTHYEAR if age_group == 26 ///
        || line table1 BIRTHYEAR if age_group == 31 ///
        || line table1 BIRTHYEAR if age_group == 36 ///
        || , scheme(s1mono) ///
        legend(order(1 "11-15 sui" 2 "16-20 sui" 3 "21-25 sui" ///
                     4 "26-30 sui" 5 "31-35 sui" 6 "36-40 sui")) ///
        ytitle("Proportion of males ever married")

Using COLLAPSE

    keep if PRESENT & SEX == 2 & AGE_IN_SUI > 1 & AGE_IN_SUI <= 60
    mvdecode _all, mv(-99 -98)
    generate MARRIED = MARITAL_STATUS == 1

By default, collapse will create variables of the same name containing means:

    collapse MARRIED SON_COUNT DAUGHTER_COUNT FATHER_ALIVE MOTHER_ALIVE BROTHER_COUNT, ///
        by(AGE_IN_SUI)

Notice the use of legend to specify a label for each of the 5 lines:

    twoway line FATHER_ALIVE MOTHER_ALIVE MARRIED SON_COUNT BROTHER_COUNT AGE_IN_SUI, ///
        scheme(s1mono) ///
        legend(order(1 "Father alive" 2 "Mother alive" 3 "Wife alive" ///
                     4 "Sons ever born" 5 "Brothers alive")) ///
        ytitle("Mean") lpattern(solid solid dash dot dash_dot)

Calculating rates
Calculation of demographic rates by age and so forth is straightforward, using the AT_RISK_* and NEXT_* flag variables.
Let's calculate and compare the probability of marriage in the next three years by age, for men and women:

    keep if AT_RISK_MARRY == 1 & SEX > 0 & AGE_IN_SUI > 0 & AGE_IN_SUI <= 30
    collapse NEXT_MARRY, by(AGE_IN_SUI SEX)
    twoway line NEXT_MARRY AGE_IN_SUI if SEX == 1 ///
        || line NEXT_MARRY AGE_IN_SUI if SEX == 2 ///
        || , legend(order(1 "Female" 2 "Male")) scheme(s1mono)
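The rate calculation works because the mean of a 0/1 flag among the observations at risk is the proportion experiencing the event. A minimal sketch on invented data follows; the lowercase names age, sex, next_marry, rate, and n are hypothetical and only mimic AGE_IN_SUI, SEX, and NEXT_MARRY.

```stata
* Toy at-risk records: next_marry is 1 if the person marries by the next register
clear
input age sex next_marry
16 1 0
16 1 1
16 2 0
16 2 0
17 1 1
17 1 1
17 2 0
17 2 1
end

* Mean of the flag = proportion marrying; also keep the number at risk
collapse (mean) rate=next_marry (count) n=next_marry, by(age sex)
list, sepby(age)
```

Keeping the count alongside the mean is a design choice worth considering with real data, since rates computed from very few at-risk observations can fluctuate widely from one age to the next.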