SJTU CMGPD 2012 Methodological Lecture Day 2 TABLE, COLLAPSE, HISTOGRAM, TWOWAY BAR.


Descriptive statistics
There are a number of ways in STATA to transform the dataset to produce descriptive statistics to be plotted or put into a figure.
Slow, manual way
– TABULATE
– Copy results to Excel, parse, and plot
– Not recommended
Transformation to produce counts, averages, etc. according to the values of specified variables, to use as the basis of plots
– TABLE, REPLACE
– COLLAPSE
– BYSORT combined with EGEN (to be discussed later)

Collapsing the data
TABLE, REPLACE and COLLAPSE transform the data.
For each value of a specified variable, or each combination of values of specified variables, they produce a single observation with summary statistics of other specified variables.
These summary statistics can be counts, sums, means, etc.

COLLAPSE
Start with a hypothetical dataset of 10 observations on x1, x2, and y.
[Listing of x1, x2, y for observations 1 to 10; values not preserved]
Replace the dataset with one that, for each combination of x1 and x2, contains the mean of y.
. collapse y, by(x1 x2)
. list
[Listing of x1, x2, y for 4 observations, one per combination of x1 and x2; values not preserved]

Or count the number of records for each unique combination of x1 and x2.
. collapse (count) y, by(x1 x2)
. list
[Listing of x1, x2, y for 4 observations; values not preserved]
Or do both at the same time, creating the count and the average simultaneously. 'avgy=' tells collapse to store the mean in a new variable named avgy.
. collapse (count) y (mean) avgy=y, by(x1 x2)
. list
[Listing of x1, x2, y, avgy for 4 observations; values not preserved]
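The COLLAPSE steps above can be tried on a small made-up dataset. This is a minimal sketch: the ten values below are invented for illustration and are not the ones from the original slide, and the new variable name n is an arbitrary choice.

```stata
* Hypothetical data: these values are invented, not taken from the lecture
clear
input x1 x2 y
1 1 10
1 1 20
1 2 30
1 2 50
2 1 5
2 1 15
2 1 10
2 2 40
2 2 60
2 2 80
end

* preserve/restore keeps the 10 original observations available afterwards
preserve
* One observation per combination of x1 and x2, with a count and a mean of y
collapse (count) n=y (mean) avgy=y, by(x1 x2)
list, clean
restore
```

With these invented values, the four collapsed observations have n = 2, 2, 3, 3 and avgy = 15, 40, 10, 60 for the combinations (1,1), (1,2), (2,1), (2,2). Wrapping the collapse in preserve and restore is one way to avoid reloading the data after each summary.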

TABLE, REPLACE
Can achieve the same thing with TABLE, REPLACE, though the resulting variable names (table1, table2) are a bit cryptic.
. table x1 x2, contents(count y mean y) replace
[Two-way table of x1 by x2 showing the count and mean of y in each cell; values not preserved]
. list
[Listing of x1, x2, table1, table2 for 4 observations; values not preserved]

histogram
Observations by year
The easy way to get a figure for numbers of observations by register year is to use histogram.
histogram YEAR, discrete frequency ytitle("Observations") xtitle("Year") xlabel(1750(25)1900)
To force a monochromatic color scheme, we can add scheme(s1mono).
To override the default numeric format of the vertical axis labels, we can add ylabel(,format(%5.0f)).
histogram YEAR, discrete frequency ytitle("Observations") xtitle("Year") xlabel(1750(25)1900) ylabel(,format(%5.0f)) scheme(s1mono)

histogram
Restricting the data
Often, in producing a histogram, it is necessary to prevent the display of invalid, implausible, or otherwise problematic observations.
– Missing values are always coded as -98 or -99, and should be excluded from graphs
Do this with an if restriction in the command.
This applies to tables as well.
Compare the results of
– histogram AGE_IN_SUI
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99

if and logical expressions in STATA
if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99 restricts the command to observations where AGE_IN_SUI is >= 1 and <= 99.
& represents AND
– Expression is evaluated as true only if ALL expressions are TRUE
| represents OR
– Expression is evaluated as true if ANY of the expressions are TRUE
May use parentheses ( ) to specify order of evaluation
! represents NOT
In a logical expression, TRUE is typically indicated as 1, and FALSE is indicated as 0.
If AGE_IN_SUI was 45, AGE_IN_SUI >= 1 would evaluate to 1, and AGE_IN_SUI <= 99 would evaluate to 1.
– 1 & 1 would evaluate to 1, TRUE
If AGE_IN_SUI was 105, AGE_IN_SUI >= 1 would evaluate to 1, and AGE_IN_SUI <= 99 would evaluate to FALSE or 0.
– 1 & 0 would evaluate to 0, FALSE
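The evaluation rules above can be checked directly with display, using literal numbers in place of AGE_IN_SUI. This is only a sketch of the same arithmetic as the two worked cases (45 and 105), plus one extra case for Stata's own missing value.

```stata
* Each comparison returns 1 (true) or 0 (false)
* Value 45: 1 & 1, displays 1
display (45 >= 1) & (45 <= 99)
* Value 105: 1 & 0, displays 0
display (105 >= 1) & (105 <= 99)
* OR needs only one true part: 0 | 1, displays 1
display (105 < 1) | (105 > 99)
* NOT flips the result: !1, displays 0
display !(45 >= 1)
* Stata's system missing value (.) is treated as larger than any number, displays 1
display (. >= 1)
```

The last line matters for range restrictions: a lower bound such as >= 1 excludes the negative missing codes (-98, -99) but would not exclude a value stored as ., whereas an upper bound such as <= 99 does.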

histogram
Some additional options
Tell STATA that the values are discrete, not continuous:
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99, discrete
Set the Y-axis to represent percentages:
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99, percent discrete
Customize labeling of the X-axis:
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99, percent discrete xlabel(0(10)100)
Add tick marks to the X-axis:
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99, percent discrete xlabel(0(10)100) xtick(0(5)100)
Produce separate graphs according to the value of another variable:
– histogram AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 99 & (SEX != -99), percent discrete xlabel(0(10)100) xtick(0(5)100) by(SEX)

table and bar to produce histograms
Observations by year
We could do the same thing with table to prepare the dataset, and then twoway bar.
table YEAR, contents(freq) replace
twoway bar table1 YEAR, scheme(s1mono) xlabel(1750(25)1900) ytitle("Number of observations")
Or if we want to do it as a scatter plot…
twoway scatter table1 YEAR, scheme(s1mono) xlabel(1750(25)1900) ytitle("Number of observations")
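The table command was redesigned in Stata 17, and the contents() and replace options used here appear no longer to be accepted in that and later releases. As a hedged alternative that does not depend on table, the same one-row-per-year counts can be produced with contract, which stores the count in a variable named _freq. The toy YEAR values below are invented for illustration.

```stata
* Hypothetical toy data standing in for the register observations
clear
input YEAR
1750
1750
1753
1753
1753
1756
end

* contract replaces the data with one observation per YEAR plus a count in _freq
contract YEAR
list, clean

* Same bar chart as above, with _freq in place of table1
twoway bar _freq YEAR, scheme(s1mono) ytitle("Number of observations")
```

With these invented values, _freq is 2, 3, and 1 for 1750, 1753, and 1756. collapse (count) with a by() option is another route to the same result.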

Registers by year
The number of available registers varies year by year.
This accounts for some of the year-to-year fluctuation in numbers of observations.
In some cases, it may also account for some of the year-to-year fluctuation in other summary values.
We can do a year-by-year count of the number of available registers easily enough.

Registers by year
The first table leaves one observation per combination of YEAR and DATASET; the second counts those observations within each YEAR.
table YEAR DATASET, replace
table YEAR, replace
twoway bar table1 YEAR, scheme(s1mono) ytitle("Registers")
Let's use angle and labsize on xlabel to label each register year individually.
twoway bar table1 YEAR, scheme(s1mono) ytitle("Registers") xlabel(1750(3)1909,angle(vertical) labsize(vsmall))
Note that coverage is much more sparse before [...]
Some years (1810) are missing an especially large number of registers
No registers at all from 1888 to 1903

Population by age group
Let's use TABLE to look at the distribution of the population by age group.
keep if PRESENT & AGE >= 1 & AGE <= 75
recode AGE_IN_SUI 1/15=1 16/55=16 56/75=56, generate(AGE_GROUP)
tab AGE_GROUP SEX if SEX >= 1, col row
table AGE_GROUP SEX if SEX >= 1, col row
recode maps values of an existing variable to new values, based on the specified rule.
If generate is not specified, it transforms the existing variable.
If generate is specified, it creates a new variable with the new values.
In this case, AGE_IN_SUI values 1 through 15 all get converted to 1, 16 through 55 are converted to 16, and so forth.
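The recode rule above can be tried on a few invented ages. This sketch uses the parenthesized form of the rules and includes two values outside the ranges (-99 and 80, both hypothetical) to show what happens to values no rule matches.

```stata
* Hypothetical ages, invented for illustration
clear
input AGE_IN_SUI
-99
1
15
16
55
56
75
80
end

* Same rules as above; generate() leaves AGE_IN_SUI itself untouched
recode AGE_IN_SUI (1/15=1) (16/55=16) (56/75=56), generate(AGE_GROUP)
list, clean

* Values inside a range take the range's new value
assert AGE_GROUP == 16 if inrange(AGE_IN_SUI, 16, 55)
* Values matched by no rule are copied over unchanged
assert AGE_GROUP == AGE_IN_SUI if AGE_IN_SUI < 1 | AGE_IN_SUI > 75
```

Because unmatched values such as -99 pass through unchanged, restricting the data to the valid age range before recoding, as the keep if line does, keeps stray codes out of the age groups.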

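Output of tab AGE_GROUP SEX if SEX >= 1, col row (frequencies only; row and column percentages not preserved)

   RECODE of |
  AGE_IN_SUI |          Sex
(Age in Sui) |    Female       Male |     Total
-------------+----------------------+----------
           1 |    36,300    234,332 |   270,632
          16 |   393,977    500,381 |   894,358
          56 |    97,333     98,716 |   196,049
-------------+----------------------+----------
       Total |   527,610    833,429 | 1,361,039

Output of table AGE_GROUP SEX if SEX >= 1, col row: the same frequencies, in table's layout.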

Counts, averages, proportions by age and time
There are a variety of options for collapsing observations to produce counts, proportions, averages, etc. by year, age, etc.
One simple approach is the table command, combined with the replace option.
– This replaces the dataset in memory with a 'collapsed' version
– Values in the 'collapsed' version can be plotted with twoway bar etc.

table AGE_GROUP SEX if SEX >= 1, by(YEAR) replace
* Entries created for totals have missing values for AGE_GROUP
drop if AGE_GROUP == .
reshape wide table1, i(YEAR SEX) j(AGE_GROUP)
* Also need to remove newly created totals with missing values for SEX
drop if SEX == .
reshape wide table11 table116 table156, i(YEAR) j(SEX)
generate male_proportion_16_55 = table1162/(table112+table1162+table1562)
twoway bar male_proportion_16_55 YEAR, ytitle("Proportion of males who are 16 to 55 sui") xtitle("Year") ylabel(0(0.1)1) scheme(s1mono)
generate male_dependency_ratio = (table112+table1562)/(table1162)
twoway bar male_dependency_ratio YEAR, ytitle("Male dependency ratio ((1-15)+(56-75))/(16-55)") xtitle("Year") ylabel(0(0.1)1) scheme(s1mono)
generate child_sex_ratio = table112/table111
twoway bar child_sex_ratio YEAR, ytitle("Ratio of males to females aged 1-15 sui") xtitle("Year") scheme(s1mono) yscale(log) ylabel( )

Reshape
Notice that TABLE (and COLLAPSE) will produce one observation for each combination of YEAR, age_group, and SEX.
– 50*3*2=300 observations (approximately)
– 299 in reality because one cell is empty
We would like one observation per year
– In order to carry out calculations
Use reshape to convert to one observation per combination of YEAR and SEX, with three variables, one for each of the age groups.
Use reshape again to convert to one observation per YEAR, with six variables per observation, one for each combination of SEX and age_group.
Can calculate dependency ratios, sex ratios, etc. from these numbers.
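The two reshape steps can be followed on a tiny made-up dataset. In this sketch the counts are invented; the variable names follow the ones used in the lecture (table1, with SEX coded 1 for female and 2 for male and age groups 1, 16, and 56).

```stata
* Hypothetical counts: one observation per YEAR x SEX x AGE_GROUP, values invented
clear
input YEAR SEX AGE_GROUP table1
1800 1 1 10
1800 1 16 30
1800 1 56 5
1800 2 1 20
1800 2 16 40
1800 2 56 10
1803 1 1 12
1803 1 16 33
1803 1 56 6
1803 2 1 24
1803 2 16 44
1803 2 56 12
end

* Step 1: one observation per YEAR x SEX; creates table11, table116, table156
reshape wide table1, i(YEAR SEX) j(AGE_GROUP)

* Step 2: one observation per YEAR; the SEX code (1 or 2) is appended to each name
reshape wide table11 table116 table156, i(YEAR) j(SEX)
describe, simple

* Males aged 1-15 plus 56-75, divided by males aged 16-55
generate male_dependency_ratio = (table112 + table1562) / table1162
list YEAR male_dependency_ratio, clean
```

After the second reshape the six count variables are table111, table112, table1161, table1162, table1561, and table1562, so a name such as table1162 reads as age group 16, sex 2. With the invented counts, the ratio is 0.75 for 1800 and about 0.82 for 1803.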

Proportions/means
Proportion ever married by year
We can also calculate means of specified variables by YEAR, AGE_IN_SUI, or other variables of interest.
use "C:\Users\Cameron Campbe\Documents\Baqi\CMGPD-LN from ICPSR\ICPSR_27063\DS0001\ Data.dta" if PRESENT & AGE >= 16 & AGE [...] = 0, clear
recode AGE_IN_SUI 16/30=16 31/40=31 41/50=41, generate(age_group)
generate ever_married = MARITAL_STATUS != 2
table YEAR age_group, contents(mean ever_married) replace
twoway bar table1 YEAR if age_group == 16, ylabel(0(0.1)1) ytitle("Proportion of men ever married") xtitle("Year") scheme(s1mono)
twoway bar table1 YEAR if age_group == 31, ylabel(0(0.1)1) ytitle("Proportion of men ever married") xtitle("Year") scheme(s1mono)

Proportion married by age
use "C:\Users\Cameron Campbe\Documents\Baqi\CMGPD-LN from ICPSR\ICPSR_27063\DS0001\ Data.dta" if PRESENT & AGE >= 1 & AGE [...] = 0, clear
generate ever_married = MARITAL_STATUS != 2
table AGE_IN_SUI, contents(mean ever_married) replace
twoway bar table1 AGE_IN_SUI, ylabel(0(0.10)1) ytitle("Proportion of males ever married") xtitle("Age in sui") scheme(s1mono)

Multiple trends in the same graph
keep if SEX == 2 & PRESENT & BIRTHYEAR >= 1750 & BIRTHYEAR <= 1900
keep if MARITAL_STATUS > 0
keep if AGE_IN_SUI >= 11 & AGE_IN_SUI <= 40
recode AGE_IN_SUI 11/15=11 16/20=16 21/25=21 26/30=26 31/35=31 36/40=36, generate(age_group)
generate ever_married = MARITAL_STATUS != 2
table BIRTHYEAR age_group, contents(mean ever_married) replace
twoway line table1 BIRTHYEAR if age_group == 11 || line table1 BIRTHYEAR if age_group == 16 || line table1 BIRTHYEAR if age_group == 21 || line table1 BIRTHYEAR if age_group == 26 || line table1 BIRTHYEAR if age_group == 31 || line table1 BIRTHYEAR if age_group == 36 ||, scheme(s1mono) legend(order(1 "11-15 sui" 2 "16-20 sui" 3 "21-25 sui" 4 "26-30 sui" 5 "31-35 sui" 6 "36-40 sui")) ytitle("Proportion of males ever married")

Using COLLAPSE
keep if PRESENT & SEX == 2 & AGE_IN_SUI > 1 & AGE_IN_SUI <= 60
mvdecode _all, mv( )
generate MARRIED = MARITAL_STATUS == 1
By default, collapse will create variables of the same name containing means.
collapse MARRIED SON_COUNT DAUGHTER_COUNT FATHER_ALIVE MOTHER_ALIVE BROTHER_COUNT, by(AGE_IN_SUI)
Notice the use of legend to specify a label for each of the 5 lines.
twoway line FATHER_ALIVE MOTHER_ALIVE MARRIED SON_COUNT BROTHER_COUNT AGE_IN_SUI, scheme(s1mono) legend(order(1 "Father alive" 2 "Mother alive" 3 "Wife alive" 4 "Sons ever born" 5 "Brothers alive")) ytitle("Mean") lpattern(solid solid dash dot dash_dot)

Calculating rates
Calculation of demographic rates by age and so forth is straightforward, using the AT_RISK_* and NEXT_* flag variables.
Let's calculate and compare the probability of marriage in the next three years by age, for men and women.
keep if AT_RISK_MARRY == 1 & SEX > 0 & AGE_IN_SUI > 0 & AGE_IN_SUI <= 30
collapse NEXT_MARRY, by(AGE_IN_SUI SEX)
twoway line NEXT_MARRY AGE_IN_SUI if SEX == 1 || line NEXT_MARRY AGE_IN_SUI if SEX == 2 ||, legend(order(1 "Female" 2 "Male")) scheme(s1mono)
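The rate calculation works because the mean of a 0/1 flag among the observations at risk is the proportion experiencing the event. The sketch below shows this on a few invented records; the variable names follow the lecture, while the values and the new names rate and n are assumptions made for illustration.

```stata
* Hypothetical records, invented for illustration
clear
input AGE_IN_SUI SEX AT_RISK_MARRY NEXT_MARRY
16 2 1 0
16 2 1 1
16 2 1 0
16 2 1 0
17 2 1 1
17 2 1 1
17 2 0 0
17 2 1 0
end

* Only observations at risk belong in the denominator
keep if AT_RISK_MARRY == 1

* Mean of the 0/1 flag is the rate; the count shows how many records it rests on
collapse (mean) rate=NEXT_MARRY (count) n=NEXT_MARRY, by(AGE_IN_SUI SEX)
list, clean
```

With these invented records the rate is 0.25 at age 16 (1 of 4) and about 0.67 at age 17 (2 of 3 at risk). Keeping the count alongside the mean makes it easier to spot ages where the rate rests on very few observations.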