SJTU CMGPD 2012 Methodological Lecture Day 4 Household and Relationship Variables.


Outline
Existing household variables
– Identifiers
– Characteristics
– Dynamics
– Household relationship
Creation of new variables
– Use of bysort/egen
Household relationship variables

Identifiers
HOUSEHOLD_ID
– Identifies records associated with a household in the current register
HOUSEHOLD_SEQ
– The order of the current household (linghu) within the current household group (yihu)
UNIQUE_HH_ID
– Identifies records associated with the same household across different registers
– New value assigned at time of household division
– Each of the resulting households gets a new, different UNIQUE_HH_ID

Characteristics
HH_SIZE
– Number of living members of the household
– Set to missing before 1789
HH_DIVIDE_NEXT
– Number of households in the next register that the members of the current household are associated with
– 1 if no division
– 0 if extinction
– 2 or more if division
– Set to missing before 1789

histogram HH_SIZE if PRESENT & HH_SIZE > 0, width(2) scheme(s1mono) fraction ytitle("Proportion of individuals") xtitle("Number of members")

This isn’t particularly appealing
A log scale on the x axis would help
In Stata, histogram forces fixed-width bins, even when the x scale is set to log
We can collapse the data and plot using twoway bar or scatter
table HH_SIZE, replace
twoway bar table1 HH_SIZE if HH_SIZE > 0, xscale(log) scheme(s1mono) xlabel( )

What if we would like to convert to fractions?
Compute the total number of households by summing table1, then divide each value of table1 by the total
sum(table1) returns the running sum of table1 up to the current observation
total[_N] returns the value of total in the last observation, which is the grand total
drop if HH_SIZE <= 0
generate total = sum(table1)
generate hh_fraction = table1/total[_N]
twoway bar hh_fraction HH_SIZE if HH_SIZE > 0, xscale(log) scheme(s1mono) xlabel( ) ytitle("Proportion of households")
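The running-sum idiom above can be sketched on a few made-up counts. The three rows below are hypothetical toy data, not CMGPD values, and stand in for what table HH_SIZE, replace would leave in memory.

```stata
* Toy data: hypothetical counts, not taken from the CMGPD
clear
input HH_SIZE table1
1 10
2 30
3 60
end

* sum() is a running sum down the observations: 10, 40, 100
generate total = sum(table1)

* total[_N] picks the value in the last observation, i.e. the grand total
generate hh_fraction = table1/total[_N]

list HH_SIZE table1 total hh_fraction
```

Because sum() accumulates in the current sort order, the grand total is only available in the last observation, which is why the division uses total[_N] rather than total.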

Households as units of analysis
The previous figures all treated individuals as the units of analysis
Every household was represented as many times as it had members
– A household with 100 members would contribute 100 observations
In effect, the figures represent household size as experienced by individuals
Sometimes we would like to treat households as units of analysis
– So that each household only contributes one observation per register

Households as units of analysis
One easy way is to create a flag variable that is set to 1 only for the first observation in each household
Then select based on that flag variable for tabulations etc.
This leaves the original individual level data intact
bysort HOUSEHOLD_ID: generate hh_first_record = _n == 1
histogram HH_SIZE if hh_first_record & HH_SIZE > 0, width(2) scheme(s1mono) fraction ytitle("Proportion of households") xtitle("Number of members")
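A minimal sketch of the difference between the two units of analysis, using two hypothetical households (one with three members, one with a single member). The data are invented for illustration only.

```stata
* Toy data: two hypothetical households, one record per member
clear
input HOUSEHOLD_ID HH_SIZE
101 3
101 3
101 3
102 1
end

* Flag the first record of each household
bysort HOUSEHOLD_ID: generate hh_first_record = _n == 1

* Individuals as units: the large household is counted three times
summarize HH_SIZE

* Households as units: each household is counted once
summarize HH_SIZE if hh_first_record
```

The first summarize averages over four individual records, the second over two households, so the mean household size differs between the two even though the underlying data are the same.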

Another approach to plotting trends
We can plot average household size by year of birth without ‘destroying’ the data with TABLE, REPLACE or COLLAPSE
bysort YEAR: egen mean_hh_size = mean(HH_SIZE) if HH_SIZE > 0
bysort YEAR: generate first_in_year = _n == 1
twoway scatter mean_hh_size YEAR if first_in_year & YEAR >= 1775, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1775(25)1900)

Mean household size of individuals by age
keep if AGE_IN_SUI > 0 & SEX == 2 & YEAR >= 1789 & HH_SIZE > 0
bysort AGE_IN_SUI: egen mean_hh_size = mean(HH_SIZE)
bysort AGE_IN_SUI: generate first_in_age = _n == 1
twoway scatter mean_hh_size AGE_IN_SUI if first_in_age & AGE_IN_SUI <= 80, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1(5)85) xtitle("Age in sui")
lowess mean_hh_size AGE_IN_SUI if first_in_age & AGE_IN_SUI <= 80, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1(5)85) xtitle("Age in sui") msize(small)

Household division
Individuals by next register
. tab HH_DIVIDE_NEXT if PRESENT & NEXT_3 & HH_DIVIDE_NEXT >= 0
Number of household in the next available register | Freq. Percent Cum.
| 789, | 33, | 5, | 1, | | | | | | Total | 830,

Household division
Households by next register
. bysort HOUSEHOLD_ID: generate first_in_hh = _n == 1
. tab HH_DIVIDE_NEXT if PRESENT & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
Number of household in the next available register | Freq. Percent Cum.
| 117, | 2, | | | | | | | Total | 119,

Household division
Example of a simple analysis
generate byte DIVISION = HH_DIVIDE_NEXT > 1
generate l_HH_SIZE = ln(HH_SIZE)/ln(1.1)
logit DIVISION HH_SIZE YEAR if HH_SIZE > 0 & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
logit DIVISION l_HH_SIZE YEAR if NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
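The division by ln(1.1) is not explained on the slide. A short derivation of what it appears to be doing, with H written for HH_SIZE, x for l_HH_SIZE, and p for the probability of division (the symbols are introduced here for illustration):

```latex
x = \frac{\ln H}{\ln 1.1} = \log_{1.1} H
\quad\Longrightarrow\quad
x + 1 = \log_{1.1}(1.1\,H)

\ln\frac{p}{1-p} = \beta_0 + \beta_1 x + \beta_2\,\mathrm{YEAR}
\quad\Longrightarrow\quad
e^{\beta_1} = \text{odds ratio for a 10\% increase in } H
```

Under this reading, the coefficient on l_HH_SIZE in the second logit is the change in the log odds of division associated with a 10 percent larger household, whereas the coefficient on HH_SIZE in the first logit is the change per additional member.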

. logit DIVISION HH_SIZE YEAR if HH_SIZE > 0 & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
(Logistic regression output: iteration log, model fit statistics, and a coefficient table for HH_SIZE, YEAR, and _cons. The numeric values are not preserved in this transcript.)

. logit DIVISION l_HH_SIZE YEAR if NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
(Logistic regression output: iteration log, model fit statistics, and a coefficient table for l_HH_SIZE, YEAR, and _cons. The numeric values are not preserved in this transcript.)

Creating household variables
bysort and egen are your friends
Use HOUSEHOLD_ID to group observations of the same household in the same register
Let’s start with a count of the number of live individuals in the household
bysort HOUSEHOLD_ID: egen new_hh_size = total(PRESENT)
. corr HH_SIZE new_hh_size if YEAR >= 1789
(Correlation matrix of HH_SIZE and new_hh_size. The numeric values are not preserved in this transcript.)

Creating measures of age and sex composition of the household
bysort HOUSEHOLD_ID: egen males_1_15 = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 1 & AGE_IN_SUI <= 15)
bysort HOUSEHOLD_ID: egen males_16_55 = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 16 & AGE_IN_SUI <= 55)
bysort HOUSEHOLD_ID: egen males_56_up = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 56)
bysort HOUSEHOLD_ID: egen females_1_15 = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 1 & AGE_IN_SUI <= 15)
bysort HOUSEHOLD_ID: egen females_16_55 = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 16 & AGE_IN_SUI <= 55)
bysort HOUSEHOLD_ID: egen females_56_up = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 56)
generate hh_dependency_ratio = (males_1_15+males_56_up+females_1_15+females_56_up)/HH_SIZE
bysort AGE_IN_SUI: generate first_in_age = _n == 1
bysort AGE_IN_SUI: egen mean_hh_dependency_ratio = mean(hh_dependency_ratio)
twoway line mean_hh_dependency_ratio AGE_IN_SUI if first_in_age & AGE_IN_SUI >= 16 & AGE_IN_SUI = 56 sui)") xtitle("Age in sui")
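The pattern of counting household members with egen total() over a logical expression can be sketched on a small invented data set. The variable names n_present, n_dependent, and dep_ratio are hypothetical, and the single "dependent" category is a simplification of the six age and sex groups above.

```stata
* Toy data: two hypothetical households (SEX: 1 = female, 2 = male)
clear
input HOUSEHOLD_ID PRESENT SEX AGE_IN_SUI
1 1 2 40
1 1 1 38
1 1 2 10
1 1 1 60
2 1 2 30
2 0 1 70
end

* A logical expression is 1 when true and 0 when false,
* so total() of it counts the records that satisfy the condition
bysort HOUSEHOLD_ID: egen n_present = total(PRESENT)

* Stata treats a missing number as larger than any value, so
* AGE_IN_SUI >= 56 alone would also count records with missing age
bysort HOUSEHOLD_ID: egen n_dependent = total(PRESENT & !missing(AGE_IN_SUI) & (AGE_IN_SUI <= 15 | AGE_IN_SUI >= 56))

generate dep_ratio = n_dependent/n_present

list, sepby(HOUSEHOLD_ID)
```

Every record in a household receives the same household-level value, so the individual-level data stay intact and the new variables can be used directly as covariates.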

Numbers of individuals who co-reside with someone who holds a position
. bysort HOUSEHOLD_ID: egen position_in_hh = total(PRESENT & HAS_POSITION > 0)
. tab position_in_hh if PRESENT & YEAR >= 1789
position_in_hh | Freq. Percent Cum.
| 1,177, | 87, | 24, | 8, | 4, | 1, | | | | Total | 1,305,
. replace position_in_hh = position_in_hh > 0
(49183 real changes made)
. tab position_in_hh if PRESENT & YEAR >= 1789
position_in_hh | Freq. Percent Cum.
| 1,177, | 127, Total | 1,305,

RELATIONSHIP
String describes relationship of individual to the head of the household
– Before 1789, describes relationship to head of yihu
This is the basis of our kinship linkage
– Automated linkage of children to their parents
– Automated linkage of wives to their husbands
– All based on processing of strings describing relationship

RELATIONSHIP
Core
e is household head
w is household head’s wife
m is household head’s mother
f is household head’s father (usually dead)
1yb, 2yb, 2ob etc. are head’s brothers
– Older brothers of the head are unusual
1yz, 2yz, 2oz etc. are head’s unmarried sisters
1s, 2s, etc. are head’s sons
1d, 2d, etc. are head’s unmarried daughters

RELATIONSHIP
Combining codes
More distant relationships are built up from these core relationships by combining them
Examples
– ff is grandfather of head
– fm is grandmother of head
– f2yb is an uncle: father’s 2nd younger brother
– f2ybw is his wife
– f2yb1s is a cousin: father’s 2nd younger brother’s 1st son
– 3yb2s is a nephew: 3rd younger brother’s 2nd son
– 3s2s is a grandson: 3rd son’s 2nd son
– 3s2sw is his wife

RELATIONSHIP
Linking wives to husbands
Strip the w off of a married woman’s relationship and search the household for the remaining string
– f2yb1sw -> search for f2yb1s
Exceptions
– For w, search for e
– For m, search for f
– For fm, search for ff
– Etc.
Basically prepare a target string, and then make use of merge on HOUSEHOLD_ID and the target
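One way the target-string approach could look in practice is sketched below on a single invented household. This is an illustration under stated assumptions, not the procedure actually used to build the CMGPD links: HUSBAND_ID, target, stem, and last are hypothetical names, the rule "a code ending in m is matched to the same code ending in f" is a generalization of the m and fm exceptions, and the merge assumes that each RELATIONSHIP string occurs at most once per household.

```stata
* Toy data: one hypothetical household
clear
input HOUSEHOLD_ID PERSON_ID str12 RELATIONSHIP
1 11 "e"
1 12 "w"
1 13 "m"
1 14 "f2yb1s"
1 15 "f2yb1sw"
end

* Split each code into its last character and everything before it
generate last = substr(RELATIONSHIP, -1, 1)
generate stem = substr(RELATIONSHIP, 1, strlen(RELATIONSHIP) - 1)

* Build the string to search for
generate str12 target = ""
replace target = stem if last == "w"          // general rule: strip the w
replace target = "e" if RELATIONSHIP == "w"    // head's wife -> head
replace target = stem + "f" if last == "m"     // assumed: ...m -> ...f

* Turn the household roster into a lookup table of potential husbands
preserve
keep HOUSEHOLD_ID PERSON_ID RELATIONSHIP
rename PERSON_ID HUSBAND_ID
rename RELATIONSHIP target
tempfile husbands
save `husbands'
restore

* Attach the husband's identifier where the target is found
merge m:1 HOUSEHOLD_ID target using `husbands', keep(master match) nogenerate
sort PERSON_ID
list PERSON_ID RELATIONSHIP target HUSBAND_ID
```

In this toy household the mother (m) stays unlinked because no record with relationship f is present, which mirrors the note above that the head's father is usually dead.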

RELATIONSHIP
Linking children to fathers
In most cases, strip off the last relationship code and look for the remainder
– 1s1s -> look for 1s
– ff2yb3s2s -> look for ff2yb3s
Exceptions
– e -> look for f
– 2yb -> look for f
– f2yb -> look for ff
To link married women to their fathers-in-law, strip off w first, then convert to father’s relationship
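The rules for finding a father's code can be sketched with Stata's regular expression functions. This is a hedged illustration only: father_target is a hypothetical variable name, the pattern assumes that every non-core code has the form digits, optional o or y, then one letter, and the last rule (children of the head are matched to e when nothing remains after stripping) is an assumption that the slide does not state.

```stata
* Toy data: relationship codes taken from the examples above
clear
input str12 RELATIONSHIP
"e"
"1s"
"1s1s"
"ff2yb3s2s"
"2yb"
"f2yb"
end

* General rule: remove the last code (digits, optional o/y, one letter)
generate str12 father_target = regexr(RELATIONSHIP, "[0-9]+[oy]?[a-z]$", "")

* Brothers share a father with the person they are a brother of:
* 2yb -> f, f2yb -> ff
replace father_target = father_target + "f" if regexm(RELATIONSHIP, "[0-9]+[oy]b$")

* The head's own father
replace father_target = "f" if RELATIONSHIP == "e"

* Assumption: nothing left after stripping means a child of the head
replace father_target = "e" if father_target == ""

list
```

For married women, the same logic would be applied after first removing the trailing w, as described above.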

RELATIONSHIP
Indicators of specific basic relationships to head
generate head = RELATIONSHIP == "e"
generate head_wife = RELATIONSHIP == "w"
generate mother = RELATIONSHIP == "m"
generate father = RELATIONSHIP == "f"
. tab head SEX if PRESENT & SEX >= 1, row col
Key: frequency, row percentage, column percentage
head | Female Male | Total
| 539, ,972 | 1,211,907
| 7, ,658 | 186,806
Total | 547, ,630 | 1,398,713

RELATIONSHIP
Processing for distant relationships
Strip out numbers, seniority modifiers y and o, the wife marker w, etc.
In a .do file, this will create a new variable with a stripped relationship
generate new_RELATIONSHIP = RELATIONSHIP
local for_removal "0 1 2 3 4 5 6 7 8 9 o y w"
foreach x of local for_removal {
    replace new_RELATIONSHIP = subinstr(new_RELATIONSHIP,"`x'","",.)
}

Examples
RELATIONSHIP    new_RELATIONSHIP
e               e
w               (empty)
f               f
m               m
1ob             b
1obw            b
1ob1s           bs
3yb             b
3ybw            b
3yb1s           bs
3yb1d           bd
4yb             b
4ybw            b
f2yb            fb
f2ybw           fb
f2yb1d          fbd
f3yb            fb
f3ybw           fb
f3yb1s          fbs
f3yb1sw         fbs
f3yb1s1s        fbss
f3yb1s1d        fbsd
f3yb2s          fbs
f3yb2sw         fbs
f3yb2s1d        fbsd
f4ybw           fb
f4yb1sw         fbs
f4yb1s1d        fbsd
f4yb1d          fbd
f4yb2d          fbd

generate brother = new_RELATIONSHIP == "b" & SEX == 2
generate brothers_wife = new_RELATIONSHIP == "b" & SEX == 1 & MARITAL_STATUS != 2 & MARITAL_STATUS > 0
generate sister = new_RELATIONSHIP == "z" & SEX == 1
generate male_cousin = new_RELATIONSHIP == "fbs" & SEX == 2
generate nephew = new_RELATIONSHIP == "bs" & SEX == 2

Proportions of different relationships by age
generate brother = new_RELATIONSHIP == "b"
bysort AGE_IN_SUI: egen males = total(SEX == 2 & PRESENT)
bysort AGE_IN_SUI: egen brothers = total(SEX == 2 & brother & PRESENT)
generate proportion_brothers = brothers/males
by AGE_IN_SUI: generate first_in_age = _n == 1
twoway line proportion_brothers AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are brother of a head") scheme(s1mono)
bysort AGE_IN_SUI: egen heads = total(SEX == 2 & RELATIONSHIP == "e" & PRESENT)
generate proportion_heads = heads/males
twoway line proportion_heads AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are household head") scheme(s1mono)
bysort AGE_IN_SUI: egen sons = total(SEX == 2 & new_RELATIONSHIP == "s" & PRESENT)
generate proportion_sons = sons/males
twoway line proportion_sons AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are son of a head") scheme(s1mono)

Relationship at first appearance
bysort PERSON_ID (YEAR): generate fa_nephew = new_RELATIONSHIP[1] == "bs" & AGE[1] <= 10 & SEX == 2 & PRESENT
bysort PERSON_ID (YEAR): generate fa_son = new_RELATIONSHIP[1] == "s" & AGE[1] <= 10 & SEX == 2 & PRESENT
generate fa_nephew_head = fa_nephew & head
generate fa_son_head = fa_son & head
bysort AGE_IN_SUI: egen fa_sons = total(fa_son)
bysort AGE_IN_SUI: egen fa_nephews = total(fa_nephew)
bysort AGE_IN_SUI: egen fa_sons_head = total(fa_son_head)
bysort AGE_IN_SUI: egen fa_nephews_head = total(fa_nephew_head)
generate p_fa_sons_head = fa_sons_head/fa_sons
generate p_fa_nephews_head = fa_nephews_head/fa_nephews
twoway line p_fa_sons_head p_fa_nephews_head AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion") scheme(s1mono)
twoway line p_fa_sons_head p_fa_nephews_head AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion now head") scheme(s1mono) legend(order(1 "Appeared as sons of head" 2 "Appeared as nephews of head"))