Advanced quantitative methods for social scientists (2017–2018), LC & PVK
Session 2: Multilevel analysis in Stata (with a focus on random slope models for comparative research)
Louis Chauvel, University of Luxembourg, PEARL Institute for Research on Socio-Economic Inequality (IRSEI)
Outline
Background
- Standard multiple regressions versus random effects models
- Fixed effects and random effects
- Basics on notations in multilevel analysis
- 2-level models / random effect / random slope
- Generalization: higher-level models and cross-classified models
Method with example: the PISA survey
- Fitting random effects and random slope models
- Post-estimation techniques: BLUPs, Multilevel tools (mlt)
- Understanding and presenting results
- Examples of publication
- Further developments on panel analysis: xtmixed as a pervasive command

Chauvel L, Leist AK. Socioeconomic hierarchy and health gradient in Europe: the role of income inequality and of social origins. International Journal for Equity in Health. 2015;14:132. doi:10.1186/s12939-015-0263-y. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4647815/
Chauvel L, Hartung A. More Inequality, More Viscosity? Intergenerational Mobility in International Comparison. PAA Annual Meeting, Washington DC, March 31 – April 2, 2016. https://paa.confex.com/paa/2016/meetingapp.cgi/Paper/6597
Main references
- http://www.bristol.ac.uk/cmm/learning/support/books.html
- Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks: Sage Publications.
- Rabe-Hesketh, S., & Skrondal, A. (2012). Multilevel and longitudinal modeling using Stata. Stata Press.
- Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Multilevel (2L) data structure
Simple example: 2-level data, that is:

Level 2 (countries):     Country 1      Country 2      Country 3      Country 4
Level 1 (individuals):   I1 I2 I3 I4    I1 I2 I3 I4    I1 I2 I3 I4    I1 I2 I3 I4

NB: Minimum 20 level-2 groups
Typical example: PISA 2012
- Educational performance at age circa 15
- Many countries (68)
- Parental/family backgrounds
- Performance variation by country
- Influence of parental background (Social Reproduction) by country
- Explanation of Social Reproduction variation? Country GDP/capita, Gini, etc.

The old solution: series of standard OLS
- To open the dataset and prepare it: http://www.louischauvel.org/ML_pisa_2012.do * PROGRAM SEGMENT 0
- To process the old solution: http://www.louischauvel.org/ML_pisa_2012.do * PROGRAM SEGMENT 1
[Figure: scatter plot with highlighted country labels FRANCE, LUX, HKG, ALBANIA]
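* Old solution: one OLS regression per country, coefficients stacked in matrix R
* Columns of R: 1 = country code, 2 = ST04Q01, 3 = f1, 4 = stdage, 5 = _cons
matrix R = J(1,5,.)
levelsof cco
foreach i of numlist `r(levels)' {
    di `i'
    ta cnt if `i'==cco
    capture {
        quietly: reg PV1READ ST04Q01 f1 stdage if `i'==cco
        matrix A = e(b)
        noisily matrix li A
        matrix C = `i', A
        matrix R = R \ C
    }
}
mat li R

* Turn the stacked coefficients into a country-level dataset and plot them
preserve
clear
svmat R
gen CountryScore = R5
gen SocReproduction = R3
gen cco = R1
twoway scatter SocReproduction CountryScore, mlabel(cco)
reg SocReproduction CountryScore
* Same regression without country code 1
reg SocReproduction CountryScore if R1!=1
restore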
Multilevel Data: why?
Multilevel models respect the structure of the data we have:
1. Clustered data and correlated errors in each cluster
2. ML relaxes the assumption of uncorrelated (independent) errors
3. Partitioning variance-covariance components

Question: at what level is most of the variance?
- Conceptually: different levels and their effects?
- Statistically: are your data clustered?
- Empirically: are there variations both at L1 and L2?
… And we can “easily” refine the models
Fixed Effects Model (FEM) & Random Effects (REM)
- J groups; i cases within j groups
- a_j is a separate intercept for each group
- Equivalent to the “within-group” model: all variables are centered around the mean of each group
- In practice: FEM = standard OLS with one dummy variable per group (J group-specific intercepts)
  => group differences enter as a fixed effect
* PROGRAM SEGMENT 2
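The fixed-effects specification described on this slide can be written out as below. This is a reconstruction in standard notation rather than the slide's own formula: $Y_{ij}$, $X_{ij}$, $\beta$ and $e_{ij}$ are assumed symbol names, and $a_j$ is the group intercept named in the text.

```latex
% Fixed-effects model: one intercept a_j per group, common slope beta
Y_{ij} = a_j + \beta X_{ij} + e_{ij}, \qquad i = 1,\dots,n_j,\quad j = 1,\dots,J

% Equivalent "within-group" form: every variable centered on its group mean,
% which removes a_j from the equation
Y_{ij} - \bar{Y}_{j} = \beta\,(X_{ij} - \bar{X}_{j}) + (e_{ij} - \bar{e}_{j})
```

Because $a_j$ drops out of the second line, any variable that is constant within a group (a country-level characteristic, for instance) cannot be estimated in this model.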
Random Effects
- Alternatively, treat the group effects as random effects
- Do not estimate a separate effect for each group; model their distribution instead

A simple random intercept model (notation from Rabe-Hesketh & Skrondal), where:
- β is the main intercept
- ζ_j (zeta) is a random effect for each group
  - It allows each of the j groups to have its own intercept
  - Assumed to be independent and normally distributed
- ε_ij is the error term for each case
  - Also assumed to be independent and normally distributed

NB: Minimum 20 level-2 groups
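The random intercept model that the slide describes in words can be sketched as follows. The equation is reconstructed from the definitions given here: β is the main intercept (written b on the slide), ζ_j the group effect (z) and ε_ij the case-level error (e). The variance symbols ψ and θ are assumptions introduced for illustration.

```latex
% Random intercept model without covariates
Y_{ij} = \beta + \zeta_j + \varepsilon_{ij}

% Distributional assumptions stated on the slide
\zeta_j \sim N(0, \psi), \qquad \varepsilon_{ij} \sim N(0, \theta), \qquad \zeta_j \perp \varepsilon_{ij}

% Share of the total variance located at the group level (intra-class correlation)
\rho = \frac{\psi}{\psi + \theta}
```

Covariates enter by adding terms such as $\beta_2 X_{ij}$ to the first line.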
xtreg syntax
* PROGRAM SEGMENT 4 *Comparing FE and RE models

    xtreg PV1READ ST04Q01 f1 stdage, i(cco) fe

  where
    PV1READ              dependent variable
    ST04Q01 f1 stdage    X explanatory variables
    i(cco)               level-2 group variable
    fe                   FE or RE model (fe / re)
Best model? Usual solution => Hausman specification test
- Fixed effects are the most reliably consistent estimator as N grows very large
- But they are less efficient than random effects when within-group variation is low (between-group variation is large) and the sample is small (not the case with PISA…)

Hausman specification test: a tool to help choose between fixed and random effects
- Logic: both fixed and random effects models are consistent if the models are properly specified
- However, some violations make the random effects model inconsistent
  Ex: if the X variables are correlated with the random error
- In short: the two models should give the same results… If not, random effects may be biased
  - If results are similar, use the most efficient model: random effects
  - If results diverge, odds are that the random effects model is biased; in that case use fixed effects…
Hausman Specification Test
Strategy:
- Estimate both fixed and random effects models
- Save the estimates each time
- Finally invoke the Hausman test

Ex (here with the “old” xtreg Stata command):
* PROGRAM SEGMENT 4 *Comparing FE and RE models
    xtreg PV1READ ST04Q01 f1 stdage, i(cco) fe
    est store femod
    xtreg PV1READ ST04Q01 f1 stdage, i(cco) re
    est store remod
    esttab femod remod
    hausman femod remod
Linear Fixed Intercepts Model

. xtreg PV1READ ST04Q01 parentalbckgrnd stdage, i(cco) fe

Fixed-effects (within) regression               Number of obs      =    413190
Group variable: ccode                           Number of groups   =        67

R-sq:  within  = 0.1440                         Obs per group: min =       259
       between = 0.2496                                        avg =    6167.0
       overall = 0.1750                                        max =     29486

                                                F(3,413120)        =  23167.57
corr(u_i, Xb)  = 0.1095                         Prob > F           =    0.0000

---------------------------------------------------------------------------------
        PV1READ |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
----------------+----------------------------------------------------------------
        ST04Q01 |  -35.36242   .2558221  -138.23   0.000    -35.86382   -34.86101
parentalbckgrnd |   16.18388   .0727843   222.35   0.000     16.04123    16.32654
         stdage |   3.513675   .1302983    26.97   0.000     3.258295    3.769056
          _cons |   532.6464   .4030946  1321.39   0.000     531.8563    533.4364
----------------+----------------------------------------------------------------
        sigma_u |  40.627429
        sigma_e |  82.173407
            rho |  .19642709   (fraction of variance due to u_i)
---------------------------------------------------------------------------------
F test that all u_i=0:     F(66, 413120) =  1331.73          Prob > F = 0.0000

Annotations: sigma_u = SD of u (intercepts); sigma_e = SD of e; rho = intra-class correlation
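As a quick check on how the three quantities at the bottom of this output relate, rho can be recomputed by hand from sigma_u and sigma_e. The formula is the usual variance-share definition of the intra-class correlation; the numbers are those printed in the output above, rounded.

```latex
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
     = \frac{40.627^2}{40.627^2 + 82.173^2}
     \approx \frac{1650.6}{1650.6 + 6752.5}
     \approx 0.196
```

Roughly one fifth of the variance in reading scores that remains after the covariates is therefore located between countries.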
Linear Random Intercepts Model

. xtreg PV1READ ST04Q01 parentalbckgrnd stdage, i(cco) re

Random-effects GLS regression                   Number of obs      =    413190
Group variable: ccode                           Number of groups   =        67

R-sq:  within  = 0.1440                         Obs per group: min =       259
       between = 0.2496                                        avg =    6167.0
       overall = 0.1750                                        max =     29486

                                                Wald chi2(3)       =  69522.37
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

---------------------------------------------------------------------------------
        PV1READ |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
----------------+----------------------------------------------------------------
        ST04Q01 |  -35.36185   .2558235  -138.23   0.000    -35.86326   -34.86045
parentalbckgrnd |   16.18568   .0727767   222.40   0.000     16.04305    16.32832
         stdage |   3.512553   .1302975    26.96   0.000     3.257175    3.767932
          _cons |   534.6695    4.79814   111.43   0.000     525.2653    544.0737
----------------+----------------------------------------------------------------
        sigma_u |  39.124915
        sigma_e |  82.173407
            rho |  .18480223   (fraction of variance due to u_i)
---------------------------------------------------------------------------------

Annotations: the model assumes normal u_j, uncorrelated with the X variables; sigma_u = SD of u (intercepts); sigma_e = SD of e; rho = intra-class correlation
Hausman Specification Test
Example: PISA reading score, fe vs re

. hausman femod remod

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     femod        remod        Difference          S.E.
-------------+----------------------------------------------------------------
     ST04Q01 |   -35.36242    -35.36185       -.000564               .
parentalbc~d |    16.18388     16.18568       -.0018003        .0010541
      stdage |    3.513675     3.512553        .0011221        .0004391
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        7.65
                Prob>chi2 =      0.0539
                (V_b-V_B is not positive definite)

Annotations: the first columns give a direct comparison of coefficients. The p-value is non-significant at the 5% level, which indicates that the two models yield similar results… OK
Within & Between Effects / Centering
Why do we do multilevel models? To understand the role of inequality between and within countries
So: “centering” variables, both grand mean and group mean centering
- Grand mean centering: computing variables as deviations from the overall mean
  - Should be systematically done for X variables
- Group mean centering: computing variables as deviations from the group mean
  - Useful for decomposing within vs. between effects: relative role of inequality between and within countries
  - Often used in conjunction with aggregate group mean variables
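Grand mean centering as defined on this slide takes two lines in Stata. This is a minimal sketch that assumes the PISA dataset of the previous slides is in memory; the new variable name parental_gmc is hypothetical.

```stata
* Sketch: grand mean centering of the parental background score.
* parental_gmc is a hypothetical name for the centered variable.
summarize parentalbckgrnd, meanonly
gen parental_gmc = parentalbckgrnd - r(mean)

* The centered variable should now have a mean of (approximately) zero.
summarize parental_gmc
```

With centered covariates, the model constant can be read as the expected score of a case at the overall mean of each X variable.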
Within & Between Effects
You can estimate BOTH within- and between-group effects in a single model
Strategy: split a variable (e.g., household possession score) into two new variables…
1. Group mean household possession score
2. Within-group deviation from the mean household possession score (often called “group mean centering”)
Then put both variables into a random effects model
- The model will estimate separate coefficients for between vs. within effects

Ex:
* PROGRAM SEGMENT 5 *Assessing within and between effects
    egen betwparentalbckgrnd=mean(parentalbckgrnd), by(cco)
    gen withinparentalbckgrnd=parentalbckgrnd-betwparentalbckgrnd
    xtreg PV1READ ST04Q01 stdage betw withi, i(cco) re
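The model fitted by the last command can be sketched as below. The notation is an assumption made for illustration: $x_{ij}$ stands for parentalbckgrnd, $\bar{x}_j$ for its country mean (betwparentalbckgrnd), and $\beta_B$, $\beta_W$ for the between and within coefficients. The other covariates are omitted for brevity.

```latex
% Random intercept model with separate between and within effects
Y_{ij} = \beta_0 + \beta_B\,\bar{x}_j + \beta_W\,(x_{ij} - \bar{x}_j) + \zeta_j + \varepsilon_{ij}

% Entering x_{ij} as a single variable amounts to imposing \beta_B = \beta_W:
Y_{ij} = \beta_0 + \beta\,x_{ij} + \zeta_j + \varepsilon_{ij}
```

Comparing the two estimated coefficients shows whether country-level differences in parental background relate to scores in the same way as individual differences within a country.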
Linear Random Intercepts Model

. xtreg PV1READ ST04Q01 stdage betw withi, i(cco) re

Random-effects GLS regression                   Number of obs      =    413190
Group variable: ccode                           Number of groups   =        67

R-sq:  within  = 0.1440                         Obs per group: min =       259
       between = 0.2540                                        avg =    6167.0
       overall = 0.1833                                        max =     29486

                                                Wald chi2(4)       =  69526.50
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

---------------------------------------------------------------------------------------
              PV1READ |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
----------------------+----------------------------------------------------------------
              ST04Q01 |  -35.36199    .255825  -138.23   0.000    -35.86339   -34.86058
               stdage |   3.512629   .1302982    26.96   0.000     3.257249    3.768009
  betwparentalbckgrnd |   24.26494   4.669297     5.20   0.000     15.11329    33.41659
withinparentalbckgrnd |   16.18389   .0727852   222.35   0.000     16.04123    16.32655
                _cons |   533.5929   4.631905   115.20   0.000     524.5145    542.6712
----------------------+----------------------------------------------------------------
              sigma_u |  37.414277
              sigma_e |  82.173407
                  rho |  .17170966   (fraction of variance due to u_i)
---------------------------------------------------------------------------------------

Annotation: parental background has a huge effect both within and between countries
Generalizing: Random Coefficients (= Random Slopes)
- The linear random intercept model allows random variation in the intercept (mean) for groups
- But the same idea can be applied to other coefficients
- That is, slope coefficients can ALSO be random!

Random coefficient model, where:
- ζ_1j (zeta-1) is a random intercept component = differences between countries
- ζ_2j (zeta-2) is a random slope component = country-specific inequality effect
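The random coefficient model referred to on this slide can be sketched as follows. The two equations are reconstructed from the definitions of zeta-1 and zeta-2 given here, in the Rabe-Hesketh & Skrondal style of notation; $\beta_1$, $\beta_2$ and $X_{ij}$ are assumed symbol names.

```latex
% Random coefficient model
Y_{ij} = \beta_1 + \beta_2 X_{ij} + \zeta_{1j} + \zeta_{2j} X_{ij} + \varepsilon_{ij}

% Which can be written as
Y_{ij} = (\beta_1 + \zeta_{1j}) + (\beta_2 + \zeta_{2j})\,X_{ij} + \varepsilon_{ij}
```

In the second form, each country $j$ has its own intercept $\beta_1 + \zeta_{1j}$ and its own slope $\beta_2 + \zeta_{2j}$.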
Linear Random Coefficient Model (Rabe-Hesketh & Skrondal)
Both intercepts and slopes vary randomly across j groups
[Figure: country-specific regression lines; y-axis PV1READ, x-axis parentalbckgrnd. Annotations: inequality between countries (intercepts vary randomly); inequality within country (slopes)]
xtmixed syntax
* PROGRAM SEGMENT 6 * a first random slope model
- xtmixed allows random intercepts and slopes
- “Mixed” models are models that have both fixed and random components

General form:
    xtmixed [depvar] [fixed equation] || [random eq], options
Example:
    xtmixed PV1READ ST04Q01 stdage || cco: parentalbckgrnd , iter(5) diff mle cov(unstr)
  where
    PV1READ                        dependent variable
    ST04Q01 stdage                 fixed-effect variables
    cco:                           level-2 variable for the random effects
    parentalbckgrnd                slope variable
    iter(5) diff mle cov(unstr)    estimation options

- cov(unstr), i.e. cov(unstructured), relaxes the constraints on the covariance among random effects (see Rabe-Hesketh & Skrondal)
- The Stata default treats the random terms (intercept, slope) as totally uncorrelated… not always reasonable
Example: PISA 2012

. xtmixed PV1READ ST04Q01 stdage || cco: parentalbckgrnd, iter(5) diff mle cov(unstr)

Mixed-effects ML regression                     Number of obs      =    413190
Group variable: ccode                           Number of groups   =        67

                                                Obs per group: min =       259
                                                               avg =    6167.0
                                                               max =     29486

                                                Wald chi2(2)       =  19804.39
Log likelihood = -2405736.9                     Prob > chi2        =    0.0000

------------------------------------------------------------------------------
     PV1READ |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ST04Q01 |  -35.14138   .2542637  -138.21   0.000    -35.63973   -34.64304
      stdage |   3.483322   .1294873    26.90   0.000     3.229531    3.737112
       _cons |   494.2034   4.810792   102.73   0.000     484.7744    503.6324
------------------------------------------------------------------------------
.../...
Ex: PISA 2012 (cont’d)

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
ccode: Unstructured          |
                sd(parent~d) |   19.28376   1.670908      16.27183     22.8532
                   sd(_cons) |   54.52031   12.20434      35.15745     84.5472
        corr(parent~d,_cons) |   .6958577   .1663099      .2234163    .9035452
-----------------------------+------------------------------------------------
                sd(Residual) |   81.63907   .0898211      81.46321    81.81531
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(3) =  1.5e+05   Prob > chi2 = 0.0000

Annotations:
- “_cons” (constant) rows refer to the intercepts for countries, “parent~d” rows to the slopes
- Non-zero SDs indicate that both intercepts and slopes vary across countries
- If some of the estimates are not significant, you can simplify the model
What about the random slopes? Slopes = within country parental background gradient of inequality
What about the random slopes?
* PROGRAM SEGMENT 8 * like 6 with BLUP predictors of intercepts and slopes
Best linear unbiased predictions (BLUPs)
[Figure: BLUPs by country; axes: slopes, intercepts]
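One way to obtain the BLUPs after the random slope model is sketched below. It is not taken from PROGRAM SEGMENT 8: the stub blup and the marker onepercountry are hypothetical names, and the iter(5) diff options of segment 6 are left out for brevity.

```stata
* Sketch only: refit the random slope model, then extract the BLUPs.
xtmixed PV1READ ST04Q01 stdage || cco: parentalbckgrnd, mle cov(unstr)

* One new variable per random effect (the stub "blup" is a hypothetical name).
predict blup*, reffects

* The variable labels say which one is the slope and which the intercept.
describe blup*

* BLUPs are constant within country: keep one row per country for listing.
egen byte onepercountry = tag(cco)
list cco blup* if onepercountry, noobs
```

The predicted values are deviations from the fixed part of the model, so a country's own slope is the overall slope plus its slope BLUP.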
Multilevel Model Notation
- The random coefficient (random slope) model can be expressed in a single equation
- However, it is common to separate the levels:
  - Level 1 equation
  - Intercept equation
  - Slope equation
- γ (gamma) = constant; u = random effect
- Here, we specify a random component for the level-1 constant and slope
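The level-by-level notation named on this slide can be sketched as follows. The equations are a reconstruction consistent with the definitions given here (gamma = constant, u = random effect); the subscript convention is an assumption.

```latex
% Level 1 equation
Y_{ij} = \beta_{1j} + \beta_{2j} X_{ij} + \varepsilon_{ij}

% Intercept equation
\beta_{1j} = \gamma_1 + u_{1j}

% Slope equation
\beta_{2j} = \gamma_2 + u_{2j}

% Substituting both into level 1 gives the single-equation form
Y_{ij} = \gamma_1 + \gamma_2 X_{ij} + u_{1j} + u_{2j} X_{ij} + \varepsilon_{ij}
```

The last line is the same random coefficient model as in the single-equation notation, with γ playing the role of the fixed coefficients and u that of the random components.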
Cross-Level Interactions Does context (i.e., level-2) influence the effect of level-1 variables? Example: Effect of country inequality (gini) on lower achievements Can you think of others?
Cross-level interactions
Idea: specify a level-2 variable that affects a level-1 slope
- Level 1 equation
- Intercept equation
- Slope equation with interaction
Cross-level interaction: the level-2 variable Z affects the slope (β2) of a level-1 X variable
- The coefficient γ3 reflects the size of the interaction (effect on β2 per unit change in Z)
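The three equations listed on this slide can be sketched as below. They are reconstructed from the text (Z affects the slope through the coefficient γ3, written g3 on the slide); the subscripts are assumptions.

```latex
% Level 1 equation
Y_{ij} = \beta_{1j} + \beta_{2j} X_{ij} + \varepsilon_{ij}

% Intercept equation
\beta_{1j} = \gamma_1 + u_{1j}

% Slope equation with interaction: the level-2 variable Z_j shifts the slope
\beta_{2j} = \gamma_2 + \gamma_3 Z_j + u_{2j}

% Substituting: the product Z_j X_{ij} appears as a regressor
Y_{ij} = \gamma_1 + \gamma_2 X_{ij} + \gamma_3 Z_j X_{ij} + u_{1j} + u_{2j} X_{ij} + \varepsilon_{ij}
```

The substitution makes explicit why the interaction is estimated by adding the product of the level-1 and level-2 variables to the fixed part of the model.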
Cross-level Interactions
Cross-level interaction in single-equation form: random coefficient model with cross-level interaction
Stata strategy: manually compute cross-level interaction variables
- Ex: Poverty*WelfareState, Gender*SingleSexSchool
- Then put the interaction variable in the “fixed” part of the model
Interpretation: the β3 coefficient indicates the impact of each unit change in Z on the slope β2
- If β3 is positive, an increase in Z results in a larger β2 slope
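The Stata strategy above can be sketched with the PISA variables used earlier. The sketch assumes a country-level variable named gini has been merged into the dataset; that variable and the product name parentXgini are hypothetical. Including the main effect of gini alongside the product is a common choice and an assumption here, not something the slide specifies.

```stata
* Sketch only: gini is a hypothetical country-level variable,
* constant within each value of cco.
gen parentXgini = parentalbckgrnd * gini

* The interaction (and its two components) goes in the fixed part;
* the slope of parentalbckgrnd remains random across countries.
xtmixed PV1READ ST04Q01 stdage parentalbckgrnd gini parentXgini ///
    || cco: parentalbckgrnd, mle cov(unstr)
```

A positive coefficient on parentXgini would mean that the parental background gradient is steeper in countries with a higher Gini.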
Beyond 2-level models
Sometimes data have 3 levels or more
- Ex: school, classroom, individual
- Ex: family, individual, time (repeated measures)
This can be dealt with in xtmixed
- xtmixed syntax: specify the “fixed” equation and then the random effects, starting with the “top” level
    xtmixed var1 var2 var3 || schoolid: var2 || classid:var3
- Again, specify unstructured covariance: cov(unstr)
Advice about building models (Raudenbush & Bryk 2002)
- Start by building the level 1 model
- Then build the level 2 model
- Keep a close eye on the level 2 N