Introduction to secondary analysis of complex survey data
Brandon Nakawaki, PhD
November 8, 2017
Introduction
- Me
- What is secondary analysis?
- What is “complex” survey data?
Outline
- Secondary analysis
- Sampling concepts
- Use with statistical programs
- Goal: Basic understanding of typical secondary datasets, why weighting must be used with complex survey data, and how to use those weights.
- Why this talk?
Secondary analysis
- Advantages
  - Cost efficient
  - Often representative
  - Provides a potentially useful comparison
  - Sometimes it’s the only data source
  - Lots of variables
  - Large unweighted sample size
- Disadvantages
  - Measures not ideal
  - Statistical background required
Pooling datasets
- Cross-sectional designs
- Further boost stability of sample
- Different or adjusted weights
- Measures and major design elements must not change
Where to get data
- Inter-university Consortium for Political and Social Research (ICPSR): www.icpsr.umich.edu
- Simple Online Data Archive for Population Studies (SODAPOP): http://sodapop.pop.psu.edu
- National Data Archive on Child Abuse and Neglect (NDACAN)
- Institutional websites (government, college, contractor, research)
- Project websites
- Other locations
Sampling
- Simple random sampling (SRS)
  - Equal probability of selection
  - Observations are independent and identical in distribution
  - Basis for most statistics
  - Near SRS
- Image adapted from Koziol & Arthur (2011)
Sampling
- Probability samples: Stratification
  - Divide into groups, randomly sample within those groups
  - Ensures a “good” sample, increases precision
- Image adapted from Koziol & Arthur (2011)
Sampling
- Probability samples: Clustering
  - Randomly sample entire groups
  - Convenient, decreases precision
- Image adapted from Koziol & Arthur (2011)
Sampling
- Probability samples
  - Multiple stages of selection
  - With or without replacement
  - Poststratification
    - May be based on available accurate population totals (e.g., age, sex, geographic region)
    - Strong correlates of key survey variables
    - Predictors of noncoverage
  - Oversampling
  - May have fewer or more weighting options (strata, PSU, weights, replicates, etc.)
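The poststratification idea above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `poststratify`, the base weights, and the population totals are all hypothetical, and real surveys usually adjust over several variables at once.

```python
def poststratify(weights, groups, population_totals):
    """Scale weights so each group's weighted total matches a known population total."""
    # Sum the current weights within each poststratification group
    weighted_totals = {}
    for w, g in zip(weights, groups):
        weighted_totals[g] = weighted_totals.get(g, 0.0) + w
    # Multiply every weight by (known total / current weighted total) for its group
    return [w * population_totals[g] / weighted_totals[g]
            for w, g in zip(weights, groups)]


# Hypothetical base weights, group labels, and known population totals
base_weights = [100.0, 100.0, 100.0, 100.0, 100.0]
sex = ["F", "F", "M", "M", "M"]
known_totals = {"F": 260.0, "M": 240.0}

adjusted = poststratify(base_weights, sex, known_totals)
```

After the adjustment, the weights in each group sum to that group's known total, which is what lets the weighted marginals resemble the population.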
What to look for in documentation
- Generalizability
- Missing data
- Weights and design variables
  - “Weight”
  - “Stratum”
  - “Strata”
  - “Cluster”
  - “Primary sampling unit (PSU)”
- Variance estimation method
  - “Linearization”
  - “Taylorization”
  - “Replicate”
  - “Fay”
Common weights
- Sampling weights
  - Make the marginals look like the population from which they were drawn
  - Often include poststratification adjustment (e.g., person-level nonresponse)
NSDUH 2012: unweighted and weighted counts by age
Age      Unweighted      Weighted
12            2,798     4,054,868
13            2,757     4,049,003
14            2,792     4,156,730
15            2,956     4,097,288
16            3,058     4,293,852
17            3,038     4,281,311
Total        17,399    24,933,052
2010 Census: 25,296,465
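As a rough check on the table above, the counts can be loaded into Python to see how the weights scale the sample up to the population. The numbers come from the slide; the variable names are illustrative, and the comparison treats the 2010 Census figure as the matching population total.

```python
# Counts by age, taken from the NSDUH 2012 table above
unweighted = {12: 2798, 13: 2757, 14: 2792, 15: 2956, 16: 3058, 17: 3038}
weighted = {12: 4054868, 13: 4049003, 14: 4156730,
            15: 4097288, 16: 4293852, 17: 4281311}
census_2010 = 25296465

# Average number of people each respondent "stands for" at each age
avg_weight = {age: weighted[age] / unweighted[age] for age in unweighted}

# Share of the Census figure reproduced by the weighted total
coverage = sum(weighted.values()) / census_2010

for age in sorted(avg_weight):
    print(f"age {age}: average weight {avg_weight[age]:.1f}")
print(f"weighted total / 2010 Census = {coverage:.3f}")
```

Each respondent represents well over a thousand people, and the weighted total lands close to the Census figure, which the unweighted total of 17,399 obviously does not.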
Common weights
- Design weights/variables
  - PSU
  - Strata
  - Replicate weights
Variance estimation in complex surveys
- Again, not SRS: must account for the sampling design with weights
- Standard errors usually change with weights
- Point estimates may also change
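To see why point estimates can change, here is a minimal Python sketch comparing an unweighted and a weighted mean. The data are made up: three respondents from an oversampled group (small weights) and one from an undersampled group (large weight).

```python
def weighted_mean(y, w):
    """Weighted mean: sum of weight * value divided by the sum of weights."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Hypothetical outcome (1 = yes, 0 = no) and sampling weights
outcome = [1, 1, 1, 0]
weights = [10, 10, 10, 70]

unweighted_estimate = sum(outcome) / len(outcome)
weighted_estimate = weighted_mean(outcome, weights)

print(unweighted_estimate, weighted_estimate)
```

The unweighted proportion is 0.75, but once each respondent counts in proportion to the people they represent, the estimate drops to 0.30.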
Weighted example (n=17,062)
Unweighted example (n=17,062)
Especially notable differences
Variance estimation in complex surveys
- Again, not SRS: must account for the sampling design with weights
- Three common methods
  - Taylor series linearization
  - Replicate weights
  - Model-based estimation
Variance estimation in complex surveys
- Taylor series linearization
  - Uses at least one clustering variable (PSU) and at least one stratification variable
- Replicate weights
  - Uses many replicate weights, usually numbered sequentially (e.g., weight01-weight100)
- Model-based estimation
  - Uses clustering and stratification variables in multilevel modeling
Variance estimation with linearization
- Taylor series linearization
  - Typically has stratum variable, cluster (PSU) variable, one sampling weight to use at a time
  - Sometimes multiple sampling weights are available for use under different circumstances
  - If stratum, cluster variables are available, assume Taylor linearization (or check with curator)
  - Do not assume if only sampling weight available
  - Subpopulation indicator needed
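For readers who want to see what the software is doing with the stratum, PSU, and weight variables, the following Python sketch computes a linearized variance for a weighted mean. It assumes the common with-replacement form with no finite population correction; the function name and the small dataset are hypothetical, and survey software handles many more cases (single-PSU strata, subpopulations, other statistics).

```python
from collections import defaultdict


def linearized_variance_of_mean(y, w, strata, psu):
    """Weighted mean and its Taylor-linearized variance (with-replacement, no FPC)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized score w * (y - mean) / sum(w), summed within each PSU of each stratum
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, strata, psu):
        psu_totals[h][c] += wi * (yi - mean) / total_w

    # Between-PSU variability of those totals, accumulated stratum by stratum
    variance = 0.0
    for h, clusters in psu_totals.items():
        n_h = len(clusters)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} needs at least two PSUs")
        z_bar = sum(clusters.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in clusters.values())
    return mean, variance


# Hypothetical design: two strata, two PSUs per stratum
y = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0, 2.0, 9.0]
w = [10.0, 10.0, 12.0, 12.0, 20.0, 20.0, 15.0, 15.0]
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 3, 3, 4, 4]

design_mean, design_var = linearized_variance_of_mean(y, w, strata, psu)
```

With equal weights, one stratum, and every observation as its own PSU, this reduces to the familiar SRS result of the sample variance divided by n, which is a useful sanity check.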
What is a subpopulation indicator?
- Subpopulation: interested in a specific subgroup of your sample
  - e.g., adolescents 12-17 years of age in a general population study
- Indicator: binary variable coded so that
  - 0 = do not include in analysis
  - 1 = include in analysis
- If you are looking at cigarette smokers aged 12-17, code so that
  - 1 = everyone aged 12-17 who smokes cigarettes
  - 0 = everyone 18+
  - 0 = 12-17 year olds who do not smoke cigarettes
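The coding rule above can be sketched in Python as follows. The records and the field names (`age`, `smokes`, `subpop`) are hypothetical; the point is that every respondent keeps a row and only the indicator changes.

```python
# Hypothetical respondents from a general population study
respondents = [
    {"age": 14, "smokes": True},
    {"age": 16, "smokes": False},
    {"age": 25, "smokes": True},
    {"age": 17, "smokes": True},
]

# 1 = aged 12-17 and smokes cigarettes; 0 = everyone else (18+ or non-smoking 12-17)
for person in respondents:
    in_subpop = 12 <= person["age"] <= 17 and person["smokes"]
    person["subpop"] = 1 if in_subpop else 0

indicator = [person["subpop"] for person in respondents]
```

No rows are dropped: the full sample stays in the file so the software can still see the complete design when estimating variances.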
Variance estimation in complex surveys
- Taylor series linearization
  - Subpopulation indicator necessary
- Replicate weights
  - Subpopulation indicator not used
- Model-based estimation
Software for complex survey analysis
- Stata
- Mplus
- R (packages ‘survey,’ ‘lavaan.survey’)
- SAS
- SUDAAN
- LISREL
- EQS
- WesVar
- SPSS with Complex Samples module (Taylor linearization only)
Software NOT for complex survey analysis
- AMOS
- HLM
SPSS Taylor Linearization Example
Finite population correction
SPSS Taylor Linearization Example
- Skip the finite population correction unless the sampling fraction n/N is .05 or more
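The n/N rule of thumb can be illustrated with a short Python sketch. It uses the simple random sampling form of the correction, where the standard error is multiplied by sqrt(1 - n/N); the function names are hypothetical, and the .05 threshold is the one given on the slide.

```python
import math


def sampling_fraction(n, N):
    """Share of the population that ended up in the sample."""
    return n / N


def fpc_factor(n, N):
    """Multiplier applied to an SRS standard error under the finite population correction."""
    return math.sqrt(1 - n / N)


def needs_fpc(n, N, threshold=0.05):
    """Rule of thumb from the slide: only bother when n/N is at least the threshold."""
    return sampling_fraction(n, N) >= threshold


# NSDUH-sized example: 17,399 respondents standing in for about 24.9 million people
large_survey = fpc_factor(17399, 24933052)

# Hypothetical small population where the sample is 10% of everyone
small_population = fpc_factor(100, 1000)
```

For a national survey the factor is essentially 1, so skipping the correction changes nothing; it starts to matter only when the sample is a sizable share of the population.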
SPSS Taylor Linearization Example
/* Setting up sampling plan */
* Analysis Preparation Wizard.
CSPLAN ANALYSIS
  /PLAN FILE='location\example plan.csaplan'
  /PLANVARS ANALYSISWEIGHT=weight
  /SRSESTIMATOR TYPE=WR
  /PRINT PLAN
  /DESIGN STRATA=stratavar CLUSTER=clustervar
  /ESTIMATOR TYPE=WR.
Don’t use Select Cases to restrict the analysis to a subgroup; use a subpopulation indicator instead
Weighted SPSS Example (n=17,062)
Unweighted SPSS Example (n=17,062)
Especially notable differences
Stata Taylor linearization example
Mplus Taylor linearization example
Variance estimation with replicates
- Jackknife repeated replicates (jkn; jrr; jrrw)
  - Three types of jackknifed replication
  - Certain types may require the application of a multiplier file
  - Pay attention to documentation!
- Balanced repeated replicates (brr; brr-Fay; Fay)
  - If a Fay’s adjustment is needed, documentation should say
  - Fay’s input depends partly on program used
- Typically dozens of replicate weights
- Subpopulation indicator can be used, but not necessary
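The logic behind replicate weights can be sketched in Python: re-estimate the statistic once per replicate weight, then scale the squared deviations from the full-sample estimate by a method-specific multiplier. The function names and data below are hypothetical. The multipliers in the comments are the usual textbook ones, but the dataset's documentation is the authority on which applies.

```python
def weighted_mean(y, w):
    """Weighted mean: sum of weight * value divided by the sum of weights."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def replicate_variance(y, full_weights, replicate_weights, multiplier):
    """Variance from replicate weights, centered on the full-sample estimate."""
    theta = weighted_mean(y, full_weights)
    squared_devs = sum((weighted_mean(y, rw) - theta) ** 2 for rw in replicate_weights)
    return theta, multiplier * squared_devs


# Usual multipliers for R replicates (check the documentation for your dataset):
#   delete-one jackknife (JK1): (R - 1) / R
#   balanced repeated replication: 1 / R
#   BRR with Fay's adjustment k: 1 / (R * (1 - k) ** 2)

# Hypothetical simple random sample with delete-one jackknife replicate weights
y = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(y)
full = [1.0] * n
jk1_weights = [[0.0 if i == r else n / (n - 1) for i in range(n)] for r in range(n)]

estimate, var_jk1 = replicate_variance(y, full, jk1_weights, multiplier=(n - 1) / n)
```

In this simple case the jackknife variance matches the SRS formula (sample variance divided by n), which shows that the replicate weights carry the design information that the stratum and PSU variables carry under linearization.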
Stata replicates example
Stata replicates example
- If indicated by the documentation
Basic Stata syntax
Some syntax adapted from Koziol & Arthur (2011)

/* Taylor series linearization */
svyset [pweight=wtvar], psu(clustervar) strata(stratavar) vce(linearized)

/* Jackknife replicates */
svyset [pweight=wtvar], jkrweight(repwt1-repwtn) vce(jackknife) mse
Note: Jackknife syntax varies by type (jk1, jk2, jkn). Additional syntax may be needed if more than 1 stratum per PSU. Stata cannot accommodate different numbers of strata in different PSUs.

/* Balanced repeated replicates */
svyset [pweight=wtvar], brrweight(repwt1-repwtn) vce(brr) mse
Note: Additional syntax may be needed if documentation specifies that a Fay’s adjustment needs to be applied.
Basic Mplus syntax
Some syntax adapted from Koziol & Arthur (2011)

! Taylor series linearization
DATA: FILE = "filepath\filename.csv";
VARIABLE:
  NAMES ARE all variable names here in order of appearance in dataset;
  USEVARIABLES ARE stratavar clustervar weightvar outcome predictors and covariates;
  MISSING ARE ALL (missingdatacode);
  SUBPOPULATION IS (indicat eq 1);  ! only if subpopulation analysis
  WEIGHT = weightvar;
  STRATIFICATION = stratavar;
  CLUSTER = clustervar;
ANALYSIS: TYPE = COMPLEX;  ! can be combined with other analysis types
MODEL: outcome ON predictor;
Basic Mplus syntax
Some syntax adapted from Koziol & Arthur (2011)

! Replicate weights
DATA: FILE = "filepath\filename.csv";
VARIABLE:
  NAMES ARE all variable names here in order of appearance in dataset;
  USEVARIABLES ARE weightvar repwt1-repwtn outcome predictors and covariates;
  MISSING ARE ALL (missingdatacode);
  WEIGHT = weightvar;
  REPWEIGHTS = repwt1-repwtn;
ANALYSIS:
  TYPE = COMPLEX;  ! can be combined with other analysis types
  REPSE = JACKKNIFE1;  ! substitute with other replicate type as needed
MODEL: outcome ON predictor;
SAS
SUDAAN
WesVar
R
Additional introductory resources
- Heeringa, S. G., West, B. T., & Berglund, P. A. (2017). Applied Survey Data Analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
- Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: John Wiley & Sons, Inc.
Less novice friendly
- Lohr, S. L. (2009). Sampling: Design and Analysis (2nd ed.). Boston, MA: Brooks/Cole, Cengage Learning.