Published by Lambert Moody. Modified over 6 years ago.
1
Introduction to secondary analysis of complex survey data
Brandon Nakawaki, PhD
November 8, 2017
2
Introduction
- Me
- What is secondary analysis
- What is “complex” survey data
3
Outline
- Secondary analysis
- Sampling concepts
- Use with statistical programs
- Goal: Basic understanding of typical secondary datasets, why weighting must be used with complex survey data, and how to use those weights.
- Why this talk?
4
Secondary analysis
- Advantages
  - Cost efficient
  - Often representative
  - Provides a potentially useful comparison
  - Sometimes it’s the only data source
  - Lots of variables
  - Large unweighted sample size
- Disadvantages
  - Measures not ideal
  - Statistical background required
5
Pooling datasets
- Cross-sectional designs
- Further boosts the stability of the sample
- May require different or adjusted weights
- Measures and major design elements must not change across the pooled datasets
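The slides do not say how pooled weights are adjusted. One widely used convention, assumed here rather than taken from the talk, is to divide each cycle's weights by the number of cycles pooled, so the combined file still represents one population rather than several stacked together. The function name and the toy weights are hypothetical, and the dataset's own documentation takes precedence.

```python
def pool_weights(weights_by_cycle):
    """Rescale each cycle's weights by 1/k, where k is the number of cycles.

    Assumption: every cycle was weighted to the same target population,
    so the pooled weights should sum to roughly one population, not k.
    """
    k = len(weights_by_cycle)
    return [w / k for cycle in weights_by_cycle for w in cycle]


# Two hypothetical survey years, each weighting up to 1,000 people.
year_a = [250.0, 250.0, 500.0]
year_b = [400.0, 600.0]
pooled = pool_weights([year_a, year_b])
print(sum(pooled))
```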
6
Where to get data
- Inter-university Consortium for Political and Social Research (ICPSR)
- Simple Online Data Archive for Population Studies (SODAPOP)
- National Data Archive on Child Abuse and Neglect (NDACAN)
- Institutional websites (government, college, contractor, research)
- Project websites
- Other locations
7
Sampling
- Simple random sampling (SRS)
  - Equal probability of selection
  - Observations are independent and identically distributed
  - Basis for most statistics
- Near SRS
Image adapted from Koziol & Arthur (2011)
8
Sampling
- Probability samples
  - Stratification
    - Divide into groups, randomly sample within those groups
    - Ensures a “good” sample, increases precision
Image adapted from Koziol & Arthur (2011)
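As a minimal sketch of stratification, the code below draws a simple random sample within each group and attaches a base weight of N_h / n_h (stratum population size over stratum sample size). The population, the stratum labels, and the function name are hypothetical; real surveys usually add further adjustments on top of this base weight.

```python
import random


def stratified_sample(population, stratum_of, n_per_stratum, seed=0):
    """Draw an SRS without replacement inside each stratum.

    Returns (unit, weight) pairs where weight = N_h / n_h, i.e. the
    number of population units each sampled unit stands for.
    """
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)

    sample = []
    for label in sorted(strata):
        units = strata[label]
        n_h = min(n_per_stratum, len(units))
        weight = len(units) / n_h
        for unit in rng.sample(units, n_h):
            sample.append((unit, weight))
    return sample


# Hypothetical population: 100 "small" units and 900 "large" units.
population = list(range(1000))
stratum_of = lambda u: "small" if u < 100 else "large"

sample = stratified_sample(population, stratum_of, n_per_stratum=50)
total_weight = sum(w for _, w in sample)
print(len(sample), total_weight)
```

Sampling 50 units from each group oversamples the small stratum, which is why its units carry a smaller weight than those in the large stratum.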
9
Sampling
- Probability samples
  - Clustering
    - Randomly sample entire groups
    - Convenient, decreases precision
Image adapted from Koziol & Arthur (2011)
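One rough way to see why clustering decreases precision is Kish's approximate design effect, deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation. This formula is not on the slide; it is added here as an illustration and assumes equal-sized clusters. The numbers are made up.

```python
def design_effect(cluster_size, icc):
    """Kish's approximation: variance inflation relative to an SRS of equal size."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of the SRS that would give about the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical design: 1,000 respondents in clusters of 20, modest similarity.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(n=1000, cluster_size=20, icc=0.05)
print(round(deff, 2), round(n_eff))
```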
10
Sampling
- Probability samples
  - Multiple stages of selection
  - With or without replacement
  - Poststratification
    - May be based on available accurate population totals (e.g., age, sex, geographic region)
    - Strong correlates of key survey variables
    - Predictors of noncoverage
  - Oversampling
  - May have fewer or more weighting options (strata, PSU, weights, replicates, etc.)
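A minimal sketch of a poststratification adjustment follows: within each cell (for example an age-by-sex group), weights are scaled so that they sum to a known population total. The cell labels, totals, and function name are hypothetical, and production weighting typically involves more steps than this.

```python
def poststratify(weights, cells, population_totals):
    """Scale weights so each cell's weighted count matches its known total."""
    weighted_count = {}
    for w, cell in zip(weights, cells):
        weighted_count[cell] = weighted_count.get(cell, 0.0) + w
    return [
        w * population_totals[cell] / weighted_count[cell]
        for w, cell in zip(weights, cells)
    ]


# Hypothetical base weights and known population totals per cell.
base_weights = [10.0, 10.0, 10.0, 10.0]
cells = ["female", "female", "male", "male"]
totals = {"female": 30.0, "male": 50.0}

adjusted = poststratify(base_weights, cells, totals)
print(adjusted)
```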
11
What to look for in documentation
- Generalizability
- Missing data
- Weights and design variables
  - “Weight”
  - “Stratum”
  - “Strata”
  - “Cluster”
  - “Primary sampling unit (PSU)”
- Variance estimation method
  - “Linearization”
  - “Taylorization”
  - “Replicate”
  - “Fay”
12
Common weights
- Sampling weights
  - Make the sample marginals look like those of the population from which the sample was drawn
  - Often include a poststratification adjustment (e.g., for person-level nonresponse)
13
NSDUH 2012
Age      Unweighted   Weighted
12       2,…          4,054,868
13       …            4,049,003
14       …            4,156,730
15       …            4,097,288
16       …            4,293,852
17       …,038        4,281,311
Total    17,399       24,933,052
2010 Census: 25,296,465
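The weighted counts on this slide can be checked with simple arithmetic: they sum to the weighted total, which can then be compared with the census figure. The sketch assumes the 2010 Census number refers to the same age range as the weighted counts; the slide does not state this explicitly.

```python
# Weighted counts by age as listed on the slide.
weighted_by_age = [4_054_868, 4_049_003, 4_156_730, 4_097_288, 4_293_852, 4_281_311]
unweighted_total = 17_399
census_2010 = 25_296_465  # assumed to cover the same age range

weighted_total = sum(weighted_by_age)
ratio_to_census = weighted_total / census_2010
avg_weight = weighted_total / unweighted_total  # people represented per respondent

print(weighted_total)
print(round(ratio_to_census, 3))
print(round(avg_weight))
```

The sum of the sampling weights estimates the size of the population, which is why the weighted total lands close to the census count while the unweighted total does not.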
14
Common weights
- Design weights/variables
  - PSU
  - Strata
  - Replicate weights
15
Variance estimation in complex surveys
- Again, not SRS: must account for the sampling design with weights
- Standard errors usually change with weights
- Point estimates may also change
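To illustrate why point estimates may change, here is a small sketch comparing an unweighted and a weighted proportion. The data are invented: two smokers from an oversampled group carry small weights, and two non-smokers carry large ones.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical respondents: 1 = smokes, 0 = does not smoke.
smokes = [1, 1, 0, 0]
weights = [1.0, 1.0, 4.0, 4.0]  # oversampled smokers get smaller weights

unweighted = sum(smokes) / len(smokes)
weighted = weighted_mean(smokes, weights)
print(unweighted, weighted)
```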
16
Weighted example (n=17,062)
17
Unweighted example (n=17,062)
18
Especially notable differences
19
Variance estimation in complex surveys
- Again, not SRS: must account for the sampling design with weights
- Three common methods
  - Taylor series linearization
  - Replicate weights
  - Model-based estimation
20
Variance estimation in complex surveys
- Taylor series linearization
  - Uses at least one clustering variable (PSU) and at least one stratification variable
- Replicate weights
  - Uses many replicate weights, usually numbered sequentially (e.g., weight01-weight100)
- Model-based estimation
  - Uses clustering and stratification variables in multilevel modeling
21
Variance estimation with linearization
- Taylor series linearization
  - Typically has a stratum variable, a cluster (PSU) variable, and one sampling weight to use at a time
  - Sometimes multiple sampling weights are available for use under different circumstances
  - If stratum and cluster variables are available, assume Taylor linearization (or check with the data curator)
  - Do not assume a method if only a sampling weight is available
  - Subpopulation indicator needed
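As a sketch of what the software does with the stratum, PSU, and weight variables, the code below computes a weighted mean and its Taylor-linearized standard error using the common with-replacement approximation (PSU-level totals of linearized scores, compared within strata). It is an illustration under those assumptions, not the exact routine of any package named in the talk; the function name and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean(y, w, strata, psu):
    """Weighted mean and linearized SE (with-replacement PSU approximation)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized score for a ratio: z_i = w_i * (y_i - mean) / sum(w),
    # summed to the PSU level within each stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, strata, psu):
        psu_totals[h][j] += wi * (yi - mean) / total_w

    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


# Hypothetical data: two strata, two PSUs in each.
y = [1, 0, 1, 1, 0, 0, 1, 0]
w = [10, 10, 20, 20, 15, 15, 5, 5]
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = linearized_mean(y, w, strata, psu)
print(mean, se)
```

With equal weights, one stratum, and every observation as its own PSU, this reduces to the familiar SRS standard error s / sqrt(n), which is a convenient sanity check.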
22
What is a subpopulation indicator?
- Subpopulation: a specific subgroup of your sample that you are interested in
  - e.g., adolescents 12-17 years of age in a general population study
- Indicator: binary variable coded so that
  - 0 = do not include in analysis
  - 1 = include in analysis
- If you are looking at cigarette smokers aged 12-17, code so that
  - 1 = everyone aged 12-17 who smokes cigarettes
  - 0 = everyone 18+
  - 0 = 12-17 year olds who do not smoke cigarettes
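The coding rule for the cigarette smokers aged 12-17 example can be sketched as follows. The respondent records and the function name are hypothetical. The point is that every respondent keeps a row and receives a 0 or 1, rather than being dropped from the file.

```python
def subpop_indicator(age, smokes_cigarettes):
    """1 = include in analysis (aged 12-17 and smokes), 0 = everyone else."""
    return 1 if 12 <= age <= 17 and smokes_cigarettes else 0


# Hypothetical respondents as (age, smokes cigarettes) pairs.
respondents = [(14, True), (16, False), (25, True), (12, True), (40, False)]
indicator = [subpop_indicator(age, smokes) for age, smokes in respondents]
print(indicator)
```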
23
Variance estimation in complex surveys
- Taylor series linearization
  - Subpopulation indicator necessary
- Replicate weights
  - Subpopulation indicator not used
- Model-based estimation
24
Software for complex survey analysis
- Stata
- Mplus
- R (packages ‘survey’, ‘lavaan.survey’)
- SAS
- SUDAAN
- LISREL
- EQS
- WesVar
- SPSS with Complex Samples module (Taylor linearization only)
25
Software NOT for complex survey analysis
- AMOS
- HLM
26
SPSS Taylor Linearization Example
27
SPSS Taylor Linearization Example
28
SPSS Taylor Linearization Example
29
SPSS Taylor Linearization Example
30
SPSS Taylor Linearization Example
31
Finite population correction
32
Finite population correction
33
SPSS Taylor Linearization Example
Skip unless n/N = .05 or more
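The rule of thumb above can be sketched in a few lines. The variance multiplier 1 - n/N is the standard finite population correction for sampling without replacement; the function names and the second example are hypothetical, and the 5% threshold is the one given on the slide.

```python
def sampling_fraction(n, N):
    """Share of the population that was sampled."""
    return n / N


def fpc_variance_factor(n, N):
    """Multiplier applied to the variance under without-replacement sampling."""
    return 1.0 - n / N


def fpc_worth_applying(n, N, threshold=0.05):
    """Rule of thumb from the slide: skip unless n/N is .05 or more."""
    return sampling_fraction(n, N) >= threshold


# A national survey: the sampling fraction is tiny, so the correction is skipped.
print(fpc_worth_applying(17_399, 25_296_465))

# A hypothetical small population where 10% was sampled.
print(fpc_worth_applying(100, 1_000), fpc_variance_factor(100, 1_000))
```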
34
SPSS Taylor Linearization Example
35
SPSS Taylor Linearization Example
/* Setting up sampling plan */
* Analysis Preparation Wizard.
CSPLAN ANALYSIS
  /PLAN FILE='location\example plan.csaplan'
  /PLANVARS ANALYSISWEIGHT=weight
  /SRSESTIMATOR TYPE=WR
  /PRINT PLAN
  /DESIGN STRATA=stratavar CLUSTER=clustervar
  /ESTIMATOR TYPE=WR.
36
SPSS Taylor Linearization Example
37
SPSS Taylor Linearization Example
38
SPSS Taylor Linearization Example
39
SPSS Taylor Linearization Example
40
Don’t use Select Cases; use a subpopulation indicator instead
41
Weighted SPSS Example (n=17,062)
42
Unweighted SPSS Example (n=17,062)
43
Especially notable differences
44
Stata Taylor linearization example
45
Stata Taylor linearization example
46
Stata Taylor linearization example
47
Stata Taylor linearization example
48
Stata Taylor linearization example
49
Stata Taylor linearization example
50
Stata Taylor linearization example
51
Stata Taylor linearization example
52
Stata Taylor linearization example
53
Mplus Taylor linearization example
54
Mplus Taylor linearization example
55
Variance estimation with replicates
- Jackknife repeated replicates (jkn; jrr; jrrw)
  - Three types of jackknifed replication
  - Certain types may require the application of a multiplier file
  - Pay attention to documentation!
- Balanced repeated replicates (brr; brr-Fay; Fay)
  - If a Fay’s adjustment is needed, documentation should say
  - Fay’s input depends partly on program used
- Typically dozens of replicate weights
- Subpopulation indicator can be used, but not necessary
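The general idea behind replicate-weight variance estimation can be sketched as follows: re-estimate the statistic once per replicate weight, then sum the squared deviations from the full-sample estimate and apply a multiplier that depends on the replication method. The multipliers shown are the textbook ones for delete-one jackknife (JK1), BRR, and BRR with Fay's adjustment; a given dataset may specify something different, so its documentation takes precedence. The replicate weights are built by hand here only to make the example self-contained, and all names are hypothetical.

```python
def weighted_mean(y, w):
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)


def replicate_variance(y, full_weights, replicate_weights, multiplier):
    """Variance = multiplier * sum over replicates of (theta_r - theta)^2."""
    theta = weighted_mean(y, full_weights)
    squared_devs = sum(
        (weighted_mean(y, rw) - theta) ** 2 for rw in replicate_weights
    )
    return theta, multiplier * squared_devs


def jk1_multiplier(n_replicates):
    return (n_replicates - 1) / n_replicates


def brr_multiplier(n_replicates, fay=0.0):
    # fay=0 gives plain BRR; 0 < fay < 1 applies Fay's adjustment.
    return 1.0 / (n_replicates * (1.0 - fay) ** 2)


def jk1_replicate_weights(full_weights):
    """Delete one unit per replicate and rescale the rest by n / (n - 1)."""
    n = len(full_weights)
    return [
        [0.0 if i == r else w * n / (n - 1) for i, w in enumerate(full_weights)]
        for r in range(n)
    ]


# Hypothetical equal-weight sample of five observations.
y = [1, 2, 3, 4, 5]
full_w = [1.0] * 5
rep_w = jk1_replicate_weights(full_w)

theta, variance = replicate_variance(y, full_w, rep_w, jk1_multiplier(len(rep_w)))
print(theta, variance)
```

For this equal-weight example the JK1 variance of the mean equals the usual s^2 / n, which gives a quick check that the multiplier is applied correctly.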
56
Stata replicates example
57
Stata replicates example
58
Stata replicates example
If indicated by the documentation
59
Basic Stata syntax

/* Taylor series linearization */
svyset [pweight=wtvar], psu(clustervar) strata(stratavar) vce(linearized)

/* Jackknife replicates */
svyset [pweight=wtvar], jkrweight(repwt1-repwtn) vce(jackknife) mse

Note: Jackknife syntax varies by type (jk1, jk2, jkn). Additional syntax may be needed if more than 1 stratum per PSU. Stata cannot accommodate different numbers of strata in different PSUs.

/* Balanced repeated replicates */
svyset [pweight=wtvar], brrweight(repwt1-repwtn) vce(brr) mse

Note: Additional syntax may be needed if documentation specifies that a Fay’s adjustment needs to be applied.

Some syntax adapted from Koziol & Arthur (2011)
60
Basic Mplus syntax

! Taylor series linearization
DATA:
  FILE = "filepath\filename.csv";
VARIABLE:
  NAMES ARE all variable names here in order of appearance in dataset;
  USEVARIABLES ARE stratavar clustervar weightvar outcome predictors and covariates;
  MISSING ARE ALL (missingdatacode);
  SUBPOPULATION IS (indicat EQ 1);  ! only if subpopulation analysis
  WEIGHT = weightvar;
  STRATIFICATION = stratavar;
  CLUSTER = clustervar;
ANALYSIS:
  TYPE = COMPLEX;  ! can be combined with other analysis types
MODEL:
  outcome ON predictor;

Some syntax adapted from Koziol & Arthur (2011)
61
Basic Mplus syntax

! Replicate weights
DATA:
  FILE = "filepath\filename.csv";
VARIABLE:
  NAMES ARE all variable names here in order of appearance in dataset;
  USEVARIABLES ARE weightvar repwt1-repwtn outcome predictors and covariates;
  MISSING ARE ALL (missingdatacode);
  WEIGHT = weightvar;
  REPWEIGHTS = repwt1-repwtn;
ANALYSIS:
  TYPE = COMPLEX;  ! can be combined with other analysis types
  REPSE = JACKKNIFE1;  ! substitute with other replicate type as needed
MODEL:
  outcome ON predictor;

Some syntax adapted from Koziol & Arthur (2011)
62
SAS
63
SUDAAN
64
WesVar
65
WesVar
66
WesVar
67
R
68
Additional introductory resources
- Heeringa, S. G., West, B. T., & Berglund, P. A. (2017). Applied Survey Data Analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
- Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: John Wiley & Sons, Inc.
Less novice friendly
- Lohr, S. L. (2009). Sampling: Design and Analysis (2nd ed.). Boston, MA: Brooks/Cole, Cengage Learning.