1
Complex Surveys
2
A Brief Review of the “Building Blocks”
3
Building Blocks of Surveys
Sampling methods: SRS, stratified, cluster, unequal-probability sampling, multistage
Estimation methods: weighted mean/sum, ratio/regression estimators
4
Simple Random Sampling (without replacement)
There are (N choose n) possible samples, each with probability 1/(N choose n)
Point estimate
Standard error
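A sketch of the standard SRS results, assuming the target is the population mean. The notation ($y_i$ for the sampled values, $\mathcal{S}$ for the sample, $s^2$ for the sample variance) is assumed here and is not taken from the slide.

```latex
\bar{y} = \frac{1}{n}\sum_{i \in \mathcal{S}} y_i,
\qquad
\mathrm{SE}(\bar{y}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}},
\qquad
s^2 = \frac{1}{n-1}\sum_{i \in \mathcal{S}} (y_i - \bar{y})^2
```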
5
Stratified Sampling
The estimate of the population total
Its variance
Sample size allocation: proportional, optimal
In general, stratified sampling with proportional allocation is more efficient than SRS
The more unequal the stratum means, the greater the benefit
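A sketch of the standard stratified estimators, in assumed notation: $N_h$ and $n_h$ are the population and sample sizes in stratum $h$, and $\bar{y}_h$ and $s_h^2$ are the sample mean and sample variance in that stratum. Proportional allocation takes $n_h = n N_h / N$.

```latex
\hat{t}_{\mathrm{str}} = \sum_{h=1}^{H} N_h \bar{y}_h,
\qquad
\hat{V}(\hat{t}_{\mathrm{str}}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \, \frac{s_h^2}{n_h}
```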
6
Cluster Sampling
Usually less efficient than other methods
Its efficiency relative to SRS depends on the intraclass correlation coefficient (ICC)
The larger the ICC, the less efficient cluster sampling is
Can reduce cost and is administratively convenient
One-stage or two-stage, with equal or unequal probabilities: point estimate, variance, confidence interval
Allocation of m and n for two-stage cluster sampling
7
Cluster Sampling without Replacement
Select a sample of n clusters without replacement based on the inclusion probabilities
Estimate each cluster total and its variance
Estimate the population total
Variance can be estimated by the formulas in ch5,6 or by resampling methods
8
Cluster Sampling with Replacement
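Select a sample of n clusters with replacement based on the selection probabilities
Estimate each cluster total
Calculate each estimated cluster total divided by its selection probability
Estimate the population total and its variance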
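The with-replacement steps can be sketched in Python as follows. This is a minimal illustration of the standard with-replacement (Hansen-Hurwitz type) estimator, not necessarily the exact formulas on the slide; the function name and the numbers are hypothetical.

```python
def with_replacement_total(cluster_totals, selection_probs):
    """Estimate a population total from n clusters drawn with replacement.

    cluster_totals[i]  : (estimated) total of the i-th sampled cluster
    selection_probs[i] : probability of drawing that cluster on a single draw
    """
    n = len(cluster_totals)
    if n < 2:
        raise ValueError("need at least two sampled clusters for a variance")
    # Each draw gives its own estimate of the population total.
    u = [t / p for t, p in zip(cluster_totals, selection_probs)]
    t_hat = sum(u) / n
    # Variance of the mean of the n per-draw estimates.
    var_hat = sum((ui - t_hat) ** 2 for ui in u) / (n * (n - 1))
    return t_hat, var_hat


# Hypothetical data: three sampled clusters.
totals = [20.0, 25.0, 38.0]
psi = [0.10, 0.10, 0.20]
t_hat, var_hat = with_replacement_total(totals, psi)
print(t_hat, var_hat ** 0.5)
```

Because the draws are independent, the variance needs only the spread of the per-draw estimates, which is why the with-replacement case is simpler than the without-replacement one.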
9
Ratio estimation
Biased
May result in a smaller MSE
Useful when the variables are linearly correlated
Regression estimation
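A minimal Python sketch of the ratio estimator of a population total, assuming an auxiliary variable x whose population total t_x is known. The function name and data are hypothetical.

```python
def ratio_estimate_total(y, x, t_x):
    """Ratio estimator of the population total of y.

    y, x : sampled values of the study and auxiliary variables
    t_x  : known population total of the auxiliary variable
    """
    b_hat = sum(y) / sum(x)  # estimated ratio of y to x
    return b_hat * t_x


# Hypothetical sample of three units, with t_x assumed known.
y = [12.0, 15.0, 20.0]
x = [10.0, 14.0, 18.0]
t_yr = ratio_estimate_total(y, x, t_x=1000.0)
print(t_yr)
```

The estimator is biased because it is a ratio of two random quantities, but when y is roughly proportional to x the reduction in variance can outweigh the bias.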
10
Complex Surveys Large surveys often involve several sampling strategies
11
Example 1: Malaria in Africa
12
An example: background
Malaria is a common public health problem in tropical and subtropical regions
It is infectious: people get it by being bitten by a certain kind of female mosquito
Without timely and proper treatment, the death rate can be very high
It can be prevented by using mosquito nets
Prevention is only effective if the nets are in widespread use
13
Summary Goal: To estimate the prevalence of bed net use in rural areas
Sampling frame: all rural villages of <3,000 people in The Gambia
14
The survey in Gambia (1991)
3000 rural villages
Stage / Sampling unit / Sampling method (top-down)
Strata: regions (eastern, central, western); stratified by region
Stage 1 / district / probability proportional to district size; 5 districts per region
Strata: PHC, non-PHC; stratified by PHC
Stage 2 / village / probability proportional to village size; 4 villages per district
Stage 3 / compound / SRS; 6 compounds per village
15
The survey in Gambia (1991)
3000 rural villages
Regions (eastern, central, western): stratified sampling
Districts: sampling with unequal probs, two-stage cluster, Ch 6
PHC / non-PHC: stratified sampling
Villages: sampling with unequal probs, two-stage cluster, Ch 6
Compounds: SRS (average number of nets per compound)
Sampling proceeds top-down; estimation proceeds bottom-up
16
The survey in Gambia (1991)
Calculating the estimated total and its variance directly looks complicated
It gets worse if we include ratio estimators
In practice, we can
Use sampling weights to obtain point estimates
Use computer-intensive methods, such as the jackknife and bootstrap, to obtain standard errors (ch9)
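The weights-plus-resampling recipe can be sketched in Python with a delete-one-PSU jackknife for a weighted mean. This sketch assumes a single stratum of sampled PSUs; a stratified design would apply the same idea within each stratum. All names and data are hypothetical.

```python
def weighted_mean(values, weights):
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def jackknife_se(psu_ids, values, weights):
    """Delete-one-PSU jackknife standard error of a weighted mean.

    Assumes all PSUs belong to one stratum. Dropping a PSU and rescaling
    the remaining weights by n/(n-1) leaves a weighted mean unchanged,
    so no explicit rescaling is needed here.
    """
    psus = sorted(set(psu_ids))
    n = len(psus)
    if n < 2:
        raise ValueError("need at least two PSUs")
    theta_hat = weighted_mean(values, weights)
    replicates = []
    for drop in psus:
        keep = [i for i, p in enumerate(psu_ids) if p != drop]
        replicates.append(
            weighted_mean([values[i] for i in keep], [weights[i] for i in keep])
        )
    var_hat = (n - 1) / n * sum((r - theta_hat) ** 2 for r in replicates)
    return theta_hat, var_hat ** 0.5


# Hypothetical sample: three PSUs, two observation units each, equal weights.
psu_ids = [1, 1, 2, 2, 3, 3]
values = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
weights = [10.0] * 6
theta_hat, se = jackknife_se(psu_ids, values, weights)
print(theta_hat, se)
```

Whole PSUs are deleted, not individual observations, so the replicate-to-replicate spread reflects the clustering in the design.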
17
Sampling weights
The sampling weight is the reciprocal of Pr(being selected)
Each sampled unit “represents” a certain number of units in the population
The whole sample “represents” the whole population
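A small Python sketch of these three statements, using a hypothetical stratified sample (stratum sizes, sample sizes, and values are invented for illustration).

```python
def sampling_weights(inclusion_probs):
    """Weight of each sampled unit = 1 / Pr(unit is selected)."""
    return [1.0 / p for p in inclusion_probs]


def weighted_total(values, weights):
    return sum(w * y for y, w in zip(values, weights))


# Hypothetical design: stratum A has N=100, n=2; stratum B has N=400, n=4.
inclusion_probs = [2 / 100] * 2 + [4 / 400] * 4
nets = [3, 1, 0, 2, 1, 1]  # hypothetical observed values

weights = sampling_weights(inclusion_probs)
n_hat = sum(weights)                    # the sample "represents" the population
t_hat = weighted_total(nets, weights)   # estimated population total
mean_hat = t_hat / n_hat                # estimated population mean
print(weights, n_hat, t_hat, mean_hat)
```

Each unit in stratum A stands for 50 population units and each unit in stratum B for 100, so the weights add up to the population size of 500.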
18
Sampling weights
Weights are used to deal with the effects of stratification and clustering on point estimates
Stratified sampling
19
Sampling weights Cluster sampling with equal probabilities
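A sketch of the standard weights for the two designs named on these slides, in assumed notation. For stratified sampling, $n_h$ of the $N_h$ units in stratum $h$ are sampled by SRS. For cluster sampling with equal probabilities, $n$ of $N$ clusters are sampled by SRS, and then $m_i$ of the $M_i$ units in cluster $i$; one-stage cluster sampling is the special case $m_i = M_i$.

```latex
w_{hj} = \frac{N_h}{n_h}
\quad\text{(stratified)},
\qquad
w_{ij} = \frac{N}{n}\cdot\frac{M_i}{m_i}
\quad\text{(two-stage cluster, equal probabilities)}
```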
20
Sampling weights For three-stage sampling
p: primary, s: secondary, t: tertiary
Very large weights are often truncated
Truncation biases the results but can reduce the mean squared error
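A Python sketch of a three-stage weight and of weight truncation, assuming the weight is the reciprocal of the product of the selection probabilities at the three stages. The probabilities and the cap are hypothetical.

```python
def three_stage_weight(pi_p, pi_s, pi_t):
    """1 / (Pr(primary) * Pr(secondary | primary) * Pr(tertiary | secondary))."""
    return 1.0 / (pi_p * pi_s * pi_t)


def truncate_weights(weights, cap):
    """Replace every weight above cap by cap (introduces some bias)."""
    return [min(w, cap) for w in weights]


# Hypothetical selection probabilities (primary, secondary, tertiary) for 3 units.
probs = [(0.5, 0.2, 0.1), (0.25, 0.2, 0.1), (0.05, 0.1, 0.1)]
weights = [three_stage_weight(*p) for p in probs]  # roughly 100, 200, 2000
trimmed = truncate_weights(weights, cap=500.0)
print(weights, trimmed)
```

After truncation the weights no longer sum to the same estimated population size, which is the source of the bias; the hoped-for gain is a smaller variance.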
21
Sampling weights
Weights contain the information needed to construct point estimates
Weights do not contain enough information for computing variances
Weights can be used to find point estimates, but calculating a variance requires Pr(a pair of units is selected) for each pair of units
Computer-intensive methods can be used to find variances
22
Sampling weights: the malaria example
Pr(a compound in central region PHC villages is selected)=
23
Self-weighting and Non-self-weighting
Self-weighting: the sampling weights for all observation units are equal
A self-weighting sample is representative of the population if nonsampling errors are ignored
Most large self-weighting samples are not SRSs
Standard software that assumes iid observations gives correct estimates of means, proportions, and percentiles, but incorrect estimates of variances
24
Ratio Estimation in Complex Surveys
Ratio estimation is part of the analysis, not the design
Can be used at any level, but is usually used near the top
25
Ratio Estimation in the Malaria example
Region level: Above the region level
26
Ratio Estimation in Complex Surveys
The bias of ratio estimation can be large when sample sizes are small
Separate ratio estimator for a population total
Improves efficiency when the ratios vary from stratum to stratum; works poorly when stratum sample sizes are small
Combined ratio estimator for a population total
Has less bias when stratum sample sizes are small; works poorly when the ratios vary from stratum to stratum
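The difference between the two estimators can be sketched in Python. This assumes that estimated stratum totals of y and x are already available and that the population totals of x are known per stratum; the function names and numbers are hypothetical.

```python
def separate_ratio_total(ty_hat, tx_hat, t_x):
    """Apply a ratio within each stratum, then add up the strata."""
    return sum(tx * (ty / txh) for ty, txh, tx in zip(ty_hat, tx_hat, t_x))


def combined_ratio_total(ty_hat, tx_hat, t_x):
    """Pool the strata first, then apply a single ratio."""
    return sum(t_x) * sum(ty_hat) / sum(tx_hat)


# Hypothetical estimates for two strata whose ratios differ (2 versus 3).
ty_hat = [100.0, 300.0]  # estimated stratum totals of y
tx_hat = [50.0, 100.0]   # estimated stratum totals of x
t_x = [60.0, 90.0]       # known stratum totals of x

sep = separate_ratio_total(ty_hat, tx_hat, t_x)
comb = combined_ratio_total(ty_hat, tx_hat, t_x)
print(sep, comb)
```

The separate estimator uses one ratio per stratum, so each ratio rests on a smaller sample; the combined estimator uses a single pooled ratio, which is more stable but ignores between-stratum differences in the ratio.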
27
Example 2: The National Health and Nutrition Examination Survey
28
Designed to assess the health and nutritional conditions of US residents
Combines interviews and physical examinations
A major program of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC)
29
Started in the early 1960s
Samples a nationally representative group of about 5,000 persons each year
Interviews: demographic, socioeconomic, dietary, and health-related questions
Examinations: medical, dental, and physiological measurements, laboratory tests
31
Stage 1: Primary sampling units (PSUs)
Most PSUs are single counties; some are groups of contiguous counties
Stage 2: Secondary sampling units: city blocks or something similar
Stage 3: Tertiary sampling units: households
Stage 4: Individuals
32
Stage 1
Select counties (or groups of contiguous counties)
Cluster sampling
Sampling probabilities: probability proportional to a measure of size (PPS)
33
Stage 2
Select city blocks
Cluster sampling
Sampling probabilities: probability proportional to a measure of size (PPS)
34
Stage 3
Select households
Stratified sampling: sampling from each selected city block
Sampling probabilities: oversample targeted groups defined by age, ethnicity, or income
35
Stage 4
Select individuals
Stratified sampling: sampling from each selected household
Sampling probabilities: “individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains”
1.6 persons per household
36
Examples of Oversampling
African Americans, Mexican Americans
Low-income White Americans (from 2000)
Adolescents (12-19), seniors (60+)
very low income women of childbearing age
37
Several R packages provide NHANES data
RNHANES
NHANES
data(NHANES): 10,000 rows, 76 variables
data(NHANESraw): 20,293 rows, 78 variables
survey
data(nhanes): 8,591 rows, 7 variables
38
Design variables
SDMVPSU: Masked Variance Unit Pseudo-PSU
Not the “true” design PSUs
A collection of secondary sampling units aggregated into groups (called Masked Variance Units) for the purpose of variance estimation
Produces variance estimates that closely approximate the variances that would have been estimated using the “true” design structure
svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=NHANESraw)
39
Sampling weights NHANESraw: WTMEC2YR
40
WTMEC2YR vs WTINT2YR
> sum(NHANESraw$WTINT2YR)
[1] 608534400
> sum(NHANESraw$WTMEC2YR)
> sum(NHANESraw$WTMEC2YR==0)
[1] 702
702 subjects had interview data but not MEC data
The totals are the same
The total weight is about twice the US population, as the weights are for two years
41
Estimating a Distribution Function
Historically, sampling theory was developed to find population means, totals, and ratios
Other quantities, such as
Pr(Statistics > means or totals)
Median? 95th percentile? Probability mass function?
Sampling weights can be used in constructing an empirical distribution of the population
42
Population quantities and functions
Probability mass function (pmf) Distribution function
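A sketch of the two population functions for a finite population of $N$ units with values $y_1, \dots, y_N$. The notation is assumed here.

```latex
f(y) = \frac{\#\{\, i : y_i = y \,\}}{N},
\qquad
F(y) = \sum_{x \le y} f(x) = \frac{\#\{\, i : y_i \le y \,\}}{N}
```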
43
Empirical Functions Empirical probability mass function
Empirical distribution function
Empirical functions can be used to estimate population quantities such as the mean, median, percentiles, and variance
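A Python sketch of the weighted empirical functions, assuming the usual definition in which the empirical pmf at y is the sum of the weights of sampled units with value y divided by the sum of all weights. Function names and data are hypothetical.

```python
def weighted_ecdf(values, weights):
    """Return the sorted distinct values and the empirical cdf at each one."""
    total = sum(weights)
    mass = {}
    for y, w in zip(values, weights):
        mass[y] = mass.get(y, 0.0) + w  # weight attached to each distinct value
    ys = sorted(mass)
    cdf, running = [], 0.0
    for y in ys:
        running += mass[y]
        cdf.append(running / total)
    return ys, cdf


def weighted_quantile(values, weights, q):
    """Smallest observed value whose empirical cdf is at least q."""
    ys, cdf = weighted_ecdf(values, weights)
    for y, f in zip(ys, cdf):
        if f >= q:
            return y
    return ys[-1]


# Hypothetical sample in which the largest value carries most of the weight.
values = [1, 2, 3, 10]
weights = [5.0, 5.0, 10.0, 80.0]
median_w = weighted_quantile(values, weights, 0.5)
median_unw = weighted_quantile(values, [1.0] * 4, 0.5)
print(median_w, median_unw)
```

Ignoring the weights treats every sampled unit as representing the same number of population units, which can move the estimated median substantially when the design is not self-weighting.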
44
Plotting data from a complex survey
SRS
Histograms/smoothed density estimates
Scatterplots and scatterplot matrices
In a complex sampling design, simple plots can be misleading
45
Incorporating weights
46
NHANES
47
Incorporating weights
48
NHANES
49
NHANES
50
NHANES
51
NHANES
52
NHANES
53
Design effects Cornfield’s ratio (1951)
Measures the efficiency of a sampling plan by the ratio of the variance that would be obtained from an SRS of k observation units to the variance obtained from the complex sampling plan with k observation units
The design effect (deff, Kish 1965)
The reciprocal of Cornfield’s ratio
54
The design effects
The design effect provides a measure of the precision gained/lost by use of the more complex design instead of SRS
For estimating a mean
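For a mean, the definition can be written as below. This is a sketch in assumed notation: $n$ is the number of observation units, $N$ the population size, and $S^2$ the population variance.

```latex
\mathrm{deff}(\text{plan}, \hat{\bar{y}})
= \frac{V(\hat{\bar{y}} \text{ under the sampling plan})}
       {V(\bar{y} \text{ under an SRS of the same size})}
= \frac{V_{\text{plan}}(\hat{\bar{y}})}{\left(1 - \dfrac{n}{N}\right)\dfrac{S^2}{n}}
```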
55
The design effects Stratified Cluster
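A sketch of the usual approximations for these two designs, in assumed notation: $S_h^2$ is the variance within stratum $h$, $S^2$ the overall population variance, $M$ the common cluster size, and ICC the intraclass correlation coefficient. The stratified expression assumes proportional allocation; the cluster expression assumes one-stage sampling of equal-sized clusters.

```latex
\mathrm{deff}_{\text{stratified}} \approx \frac{\sum_{h} \dfrac{N_h}{N}\, S_h^2}{S^2},
\qquad
\mathrm{deff}_{\text{cluster}} \approx 1 + (M - 1)\,\mathrm{ICC}
```

The first is usually below 1 (stratification gains precision) and the second is usually above 1 (clustering loses precision).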
56
Design Effects and Confidence Intervals
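One common use of a design effect is to widen an SRS-style confidence interval. The Python sketch below assumes the design effect is known or estimated from a similar survey and ignores the finite population correction; the function name and numbers are hypothetical.

```python
import math


def deff_confidence_interval(mean, s2, n, deff, z=1.96):
    """Approximate CI for a mean: SRS standard error inflated by sqrt(deff).

    mean : estimated mean from the complex survey
    s2   : sample variance of the observations
    n    : number of observation units
    deff : design effect (assumed known or borrowed from a similar survey)
    """
    se_srs = math.sqrt(s2 / n)           # standard error if the sample were an SRS
    half_width = z * math.sqrt(deff) * se_srs
    return mean - half_width, mean + half_width


# Hypothetical inputs.
low, high = deff_confidence_interval(mean=10.0, s2=4.0, n=100, deff=2.25)
effective_n = 100 / 2.25  # SRS size that would give the same precision
print(low, high, effective_n)
```

With a design effect of 2.25 the interval is 1.5 times as wide as the SRS interval, and the 100 observations carry about as much information as an SRS of roughly 44.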