1
Cluster Sampling STAT262
2
A new sampling method
Motivating example
Want to study the average amount of water used per person per month.
How would you design a survey?
3
A new sampling method
Consider the two strategies:
Sample person by person
Sample household by household
Which one do you prefer, and why?
4
A new sampling method
In the water usage example, I would sample households; in other words, I would use the household as the sampling unit.
I do this for convenience.
I am interested in the average water usage per person per month, but I sample households.
5
A new sampling method
The water usage example is an example of cluster sampling.
Households are the primary sampling units (PSUs), or clusters.
Persons are the secondary sampling units (SSUs). They are the elements of the population.
6
Definition of Cluster Sampling
Take an SRS of clusters.
An element of the population is allowed in the sample only if it belongs to a sampled cluster.
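A minimal R sketch of this definition, assuming a hypothetical toy population of 12 persons in 6 households (the `household` and `water` values are made up for illustration): take an SRS of clusters, then keep every element of the sampled clusters.

```r
# Hypothetical toy population: 12 persons in 6 households (clusters).
set.seed(1)
pop <- data.frame(
  household = rep(1:6, times = c(1, 2, 3, 1, 2, 3)),
  water     = round(runif(12, min = 2, max = 6), 1)  # made-up usage values
)

N <- length(unique(pop$household))  # number of clusters (PSUs) in the population
n <- 3                              # number of clusters to sample

# SRS of clusters, without replacement.
sampled_clusters <- sample(unique(pop$household), size = n)

# One-stage cluster sample: keep every element of each sampled cluster.
clus_sample <- pop[pop$household %in% sampled_clusters, ]
clus_sample
```

The number of persons in `clus_sample` is random, because it depends on which households happen to be drawn.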
7
Stratified sampling vs Cluster sampling
The two sampling methods look similar.
Like a stratum, a cluster is a group of elements of the population.
But the sampling schemes are different:
Stratified: take an SRS of elements within each stratum. If there are H strata, we have H SRSs.
Cluster: take an SRS of the clusters. For each selected cluster, we select all its elements.
See the following two slides.
8
Stratified sampling
9
Cluster sampling
10
Stratified sampling vs Cluster sampling
The variance of the estimate of the population mean depends on the variability of values within strata.
Stratified sampling usually improves on the precision of SRS.
11
Stratified sampling vs Cluster sampling
Stratified sampling: for greater precision,
individual elements within each stratum should have similar values
stratum means should differ from each other as much as possible
12
Stratified sampling vs Cluster sampling
The cluster is the sampling unit.
The more clusters we sample, the smaller the variance.
The variance of the estimate of the population mean depends primarily on the variability between cluster means.
13
Stratified sampling vs Cluster sampling
For greater precision, individual elements within each cluster should be heterogeneous, and cluster means should be similar to one another.
Cluster sampling usually ??? the precision of SRS
14
California Schools There are 757 school districts
A school district is a form of special-purpose district that operates local public primary and secondary schools, for formal academic or scholastic teaching, in various nations.
The district sizes vary.
15
California Schools
Some large school districts:
Los Angeles Unified: 552
San Diego Unified: 142
San Francisco Unified: 100
Irvine Unified: 29
district size   1      2      3      4      5      5-10   >10
%               24.7%  11.3%  11.9%  9.1%   6.7%   15.9%  20.0%
16
Want to estimate the average API99
17
Why does cluster sampling tend to reduce precision?
Elements of the same cluster tend to be more similar to each other than elements selected at random. E.g.,
Members of the same household tend to have similar political views
Fish in the same lake tend to have similar concentrations of mercury
Residents of the same nursing home tend to have similar opinions of the quality of care
18
Why does cluster sampling tend to reduce precision?
The similarities arise because of underlying factors that may or may not be measurable:
Residents of the same nursing home may have similar opinions because the care is poor
The concentration of mercury in the fish reflects the concentration of mercury in the lake
19
Why does cluster sampling tend to reduce precision?
Because elements within a cluster are similar, each additional element from the same cluster adds less information than an element chosen at random.
By sampling everyone in the cluster, we partially repeat the same information instead of obtaining new information.
As a result, cluster sampling leads to less precision.
20
Motivation of using cluster sampling
A sampling frame listing the observation units may be difficult or expensive to construct, or unavailable
  E.g., we cannot list all honeybees in a region
The population may be widely distributed geographically or may occur in natural clusters
  E.g., nursing home residents cluster in nursing homes
Cluster sampling leads to convenience and reduced cost
Cluster sampling may result in more information per dollar spent
21
Versions of cluster sampling: one-stage vs two-stage cluster sampling
We will consider one-stage and two-stage sampling.
One-stage sampling: every element within a sampled cluster is included in the sample
Two-stage sampling: we subsample only some of the elements of the selected clusters
22
One-stage cluster sampling
[Figure: panels (1), (2), (3)]
23
Two-stage cluster sampling
[Figure: panels (1), (2), (3)]
24
Notation for cluster sampling
25
Notation for cluster sampling
26
Notation for cluster sampling
27
Notation for cluster sampling
28
One-stage cluster sampling
[Figure: panels (1), (2), (3)]
29
One-stage cluster sampling
Every element within a sampled cluster (PSU) is included in the sample.
Either “all” or “none” of the elements that compose a cluster (PSU) are in the sample.
30
Clusters of equal sizes
Most naturally occurring clusters do not fit into this framework
Can occur in agricultural and industrial sampling
Estimating population means or totals is simple
We treat the cluster means or totals as the observations and simply ignore the individual elements
We have an SRS of n observations {t_i, i in S}, where t_i is the total for all the elements in PSU i.
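Treating the cluster totals as an SRS can be sketched in R as follows. The numbers are hypothetical, and the formulas are the usual SRS formulas applied to the totals t_i, with N clusters of M elements each; this is an illustration under those assumptions, not the slide's own worked example.

```r
# Hypothetical data: N = 100 clusters of M = 4 elements each; n = 5 sampled.
N <- 100
M <- 4
t_i <- c(12.2, 10.0, 11.2, 13.4, 12.0)  # made-up totals of the sampled clusters
n <- length(t_i)

t_hat <- N * mean(t_i)                         # estimated population total
se_t  <- N * sqrt((1 - n / N) * var(t_i) / n)  # SE, with finite population correction
ybar  <- t_hat / (N * M)                       # estimated mean per element
se_y  <- se_t / (N * M)                        # SE of the mean per element
c(t_hat = t_hat, se_t = se_t, ybar = ybar, se_y = se_y)
```

Only the n cluster totals enter the calculation; the individual element values are not needed once the totals are known.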
31
Clusters of equal sizes
32
Clusters of equal sizes
Nothing is new here
33
Clusters of equal sizes: an example
34
Clusters of equal sizes: an example
35
Clusters of equal sizes: sampling weights
36
Theory of Cluster sampling with equal sizes
37
Theory of Cluster sampling with equal sizes
In one-stage cluster sampling, the variability of the unbiased estimator of t depends entirely on the between-cluster part of the variability.
For cluster sampling
38
Theory of Cluster sampling with equal sizes
When MSB/MSW is large:
MSB is relatively large: elements in different clusters vary more than elements in the same cluster
Cluster sampling is less precise than SRS
If MSB > S^2, cluster sampling is less precise than SRS
39
Measurements of correlation
ICC (or ρ): Intraclass (or intracluster) Correlation Coefficient
Describes how similar elements in the same cluster are
Provides a measure of homogeneity within the clusters
Definition:
It can be shown that
40
Measurements of correlation
If SSB=0, then
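For reference, a standard textbook form of the ICC for N clusters of equal size M is given below. It is stated here as an assumption about the notation the slides use (SSW, SSB and SSTO are the within, between and total sums of squares of the population ANOVA table).

```latex
\mathrm{ICC} = 1 - \frac{M}{M-1}\,\frac{\mathrm{SSW}}{\mathrm{SSTO}},
\qquad \mathrm{SSTO} = \mathrm{SSB} + \mathrm{SSW}.

% If SSB = 0, then SSW = SSTO, so the ICC takes its smallest possible value:
\mathrm{ICC} = 1 - \frac{M}{M-1} = -\frac{1}{M-1}.
```

At the other extreme, SSW = 0 (all elements within a cluster identical) gives ICC = 1.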
41
One-stage cluster sampling with equal sizes vs SRS
If N is large:
1+(M-1)ICC SSUs, taken in a one-stage cluster sample, give the same amount of information as one SSU from an SRS
E.g., ICC=1/2, M=5, then 1+(M-1)ICC=3 → 300 SSUs in the cluster sample = 100 SSUs in an SRS
If ICC<0, cluster sampling is more efficient than SRS
ICC is rarely negative in naturally occurring clusters
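The arithmetic in the example above can be checked with a few lines of R. The function name `design_effect` is a hypothetical helper introduced here for illustration.

```r
# Approximate design effect for one-stage cluster sampling with equal
# cluster sizes when N is large: 1 + (M - 1) * ICC.
design_effect <- function(M, icc) 1 + (M - 1) * icc

deff <- design_effect(M = 5, icc = 0.5)  # slide example: ICC = 1/2, M = 5
n_cluster_ssus <- 300                    # SSUs in the one-stage cluster sample
n_srs_equiv <- n_cluster_ssus / deff     # SSUs in an SRS carrying about the same information
c(deff = deff, n_srs_equiv = n_srs_equiv)
```

With ICC = 0 the factor equals 1, so a cluster sample and an SRS of the same size carry about the same information.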
42
The GPA example The population ANOVA table (estimated)
43
The GPA example The population ANOVA table (estimated)
The sample mean square total should not be used to estimate S^2 when n is small
The data were collected as a cluster sample, so they do not reflect enough of the cluster-to-cluster variability
Multiply the unbiased estimates of MSB and MSW by the df from the population ANOVA table to estimate the population sums of squares
44
The GPA example The population ANOVA table (estimated)
45
The GPA example
46
Clusters of unequal sizes
The adjusted R^2 measures the relative amount of variability in the population explained by the cluster means, adjusted for the number of degrees of freedom.
If the clusters are homogeneous, then the cluster means are highly variable relative to the variation within clusters, and R^2 will be high.
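A common textbook definition of the adjusted R^2 in this setting, stated as an assumption about the form the slides use (MSW is the within-cluster mean square and S^2 the population variance), is:

```latex
R_a^2 = 1 - \frac{\mathrm{MSW}}{S^2}
```

Under this definition, homogeneous clusters have MSW much smaller than S^2, which pushes R_a^2 toward 1.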
47
An example
48
An example
49
The GPA example
50
The GPA example The population ANOVA table (estimated)
51
Clusters of unequal sizes
In social surveys, clusters are usually of unequal sizes.
In a one-stage sample, we will introduce two methods to estimate the population total/mean:
Unbiased estimation
Ratio estimation
52
Unbiased estimation for cluster sampling with unequal sizes
53
California Schools: api00
The api data in the “survey” package contain 6194 schools in 757 school districts
A one-stage cluster sample (apiclus1): 183 schools in 15 school districts were sampled
The original weights are wrong:
The district sizes vary a lot:
54
California Schools: api00
It makes sense to estimate the average api00 across schools.
Use the unbiased estimator
55
California Schools: api00
If we use the “svytotal” function in the survey package:
> svytotal(~api00, dclus1)  # svytotal is a function in the survey package
You will see the same results.
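The design object `dclus1` is not defined on the slide. One way it might be built is sketched below, assuming the survey package is installed. Supplying only the population count of districts through `fpc` lets `svydesign()` derive the sampling probabilities itself, so the weights come out as N/n = 757/15 instead of the `pw` column; whether the lecture built `dclus1` this way is an assumption.

```r
# Assumes the survey package is installed: install.packages("survey")
library(survey)
data(api)   # loads apipop, apiclus1, apiclus2, ...

# One-stage cluster design: districts (dnum) are the PSUs.
# With fpc given as the population number of districts and no weights
# supplied, the sampling probabilities are computed as n/N.
dclus1 <- svydesign(id = ~dnum, fpc = ~fpc, data = apiclus1)

svytotal(~api00, dclus1)   # estimated population total of api00, with its SE
```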
56
California Schools: api00
How about the population mean of api00?
If we use the unbiased estimator, we have
How does it compare with the true population mean of api00?
They are quite different. Bad luck or something else?
57
Unbiased estimation for cluster sampling with unequal sizes
In the previous calculation, nothing is different from cluster sampling with equal sizes.
The problem is that the between-cluster variance of the totals is large when the cluster sizes differ a lot, because we expect large totals from large clusters.
Therefore, we consider another estimator.
58
Ratio estimation for cluster sampling with unequal sizes
59
Ratio estimation for cluster sampling with unequal sizes
where
60
Ratio estimation for cluster sampling with unequal sizes
Note, it is not difficult to find that
The variance of the ratio estimator depends on the variability of the means per element in the clusters
It can be much smaller than that of the unbiased estimator
The ratio estimator of the population total requires the total number of elements in the population, K. The unbiased estimator of the population total does not require K.
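The two estimators can be compared on a small hypothetical one-stage sample. All numbers below, including K, are made up for illustration, and the standard error is the usual linearization form for a ratio estimator, with the sample mean cluster size standing in for K/N.

```r
# Hypothetical one-stage cluster sample: n = 4 of N = 20 clusters.
N   <- 20
K   <- 200                  # assumed total number of elements in the population
M_i <- c(2, 5, 10, 30)      # made-up sizes of the sampled clusters
t_i <- c(11, 24, 52, 148)   # made-up totals of the sampled clusters
n   <- length(t_i)

t_unb    <- N / n * sum(t_i)     # unbiased estimator of the population total
ybar_unb <- t_unb / K            # mean obtained by dividing the unbiased total by K
ybar_r   <- sum(t_i) / sum(M_i)  # ratio estimator of the mean per element

# Estimated SE of the ratio estimator of the mean.
M_bar <- mean(M_i)
se_r  <- sqrt((1 - n / N) * sum((t_i - ybar_r * M_i)^2) / ((n - 1) * n * M_bar^2))
c(ybar_unb = ybar_unb, ybar_r = ybar_r, se_r = se_r)
```

In this toy sample the large cluster pulls the unbiased total upward, while the ratio estimator divides by the sampled sizes and so is not affected by the sample happening to contain large clusters.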
61
California Schools: api00
Use the ratio estimator
The result is much closer to the true value!
We can also use “svymean”
62
Switch to “Ratio and Regression Estimators”
We will come back to cluster sampling
63
Two-stage cluster sampling
In one-stage cluster sampling, we
Examine all the SSUs within the selected PSUs
Obtain redundant information because SSUs in a PSU tend to be similar
This can be expensive
An alternative: take a subsample within each selected PSU (two-stage cluster sampling)
64
Two-stage cluster sampling with equal probability
65
Two-stage cluster sampling with equal probability
Compared with one-stage cluster sampling, two-stage sampling uses one extra stage.
The extra stage complicates the notation and estimators, as one needs to consider variability arising from both stages of data collection.
The point estimates are similar to those in one-stage sampling, but the variances are much more complicated.
66
Two-stage cluster sampling with equal probability: an unbiased estimator
Since we do not observe every SSU in the sampled PSUs, we need to estimate the totals for the sampled PSUs.
An unbiased estimator of the population total is
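The usual textbook form of this estimator is written out below, as an assumption about the notation used in the deck: S is the set of sampled PSUs, S_i the set of sampled SSUs in PSU i, M_i the size of PSU i, and m_i the number of SSUs sampled from it.

```latex
\hat{t}_i = \frac{M_i}{m_i}\sum_{j \in S_i} y_{ij} = M_i\,\bar{y}_i,
\qquad
\hat{t}_{\mathrm{unb}} = \frac{N}{n}\sum_{i \in S}\hat{t}_i
  = \frac{N}{n}\sum_{i \in S}\frac{M_i}{m_i}\sum_{j \in S_i} y_{ij}.
```

When m_i = M_i for every sampled PSU, each estimated total equals the true cluster total and the expression reduces to the one-stage estimator.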
67
Two-stage cluster sampling with equal probability: an unbiased estimator
The estimator is unbiased, taking the expectation over the second stage and then over the first stage.
68
Two-stage cluster sampling with equal probability: an unbiased estimator
Because the estimated PSU totals are random variables, the variance of the estimator has two components:
The variability between PSUs
The variability within PSUs
Recall that Var[Y]=Var[E[Y|X]] + E[Var[Y|X]]
Here
69
Two-stage cluster sampling with equal probability: an unbiased estimator
sample mean of the cluster totals if we use one-stage sampling
captures the between-cluster differences
captures the within-cluster differences
70
Two-stage cluster sampling with equal probability
71
Two-stage cluster sampling with equal probability: an unbiased estimator
It can be shown that an unbiased estimator of the variance is
For the population mean
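The standard textbook expressions are sketched below, assuming the deck's notation: the estimated PSU totals are written with hats, s_i^2 is the sample variance of the y values within sampled PSU i, and K is the total number of elements in the population.

```latex
\hat{V}(\hat{t}_{\mathrm{unb}})
  = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}
  + \frac{N}{n}\sum_{i \in S}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{s_i^2}{m_i},
\qquad
s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(\hat{t}_i - \frac{\hat{t}_{\mathrm{unb}}}{N}\right)^2 .

\hat{\bar{y}}_{\mathrm{unb}} = \frac{\hat{t}_{\mathrm{unb}}}{K},
\qquad
\mathrm{SE}\!\left(\hat{\bar{y}}_{\mathrm{unb}}\right)
  = \frac{\mathrm{SE}\!\left(\hat{t}_{\mathrm{unb}}\right)}{K}.
```

The first term is the one-stage (between-PSU) contribution; the second term vanishes when every sampled PSU is fully enumerated (m_i = M_i).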
72
Two-stage cluster sampling with equal probability: a ratio estimator
As in one-stage cluster sampling with unequal sizes, the between-PSU variance can be very large since it is affected both by variations in the cluster sizes and by variation in y.
73
Ratio Estimator
The variance due to within-cluster variability is the same as that in the unbiased estimation
where
74
The egg volume example
A study (Arnold 1991) on the egg volume of American coot eggs in Minnesota.
We looked at volumes of a subsample of eggs in clutches (nests of eggs) with at least two eggs.
For each sampled clutch, two eggs were measured.
75
The egg volume example
76
The egg volume example
77
The egg volume example
78
The egg volume example
N is unknown but presumably large.
79
California Schools: api00
A two-stage cluster sample
40 districts were sampled in the first stage
126 schools were sampled
80
California Schools: api00
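library(survey)   # provides the api data sets
data(api)
N = apiclus2$fpc1[1]                                       # districts in the population
n = length(unique(apiclus2$dnum))                          # sampled districts
Mi = sapply(split(apiclus2$fpc2, apiclus2$dnum), unique)   # schools in each sampled district
mi = sapply(split(apiclus2$api00, apiclus2$dnum), length)  # schools sampled in each district
y.r = sum(apiclus2$api00*apiclus2$pw)/sum(apiclus2$pw)     # ratio estimate of the mean
yi = sapply(split(apiclus2$api00, apiclus2$dnum), mean)    # district sample means
si = sapply(split(apiclus2$api00, apiclus2$dnum), var)     # district sample variances
si[is.na(si)] = 0                                          # one sampled school: variance undefined, set to 0
M.bar = mean(Mi)                                           # mean size of the sampled districts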
81
California Schools: api00
y.r.var = ( sum((Mi*yi - Mi*y.r)^2)/(n-1)/n*(1 - n/N)     # between-district component
          + sum(Mi^2*(1 - mi/Mi)*si/mi)/n/N ) / M.bar^2   # within-district component
y.r
sqrt(y.r.var)   # standard error of y.r
82
Using weights in cluster samples
For estimating overall means and totals in cluster samples, most survey statisticians use sampling weights.
Weights can be used to find a point estimate of almost any quantity of interest.
For cluster sampling:
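A small R sketch of weighted estimation in a two-stage sample with equal probabilities at each stage. The data are hypothetical, and the weight formula (N/n)(M_i/m_i), the reciprocal of the probability that an SSU is selected, is the standard one assumed here.

```r
# Hypothetical two-stage sample: n = 2 of N = 10 PSUs.
N <- 10
n <- 2
smp <- data.frame(
  psu = c(1, 1, 2, 2, 2),
  M_i = c(4, 4, 6, 6, 6),   # made-up PSU sizes
  y   = c(3, 5, 2, 4, 6)    # made-up observations
)
smp$m_i <- ave(smp$y, smp$psu, FUN = length)  # number of SSUs sampled in each PSU

# Weight = 1 / P(SSU is in the sample) = (N / n) * (M_i / m_i).
smp$w <- (N / n) * (smp$M_i / smp$m_i)

t_hat <- sum(smp$w * smp$y)   # estimated population total
y_hat <- t_hat / sum(smp$w)   # weighted estimate of the mean per element
c(t_hat = t_hat, y_hat = y_hat)
```

Here every SSU ends up with the same weight because M_i/m_i is constant across the sampled PSUs; with other subsampling fractions the weights would differ between PSUs.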
83
Using weights in cluster samples
84
SRS : one-stage cluster: two-stage cluster
For simplicity, we consider only one estimator from each of the three sampling methods.
85
SRS : one-stage cluster: two-stage cluster
Assume (nm) SSUs are sampled
86
SRS : one-stage cluster: two-stage cluster
Recall that Therefore,
87
SRS : one-stage cluster: two-stage cluster
We have defined ICC (ρ)
88
SRS : one-stage cluster: two-stage cluster
89
SRS : one-stage cluster: two-stage cluster
If we use nm SSUs in a one-stage cluster sample, the number of PSUs is n' = nm/M
90
SRS : one-stage cluster: two-stage cluster
If we use nm SSU’s in an SRS
91
SRS : one-stage cluster: two-stage cluster
92
Design a cluster survey
For an expensive, large-scale survey, it is worth spending a great deal of effort on the design.
It can take several years to design and pre-test.
For designing a cluster sample:
What overall precision is needed?
What size should the PSUs be?
How many SSUs should be sampled in each sampled PSU?
How many PSUs should be sampled?
93
Choosing the PSU size
In many situations, the PSU size exists naturally. E.g., a clutch of eggs, a household
In some situations, one needs to choose PSU sizes. E.g., the area of a region: 1 km^2, 2 km^2, …
Many ways to “try out” different PSU sizes: pilot study, perform an experiment
The goal is to get the most information for the least cost and inconvenience
94
Two-stage cluster design with equal cluster size and equal variance
95
Two-stage cluster design with equal cluster size and equal variance
96
Two-stage cluster design with equal cluster size and equal variance
Graphing the variance for varying m and n gives more information.
It is useful to examine:
What if the costs or the cost function are slightly different?
What if changes slightly?
97
The GPA example
98
The GPA example
99
Summary of two-stage cluster
Cluster sampling is widely used in large surveys
Variances from cluster samples are usually greater than those from SRSs with the same number of SSUs
Less expensive: the per-dollar information from cluster sampling might be greater than that of SRS
100
Summary of two-stage cluster
101
Review
Sampling methods:
  SRS
  Stratified sampling
  Cluster sampling
    One-stage cluster sampling
    Two-stage cluster sampling
Estimation methods:
  Unbiased estimators
  Ratio / regression estimators
102
SRS
103
Stratified sampling
104
One-stage cluster sampling
105
One-stage cluster sampling
106
SRS
The sample mean:
In practice,
107
Stratified sampling: stratum h
108
Stratified Sampling: Estimation
109
Unbiased estimation for cluster sampling with unequal sizes
110
Ratio estimation for cluster sampling with unequal sizes
111
Ratio estimation for cluster sampling with unequal sizes
where
112
Ratio estimation for cluster sampling with unequal sizes
Note, it is not difficult to find that
The variance of the ratio estimator depends on the variability of the means per element in the clusters
It can be much smaller than that of the unbiased estimator
The ratio estimator of the population total requires the total number of elements in the population, K. The unbiased estimator of the population total does not require K.
113
Two-stage cluster sampling: an unbiased estimator
114
Two-stage cluster sampling: a ratio estimator
where