Stat 804, 4/1/2017. Ch 5: Cluster sampling with equal probabilities. DEFN: A cluster is a group of observation units (or “elements”).
Cluster sample DEFN: A cluster sample is a probability sample in which a sampling unit is a cluster
Cluster sample – 2: 1-stage cluster sampling. Divide the population (of K elements) into N clusters (of size Mi for cluster i). A cluster is a group of elements, and an element belongs to one and only one cluster. Sampling unit: cluster = PSU = primary sampling unit. We’ll start by assuming an SRS of clusters (equal prob). Any design can be used to select clusters (STS, PPS); we’ll work with other designs in Ch 6. Data collection: collect information on ALL elements in each sampled cluster.
1-stage CS vs. STS: Sample of 40 elements
1-stage CS: A block of cells is a cluster. SU is a cluster. Don’t sample from every cluster.
STS: A block of cells is a stratum. SU is an element (or OU). Sample from every stratum.
Cluster vs. stratified sampling Cluster sample Divide K elements into N clusters Cluster or PSU i has Mi elements Take a sample of n clusters Stratified sampling N elements divided into H strata An element belongs to 1 and only 1 stratum Take a sample of n elements, consisting of nh elements from stratum h for each of the H strata
Cluster sample – 3 2-stage cluster sampling (later) Process Select PSUs (stage 1) Select elements within each sampled PSU (stage 2) First stage sampling unit is a … PSU = primary sampling unit = cluster Second stage sampling unit is a … SSU = secondary sampling unit = element = OU Only collect data on the SSUs that were sampled from the cluster
1-stage vs. 2-stage cluster sampling 1-stage cluster sample (stop here) OR Stage 1 of 2-stage cluster sample (select PSUs) Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs)
Why use cluster sampling? May not have a list of OUs for a frame, but a list of clusters may be available List of Lincoln phone numbers (= group of residents) is available, but a list of Lincoln residents is not available List of all NE primary and secondary schools (= group of students) is available, but a list of all students in NE schools is not available May be cheaper to conduct the study if OUs are clustered Occurs when cost of data collection increases with distance between elements Household surveys using in-person interviews (household = cluster of people) Field data collection (plot = cluster of plants, or animals)
Defining clusters due to frame limitations A cluster (or PSU) is a group of elements corresponding to a record (row) in the frame Example Population = employees in McDonald’s franchises Element = employee Frame = list of McDonald’s stores PSU = store = cluster of employees
Defining clusters to reduce travel costs A cluster (or PSU) is a group of nearby elements Example Population = all farms Element = farm Frame = list of sections (1 mi x 1 mi areas) in rural area PSU = section = cluster of farms
Cluster samples usually lead to less precise estimates: Elements within clusters tend to be correlated due to exposure to similar conditions (members of a household, employees in a business, plants or soil within a field plot). We get less information than if we had selected the same number of unrelated elements. Example: select a sample of city blocks (clusters of households) and ask each household: Should the city upgrade the storm sewer system? PSU (city block) 1 has no storm sewer, so households will tend to say yes. PSU (city block) 2 is a new development, so households will tend to say no.
Defining clusters for improved precision Define clusters for which within-cluster variation is high (rarely possible) Make each cluster as heterogeneous as possible Like making each cluster a mini-population that reflects variation in population Minimizes the amount of correlation among elements in the cluster Opposite of the approach to stratification Large variation among strata, homogeneous within strata Define clusters that are relatively small Extreme case is cluster = element Decreasing the number of correlated observations in the sample
Example for single-stage cluster sampling w/ equal prob (CSE1): Dorm has N = 100 suites (clusters). Each suite has Mi = 4 students (4 elements in cluster i, i = 1, 2, …, N). Note that there are K = 100 × 4 = 400 students in the dorm. Take an SRS of n = 5 suites (clusters). Ask each student living in each of the 5 suites: How many nights per week do you eat dinner in the dining hall? Will get observations from a sample of 20 students = 5 suites × 4 students/suite.
Dorm example – 2
Student   Suite 6   Suite 21   Suite 28   Suite 54   Suite 89
(student-level rows are incomplete in the source; surviving entries: 3 6 2 4)
Total     20        14         19         21         10
Dorm example – 3 SRS of n = 5 dorm rooms Data on each cluster (all students in dorm room) ti = total number of dining hall dinners for dorm room i t2 = 14 dining hall dinners for 4 students in dorm room 2 Estimated total number of dining hall nights for the dorm students HT estimator of total = pop size x sample mean (of cluster totals)
Notation: Indices: i = index for PSU i; j = index for SSU j in PSU i. Number of PSUs (clusters) in the population: N clusters. Number of SSUs (elements) in a PSU (cluster): Mi elements. Number of SSUs (elements) in the population: $K = \sum_{i=1}^{N} M_i$ (in Chapters 1-4, this was designated as N).
Notation – 2 (figure): N = 12 PSUs; K = 20 + 12 + … + 9 + 16 = 150 SSUs. Labeled PSUs: i = 1 (M1 = 20 SSUs), i = 2 (M2 = 12 SSUs), i = 3, i = 4, i = 5, i = 9, i = 11 (M11 = 9 SSUs), i = 12 (M12 = 16 SSUs). Labeled SSUs: i = 9, j = 1 and i = 9, j = 7.
Notation – 3 Response variable for SSU j in PSU i yij e.g., age of j-th resident in household i e.g., whether or not dorm resident j in room i owns a computer
Cluster-level population parameters (for cluster i): Cluster size = Mi elements. Cluster population total: $t_i = \sum_{j=1}^{M_i} y_{ij}$. Note that we observe the cluster population total (or mean or variance) for each sampled cluster in 1-stage cluster sampling. We will estimate cluster parameters in 2-stage cluster sampling.
Cluster-level population parameters (for cluster i) – 2: Cluster population mean: $\bar{y}_{iU} = t_i / M_i$. Within-cluster variance: $S_i^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{iU})^2$.
Figure: Population vs. 1-stage cluster sample
Cluster-level population parameters (for cluster i ) – 3 For 1-stage cluster samples Have a complete enumeration of the cluster elements Cluster population parameters are known For 2-stage cluster samples Observe data on a sample of elements in a cluster Estimate cluster population parameters
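Population parameters: Same parameters as in previous chapters, rewritten in notation for cluster sampling. Population size: $K = \sum_{i=1}^{N} M_i$ (K was referred to as N in previous chapters). Population total (sum of all cluster totals): $t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$.
Population parameters – 2: Population mean (of K elements): $\bar{y}_U = t / K$. Population variance (among K elements): $S^2 = \frac{1}{K - 1} \sum_{i=1}^{N} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_U)^2$. Variance among N cluster totals: $S_t^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (t_i - t/N)^2$.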
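The population parameters above can be checked on a toy population. The following is a minimal sketch in Python; the three clusters and their values are hypothetical and chosen only to keep the arithmetic easy, and the variances use the usual divisors K - 1 and N - 1.

```python
from statistics import variance

# Hypothetical population: N = 3 clusters of unequal size (illustrative values only).
population = {
    1: [2, 4, 6],
    2: [1, 3],
    3: [5, 5, 5, 5],
}

N = len(population)                                   # number of clusters (PSUs)
M = {i: len(ys) for i, ys in population.items()}      # cluster sizes M_i
K = sum(M.values())                                   # number of elements (SSUs)
t_i = {i: sum(ys) for i, ys in population.items()}    # cluster totals t_i
t = sum(t_i.values())                                 # population total
ybar_U = t / K                                        # population mean per element

all_y = [y for ys in population.values() for y in ys]
S2 = variance(all_y)                                  # variance among K elements (divisor K - 1)
S2_t = variance(list(t_i.values()))                   # variance among N cluster totals (divisor N - 1)

print(K, t, ybar_U, S2, S2_t)
```

Note how the variance among cluster totals is much larger than the element-level variance here, because the cluster sizes differ.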
Data from cluster samples Work with element and cluster-level data Element data set will have columns for Cluster id Element id within cluster Variable (y) Will also summarize this data set to generate cluster parameters (1-stage) or estimates of cluster parameters (2-stage) Cluster total (or estimate) Cluster mean (or estimate) Cluster variance (or estimate)
1-stage cluster sample
Element data:
i    j    yij
1    1    y11
1    2    y12
1    3    y13
1    4    y14
2    1    y21
2    2    y22
2    3    y23
3    1    y31
…
Cluster summary:
i    ti
1    t1
2    t2
3    t3
…
Estimation for CSE1: Chapter reading: Section 5.2.1 covers equal sized clusters (Mi constant, read). We’ll start with 5.2.3 (unequal sized clusters, Mi varies). Section 5.2.2 covers theory. Two types of estimators: unbiased (HT estimator) and ratio estimation. Equal probability sample of clusters: assume SRS of clusters.
CSE1 unbiased estimation under SRS – total t: Estimator for population total using data collected from a 1-stage cluster sample (SRS of clusters): $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i$. Estimator of variance of $\hat{t}_{unb}$: $\hat{V}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$, where $s_t^2 = \frac{1}{n - 1} \sum_{i \in S} (t_i - \hat{t}_{unb}/N)^2$.
Dorm example – 4 Estimated population total Estimated variance
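As a worked check of the dorm example, the sketch below applies "pop size x sample mean of cluster totals" to the five suite totals listed earlier (20, 14, 19, 21, 10). The variance line assumes the standard SRS formula applied to cluster totals, with a finite population correction; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

N, n = 100, 5                     # suites in the dorm, suites sampled
K = 400                           # students in the dorm (100 suites x 4 students)
totals = [20, 14, 19, 21, 10]     # t_i for the five sampled suites

t_hat = N * mean(totals)                      # HT estimator: N x sample mean of cluster totals
s2_t = variance(totals)                       # sample variance of cluster totals (divisor n - 1)
var_t_hat = N**2 * (1 - n / N) * s2_t / n     # assumed SRS variance formula with FPC
se_t_hat = sqrt(var_t_hat)

ybar_hat = t_hat / K                          # estimated dinners per student
se_ybar_hat = se_t_hat / K

print(t_hat, var_t_hat, se_t_hat, ybar_hat, se_ybar_hat)
```

Dividing the estimated total and its standard error by K = 400 gives the per-student mean and its standard error.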
CSE1 inclusion probability for an element: For two events A and B, Pr{A and B both occur} = Pr{A occurs} × Pr{B occurs given A occurs}. In our setting, A = sample cluster i and B = sample element j (in cluster i). Inclusion probability for element j in cluster i: $\pi_{ij}$ = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} × Pr{including element j given cluster i has been included in sample}.
CSE1 inclusion probability for an element – 2: Need two pieces: Pr{including cluster i in sample} = n / N, and Pr{including element j given cluster i has been included in sample} = 1. Inclusion probability: $\pi_{ij}$ = Pr{including element j and cluster i in sample} = Pr{including cluster i in sample} × Pr{including element j given cluster i has been included in sample} = (n / N) × 1 = n / N.
CSE1 weight for an element: Weight for element j in cluster i is the inverse of the element inclusion probability: $w_{ij} = 1/\pi_{ij} = N/n$. Estimator using weights: $\hat{t}_{unb} = \sum_{i \in S} \sum_{j=1}^{M_i} w_{ij} y_{ij}$.
Dorm example – 5: Inclusion probability for student j in dorm room i: N = 100 dorm rooms, n = 5 sample dorm rooms, and all 4 students are taken in each sampled dorm room, so $\pi_{ij}$ = n / N = 5/100 = 1/20 = 0.05. Weight for student j in dorm room i: $w_{ij}$ = N / n = 20, so each sampled student represents 20 students.
CSE1 unbiased estimation under SRS – mean: Unbiased estimator for population mean: for SRS, the estimator for the total divided by the number of population elements (OUs), $\hat{\bar{y}}_{unb} = \hat{t}_{unb} / K$, with $\hat{V}(\hat{\bar{y}}_{unb}) = \hat{V}(\hat{t}_{unb}) / K^2$. Units are y-units per element.
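A quick sanity check on the dorm weights, sketched in Python: with every sampled student carrying weight N / n = 20, the weights of the 20 sampled students should add up to the 400 students in the dorm, and the weighted sum of responses should reproduce N times the mean of the suite totals. The suite totals are those listed earlier in the dorm example; variable names are illustrative.

```python
N, n, M_i = 100, 5, 4             # suites, sampled suites, students per suite
totals = [20, 14, 19, 21, 10]     # suite totals t_i from the dorm example

pi_ij = n / N                     # Pr(suite sampled) x 1, since all students in a suite are taken
w_ij = N / n                      # sampling weight = inverse inclusion probability

sum_of_weights = w_ij * n * M_i   # 20 sampled students, each with weight 20
# Weights are constant, so sum of w_ij * y_ij equals w_ij times the sum of all responses.
t_hat_weighted = w_ij * sum(totals)

print(pi_ij, w_ij, sum_of_weights, t_hat_weighted)
```

The sum of weights estimating the population size is a useful check on any weighted data set.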
Dorm example – 6
Unbiased estimation – proportion p What is y ?
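Ratio estimation: Usually ti (cluster total) is correlated with Mi (cluster size): as Mi (# SSUs/elements in cluster i) increases, the value of ti (total of yij for cluster i) increases. Positive correlation between Mi and ti with no intercept gives perfect conditions for the SRS ratio estimator. Notation: yi (variable of interest) in Ch 3 corresponds to ti (cluster total) in Ch 5; xi (auxiliary info) in Ch 3 corresponds to Mi (cluster size) in Ch 5.
Ratio estimation for CSE1: Estimator for population mean: $\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i}$. Units are y-units per element.
Ratio estimation for CSE1 – 2: Estimator for variance of ratio estimator of population mean: $\hat{V}(\hat{\bar{y}}_r) = \left(1 - \frac{n}{N}\right) \frac{1}{n \bar{M}^2} \cdot \frac{\sum_{i \in S} (t_i - \hat{\bar{y}}_r M_i)^2}{n - 1}$, where $\bar{M}$ is the average cluster size for the population.
Ratio estimation for CSE1 – 3: Average cluster size: $\bar{M} = K / N$. If unknown, can estimate with the sample mean of cluster sizes, $\bar{M}_S = \frac{1}{n} \sum_{i \in S} M_i$.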
Dorm example – 7 Estimated population mean Average cluster size
Dorm example – 8 Estimated variance
Ratio estimation for CSE1 – 4: Estimator for population total: $\hat{t}_r = K \hat{\bar{y}}_r$.
Dorm example – 9 Estimated population total Estimated variance
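The ratio calculations for the dorm example can be sketched as follows. This assumes the usual ratio-estimator variance formula based on residuals t_i - (ratio estimate) x M_i, and uses the suite totals listed earlier in the dorm example; variable names are illustrative. Because every suite has Mi = 4, the ratio results coincide with the unbiased ones.

```python
from math import sqrt

N, n = 100, 5
K = 400                           # students in the dorm
totals = [20, 14, 19, 21, 10]     # t_i for the sampled suites
sizes = [4, 4, 4, 4, 4]           # M_i for the sampled suites

ybar_r = sum(totals) / sum(sizes)             # ratio estimator of the mean per student
M_bar = K / N                                 # population average cluster size
resid_ss = sum((t - ybar_r * m) ** 2 for t, m in zip(totals, sizes))
var_ybar_r = (1 - n / N) * resid_ss / ((n - 1) * n * M_bar**2)   # assumed variance formula

t_hat_r = K * ybar_r                          # ratio estimator of the total
var_t_hat_r = K**2 * var_ybar_r

print(ybar_r, sqrt(var_ybar_r), t_hat_r, sqrt(var_t_hat_r))
```

With unequal Mi the two approaches would generally differ, which is where the ratio estimator tends to gain precision.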
CSE1: impact of cluster size If cluster sizes Mi are variable across clusters, generally estimate population parameter with less precision If ti is related to Mi , then get large variation among cluster totals if Mi is variable Variance of population parameter estimator (unbiased or ratio) is a function of variation among cluster totals
2-stage equal probability cluster sampling (CSE2) CSE2 has 2 stages of sampling Stage 1. Select SRS of n PSUs from population of N PSUs Stage 2. Select SRS of mi SSUs from Mi elements in PSU i sampled in stage 1
2-stage cluster sampling Stage 1 of 2-stage cluster sample (select PSUs) Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs)
Motivation for 2-stage cluster samples Recall motivations for cluster sampling in general Only have access to a frame that lists clusters Reduce data collection costs by going to groups of nearby elements (cluster defined by proximity)
Motivation for 2-stage cluster samples – 2: Likely that elements in a cluster will be correlated. May be inefficient to observe all elements in a sample PSU: the extra effort required to fully enumerate a PSU does not generate that much extra information. May be better to spend resources to sample many PSUs and a small number of SSUs per PSU. Possible opposing force: study costs associated with going to many clusters.
CSE2 unbiased estimation for population total t: Have a sample of elements from a cluster, so we no longer know the value of the cluster parameter ti. Estimate ti using data observed for mi SSUs: $\hat{t}_i = M_i \bar{y}_i = \frac{M_i}{m_i} \sum_{j \in S_i} y_{ij}$.
CSE2 unbiased estimation for population total – 2: Approach is to plug estimated cluster totals into the CSE1 formula. CSE1: $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i$. CSE2: $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} \hat{t}_i$.
CSE2 unbiased estimation for population total – 3: The variance of $\hat{t}_{unb}$ has 2 components associated with the 2 sampling stages: (1) variation among PSUs and (2) variation among SSUs within PSUs. $V(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}$ (among PSUs) $+ \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}$ (within PSUs).
CSE2 unbiased estimation for population total – 4: In CSE1, we observe all elements in a cluster, so we know ti: have variance component 1, but no component 2. In CSE2, we sample a subset of elements in a cluster, so we estimate ti with $\hat{t}_i$: component 2 is a function of the estimated variance of $\hat{t}_i$.
CSE2 unbiased estimation for population total – 5: Estimated variance among cluster totals: $s_t^2 = \frac{1}{n - 1} \sum_{i \in S} (\hat{t}_i - \hat{t}_{unb}/N)^2$. Estimated variance among elements in a cluster: $s_i^2 = \frac{1}{m_i - 1} \sum_{j \in S_i} (y_{ij} - \bar{y}_i)^2$.
CSE2 unbiased estimation for population total – 6
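A minimal sketch of the two-stage unbiased estimator of the total and its estimated variance. It assumes the standard two-component form (an among-PSU term plus a within-PSU term, each with its own finite population correction); the function name and the three-cluster data set are hypothetical and serve only as an illustration.

```python
from statistics import mean, variance

def cse2_unbiased_total(N, clusters):
    """Estimate the population total from a 2-stage equal-probability cluster sample.

    N        : number of PSUs in the population
    clusters : list of (M_i, [y_ij for the m_i sampled SSUs]) for the n sampled PSUs
    """
    n = len(clusters)
    t_hats = [M * mean(ys) for M, ys in clusters]     # t_i-hat = M_i * ybar_i
    t_hat = N / n * sum(t_hats)

    s2_t = variance(t_hats)                           # among-PSU variance of estimated totals
    within = sum(
        (1 - len(ys) / M) * M**2 * variance(ys) / len(ys)   # within-PSU term for cluster i
        for M, ys in clusters
    )
    var_hat = N**2 * (1 - n / N) * s2_t / n + (N / n) * within
    return t_hat, var_hat

# Hypothetical sample: n = 3 PSUs from N = 10, with m_i = 2 SSUs observed in each.
sample = [(6, [3, 5]), (4, [2, 3]), (8, [4, 4])]
t_hat, var_hat = cse2_unbiased_total(10, sample)
print(t_hat, var_hat)
```

In this toy sample almost all of the estimated variance comes from the among-PSU term, which mirrors the point that sampling more PSUs usually matters more than sampling more SSUs per PSU.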
Dorm example – 10 Stage 2: select 2 students in each room Stu-dent 5 3 6 2 4 Total ?
Dorm example – 11 Stage 1 Stage 2 Cluster = N = n = SRS Element = Mi =
Dorm example – 12 1 5 3 4 2 6 Stu-dent (j) Rm 6 (i=1) Rm 21 (i=2)
Dorm example – 13
Dorm example – 14
CSE2 unbiased estimation for population mean
Dorm example – 15
CSE2 inclusion probability for an element: For two events A and B, Pr{A and B both occur} = Pr{A occurs} × Pr{B occurs | A occurs}, where “|” denotes “given” (a condition). In our setting, A = sample cluster i and B = sample element j. Inclusion probability symbols: $\pi_{ij}$ = Pr{including element j and cluster i in sample}; $\pi_i$ = Pr{including cluster i in sample}; $\pi_{j|i}$ = Pr{including element j | cluster i has been included in sample}.
CSE2 inclusion probability for an element – 2: Need two pieces: $\pi_i$ = Pr{including cluster i in sample} = n / N, and $\pi_{j|i}$ = Pr{including element j | cluster i has been included in sample} = mi / Mi. Inclusion probability for element j in cluster i: $\pi_{ij} = \pi_i \pi_{j|i} = \frac{n}{N} \cdot \frac{m_i}{M_i}$.
CSE2 weight for an element: Sampling weight for element j in cluster i: $w_{ij} = 1/\pi_{ij} = \frac{N}{n} \cdot \frac{M_i}{m_i}$. Estimator for population total: $\hat{t}_{unb} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}$.
What does equal probability mean in Ch 5? Clusters (PSUs) are sampled using SRS: equal inclusion probability for stage 1 PSUs (clusters), i.e., $\pi_i$ is the same for all i.
What does equal probability mean in Ch 5? – 2: Elements (SSUs) in a given PSU are sampled using SRS: all elements (j) in a sample PSU (i) are selected with equal probability. This is a conditional probability (given PSU i): for a given PSU i, $\pi_{j|i}$ is the same for all elements j.
What does equal probability mean in Ch 5? – 3: Note that equal probability at stage 1 ($\pi_i$) plus equal probability at stage 2 given PSU i ($\pi_{j|i}$) does NOT imply equal inclusion probability for an element. In fact, the element-level (unconditional) inclusion probability $\pi_{ij}$ is not necessarily constant: it depends on cluster size Mi and sample size mi for the cluster to which the element belongs.
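To see why equal probabilities at each stage do not give equal element-level probabilities, the sketch below multiplies the stage 1 and stage 2 probabilities for two clusters of different size. The design values (n = 5 of N = 100 PSUs, mi = 2 SSUs per PSU, cluster sizes 4 and 10) are hypothetical.

```python
def element_inclusion_prob(n, N, m_i, M_i):
    """Unconditional inclusion probability for an element: (n / N) * (m_i / M_i)."""
    return (n / N) * (m_i / M_i)

# Same stage 1 probability and same m_i, but different cluster sizes M_i.
pi_small = element_inclusion_prob(5, 100, 2, 4)     # element in a cluster of 4
pi_large = element_inclusion_prob(5, 100, 2, 10)    # element in a cluster of 10

w_small, w_large = 1 / pi_small, 1 / pi_large       # corresponding sampling weights
print(pi_small, pi_large, w_small, w_large)
```

An element in the larger cluster is less likely to be sampled, so it carries a larger weight.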
CSE2 ratio estimation for population mean
CSE2 ratio estimation for population mean – 2
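For reference, the two-stage ratio estimator of the population mean and its estimated variance are commonly written as below. This is a sketch of the standard textbook form, assuming SRS at both stages and the notation used elsewhere in this chapter ($\hat{t}_i = M_i \bar{y}_i$, $s_i^2$ the sample variance within PSU i, $\bar{M}$ the average cluster size); it may differ in presentation from the formulas on the original slides.

```latex
\hat{\bar{y}}_r
  = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i}
  = \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}

\hat{V}(\hat{\bar{y}}_r)
  = \frac{1}{\bar{M}^2}
    \left[
      \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n}
      + \frac{1}{nN} \sum_{i \in S} M_i^2 \left(1 - \frac{m_i}{M_i}\right) \frac{s_i^2}{m_i}
    \right],
\qquad
s_r^2 = \frac{1}{n - 1} \sum_{i \in S} \left(M_i \bar{y}_i - M_i \hat{\bar{y}}_r\right)^2
```

When N is large, the second term inside the brackets is divided by nN and is typically small relative to the first.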
Dorm example – 16 Stu-dent (j) Rm 6 (i=1) Rm 21 (i=2) Rm 28 (i=3) Rm 54 (i=4) Rm 89 (i=5) 1 5 3 4 2 6 5.5 2.5 4.5 3.0 22 10 18 12 0.5 2.0
Dorm example – 16
Dorm example – 17
CSE2 ratio estimation for population total t
Dorm example – 18
Coots egg example Target pop = American coot eggs in Minnedosa, Manitoba PSU / cluster = clutch (nest) SSU / element = egg w/in clutch Stage 1 SRS of n = 184 clutches N = ??? Clutches, but probably pretty large Stage 2 SRS of mi = 2 from Mi eggs in a clutch Do not know K = ??? eggs in population, also large Can count Mi = # eggs in sampled clutch i Measurement yij = volume of egg j from clutch i
Coots egg example – 2 Scatter plot of volumes vs. i (clutch id) Double dot pattern - high correlation among eggs WITHIN a clutch Quite a bit of clutch to clutch variation Implies May not have very high precision unless sample a large number of clutches Certainly lower precision than if obtained a SRS of eggs Could use a side-by-side plot for data with larger cluster sizes – PROC UNIVARIATE w/ BY CLUSTER and PLOTS option
Coots egg example – 3: Plot: Rank the mean egg volume for clutch i, $\bar{y}_i$. Plot yij vs. rank for clutch i (i sorted by $\bar{y}_i$). Draw a line between yi1 and yi2 to show how close the 2 egg volumes in a clutch are. Observations: Same results as Fig 5.3, but more clear. Small within-cluster variation. Large between-cluster variation. Also see 1 clutch with large WITHIN-clutch variation: check data (i = 88).
Coots egg example – 4: Plot si vs. $\bar{y}_i$ for clutch i. Since volumes are always positive, might expect si to increase as $\bar{y}_i$ gets larger: if $\bar{y}_i$ is very small, yi1 and yi2 are likely to be very small and close, giving a small si. See this to a moderate degree. Clutch 88 has large si, as noted in the previous plot.
Coots egg example – 5: Estimation goal: estimate $\bar{y}_U$, the population mean volume per coot egg in Minnedosa, Manitoba. What estimator? Unbiased estimation: don’t know N = total number of clutches or K = total number of eggs in Minnedosa, Manitoba. Ratio estimation: only requires knowledge of Mi, the number of eggs in selected clutch i, in addition to the data collected. May want to plot $\hat{t}_i$ versus Mi.
Coots egg example – 6
Coots egg example – 7: Don’t know $\bar{M}$: use $\bar{M}_S$. Don’t know N, but assumed large: FPC ≈ 1. 2nd term is very small, so the approximate SE ignores the 2nd term.
Coots egg example – 8 What is first-stage PSU inclusion probability? What is conditional SSU inclusion probability at second stage? What is unconditional SSU inclusion probability?
CSE2: Unbiased vs. ratio estimation: The unbiased estimator can have poor precision if cluster sizes (Mi) are unequal and ti (cluster total) is roughly proportional to Mi (cluster size). The biased (ratio) estimator can be precise if ti is roughly proportional to Mi. This happens frequently in populations where cluster sizes (Mi) vary.
CSE2: Self-weighting design: Stage 1: Select n PSUs from N PSUs in pop using SRS. Inclusion probability for PSU i: $\pi_i = n/N$. Stage 2: Choose mi proportional to Mi so that mi / Mi = c is constant, and use SRS to select the sample. Inclusion probability for SSU j given PSU i: $\pi_{j|i} = m_i / M_i = c$. Unconditional inclusion probability for SSU j in cluster i is constant for all elements: $\pi_{ij} = (n/N) c$. Inclusion probability may vary in practice because it may not be possible for mi / Mi to equal c for all clusters.
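The practical caveat about mi / Mi can be illustrated with a short sketch. The sampling fraction c = 0.25, the cluster sizes, and the round-half-up rule are all hypothetical choices; the point is only that mi must be an integer, so the achieved fraction can drift from c.

```python
N, n = 100, 5
c = 0.25                              # hypothetical target within-PSU sampling fraction
cluster_sizes = [8, 12, 20, 10]       # hypothetical M_i for some sampled PSUs

# m_i must be a whole number of SSUs; round half up and take at least one SSU.
subsample_sizes = [max(1, int(c * M_i + 0.5)) for M_i in cluster_sizes]

# Unconditional inclusion probability (n / N) * (m_i / M_i) for elements of each PSU.
probs = [(n / N) * (m_i / M_i) for m_i, M_i in zip(subsample_sizes, cluster_sizes)]

print(subsample_sizes, probs)
```

For the cluster of size 10, c × Mi = 2.5 is not an integer, so its elements end up with a slightly different inclusion probability than elements of the other clusters.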
Self-weighting designs in general Why are self-weighting samples appealing? Are dorm student or coot egg samples self-weighting 2-stage cluster samples? What other (non-cluster) self-weighting designs have we discussed?
Self-weighting designs in general – 2: What is the caveat for variance estimation in self-weighting samples? No break on variance of estimator: must use the proper formula for the design. Why are self-weighting samples appealing? Simple mean estimator. Homogeneous weights tend to make estimates more precise.
Return to systematic sampling (SYS) Have a frame, or list of N elements Determine sampling interval, k k is the next integer after N/n Select first element in the list Choose a random number, R , between 1 & k R-th element is the first element to be included in the sample Select every k-th element after the R-th element Sample includes element R, element R + k, element R + 2k, … , element R + (n-1)k
SYS example: Telephone survey of members in an organization about the organization’s website use. N = 500 members. Have resources to do n = 75 calls. N / n = 500/75 = 6.67, so k = 7. Random number table entry: 52994. Rule: if pick 1, 2, …, 7, assign as R; otherwise discard the digit. Select R = 5. Take element 5, then element 5 + 7 = 12, then element 12 + 7 = 19, 26, 33, 40, 47, …
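The selection steps for this example can be sketched in Python. The function name is illustrative, and the sketch assumes k is obtained by rounding N / n up to the next integer. Because k is rounded up, the realized sample has 71 or 72 members depending on R, rather than the planned 75.

```python
import math

def systematic_sample(N, n, R):
    """Return (k, sample) for a systematic sample from a list of N elements.

    k is N / n rounded up; R is the random start, an integer between 1 and k.
    """
    k = math.ceil(N / n)
    if not 1 <= R <= k:
        raise ValueError("R must be between 1 and k")
    # Elements R, R + k, R + 2k, ... up to the end of the list (1-based positions).
    return k, list(range(R, N + 1, k))

k, sample = systematic_sample(500, 75, 5)
print(k, sample[:7], len(sample))

# Sample size for each of the k possible random starts.
sizes = {R: len(systematic_sample(500, 75, R)[1]) for R in range(1, k + 1)}
print(sizes)
```

Only k = 7 distinct samples are possible under this design, one for each random start.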
SYS – 2: Arrange population in rows of length k = 7
R:        1     2     3     4     5     6     7
i = 1     1     2     3     4     5     6     7
i = 2     8     9     10    11    12    13    14
i = 3     15    16    17    18    19    20    21
i = 4     22    23    24    25    26    27    28
…
i = 71    491   492   493   494   495   496   497
i = 72    498   499   500
Many samples have no chance of being selected
Relationship between SYS and cluster sampling: Design relationships: Element = ? Cluster = ? Sampling unit(s) = ? Cluster sampling design = ? Relationship between frame ordering and expected precision of an estimate from a cluster sample? Periodic, where the cycle of the pattern is coincident with sampling interval k. Ordered by X, which is correlated with response variable Y. Random.
SYS – 3: Suppose X [age of member] is correlated with Y [use of org website]. Sort list by X before selecting sample
k:        1     2     3     4     5     6     7       X
i = 1     1     2     3     4     5     6     7       young
i = 2     8     9     10    11    12    13    14
i = 3     15    16    17    18    19    20    21
i = 4     22    23    24    25    26    27    28
…                                                     mid
i = 71    491   492   493   494   495   496   497
i = 72    498   499   500                             old
Many samples have no chance of being selected