Ch 5: Equal probability cluster samples (Stat 804, 4/19/2017)
Cluster sampling
DEFN: A cluster is a group of observation units (or “elements”)
Cluster sample DEFN: A cluster sample is a probability sample in which a sampling unit is a cluster
Cluster sample – 2
1-stage cluster sampling
Divide the population (of N elements) into NI clusters (of size Ni for cluster i)
  Cluster = group of elements
  An element belongs to 1 and only 1 cluster
Sampling unit
  Cluster = group of elements = PSU = primary sampling unit
  Can use any design to select clusters (ST, PPS)
Data collection
  Collect information on ALL elements in the cluster
1-stage cluster sample (CS) vs. stratified sample (ST): sample of 40 elements
1-stage CS
  A block of cells is a cluster
  SU is a cluster
  Don’t sample from every cluster
ST
  A block of cells is a stratum
  SU is an element (or OU)
  Sample from every stratum
Cluster vs. stratified sampling
Cluster sample
  Divide N elements into NI clusters
  Cluster or PSU i has Ni elements
  Take a sample of nI clusters
Stratified sampling
  N elements divided into H strata
  An element belongs to 1 and only 1 stratum
  Take a sample of n elements, consisting of nh elements from stratum h for each of the H strata
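The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 20 groups with 20 elements each, the element labels, and the function names are all hypothetical, and SRS is assumed both for picking clusters and for picking elements within strata.

```python
import random

def stratified_sample(groups, n_per_stratum, rng):
    """Treat every group as a stratum: sample elements from each one."""
    return {g: rng.sample(elems, n_per_stratum) for g, elems in groups.items()}

def one_stage_cluster_sample(groups, n_clusters, rng):
    """Treat every group as a cluster: sample groups, keep ALL their elements."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {g: list(groups[g]) for g in chosen}

# Hypothetical population: 20 groups of 20 elements each (400 elements).
groups = {g: [f"g{g}-e{e}" for e in range(1, 21)] for g in range(1, 21)}
rng = random.Random(0)

st = stratified_sample(groups, n_per_stratum=2, rng=rng)
cs = one_stage_cluster_sample(groups, n_clusters=2, rng=rng)

# (groups represented in the sample, elements in the sample)
print(len(st), sum(len(v) for v in st.values()))  # 20 40
print(len(cs), sum(len(v) for v in cs.values()))  # 2 40
```

Both designs yield 40 elements here, but the stratified sample touches every group while the cluster sample touches only the selected ones.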
Cluster sample – 3
2-stage cluster sampling
Process
  Select PSUs (stage 1)
  Select elements within each sampled PSU (stage 2)
First stage sampling unit is a …
  PSU = primary sampling unit = cluster
Second stage sampling unit is a …
  SSU = secondary sampling unit = element = OU
Only collect data on the SSUs that were sampled from the cluster
1-stage vs. 2-stage cluster sampling
1-stage cluster sample (stop here) OR Stage 1 of 2-stage cluster sample (select PSUs)
Stage 2 of 2-stage cluster sample (select SSUs within PSUs)
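A minimal sketch of the two selection processes, assuming SRS at both stages. The frame (10 PSUs with 6 SSUs each), the labels, and the function name `cluster_sample` are hypothetical and chosen only for illustration.

```python
import random

def cluster_sample(frame, n_psus, n_ssus=None, seed=None):
    """Select PSUs by SRS; optionally subsample SSUs within each sampled PSU.

    frame  : dict mapping PSU id -> list of SSU ids (hypothetical structure)
    n_ssus : None for a 1-stage sample (take all SSUs), else SSUs per PSU
    """
    rng = random.Random(seed)
    sampled_psus = rng.sample(sorted(frame), n_psus)  # stage 1
    sample = {}
    for psu in sampled_psus:
        elements = frame[psu]
        if n_ssus is None:
            sample[psu] = list(elements)  # 1-stage: stop here
        else:
            # stage 2: SRS of SSUs within the sampled PSU
            sample[psu] = rng.sample(elements, min(n_ssus, len(elements)))
    return sample

# Hypothetical frame: 10 PSUs, 6 SSUs in each.
frame = {f"psu{i}": [f"psu{i}-ssu{j}" for j in range(1, 7)] for i in range(1, 11)}

one_stage = cluster_sample(frame, n_psus=3, seed=1)
two_stage = cluster_sample(frame, n_psus=3, n_ssus=2, seed=1)

print({psu: len(ssus) for psu, ssus in one_stage.items()})
print({psu: len(ssus) for psu, ssus in two_stage.items()})
```

With the same seed, stage 1 picks the same PSUs in both calls, so the only difference is whether all SSUs or a subsample are kept.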
Why use cluster sampling?
May not have a list of OUs for a frame, but a list of clusters may be available
  List of Lincoln phone numbers (phone number = group of residents) is available, but a list of Lincoln residents is not available
  List of all NE primary and secondary schools (school = group of students) is available, but a list of all students in NE schools is not available
May be cheaper to conduct the study if OUs are clustered
  Occurs when cost of data collection increases with distance between elements
  Household surveys using in-person interviews (household = cluster of people)
  Field data collection (plot = cluster of plants or animals)
Defining clusters due to frame limitations
A cluster (or PSU) is a group of elements corresponding to a record (row) in the frame
Example
  Population = employees in McDonald’s franchises
  Element = employee
  Frame = list of McDonald’s stores
  PSU = store = cluster of employees
Defining clusters to reduce travel costs
A cluster (or PSU) is a group of nearby elements
Example
  Population = all farms
  Element = farm
  Frame = list of sections (1 mi x 1 mi areas) in rural area
  PSU = section = cluster of farms
Cluster samples usually lead to less precise estimates
Elements within clusters tend to be correlated due to exposure to similar conditions
  Members of a household
  Employees in a business
  Plants or soil within a field plot
We get less information than if we had selected the same number of unrelated elements
Example
  Select sample of city blocks (clusters of households)
  Ask each household: Should the city upgrade the storm sewer system?
  PSU (city block) 1: no storm sewer, so households will tend to say yes
  PSU (city block) 2: new development, so households will tend to say no
Defining clusters for improved precision
Define clusters for which within-cluster variation is high (rarely possible)
  Make each cluster as heterogeneous as possible
  Like making each cluster a mini-population that reflects variation in population
  Minimizes the amount of correlation among elements in the cluster
  Opposite of the approach to stratification
    Large variation among strata, homogeneous within strata
Define clusters that are relatively small
  Extreme case is cluster = element
  Decreases the number of correlated observations in the sample
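The effect of within-cluster homogeneity can be checked on a toy population. The sketch below is an assumption-laden illustration, not part of the course material: the eight element values and the two ways of grouping them are hypothetical, and the variances come from the standard formula for the estimated total under SRS, N^2 (1 - n/N) S^2 / n, applied to elements and to cluster totals respectively.

```python
def variance(values):
    """Variance with divisor (count - 1), the usual survey-sampling S^2."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def var_total_srs(elements, n):
    """Exact variance of the estimated total under an SRS of n elements."""
    N = len(elements)
    return N ** 2 * (1 - n / N) * variance(elements) / n

def var_total_cluster(clusters, n_clusters):
    """Exact variance of the estimated total under an SRS of whole clusters."""
    N_I = len(clusters)
    totals = [sum(c) for c in clusters]
    return N_I ** 2 * (1 - n_clusters / N_I) * variance(totals) / n_clusters

# Same eight hypothetical element values, grouped two different ways.
homogeneous = [[1, 1], [2, 2], [3, 3], [4, 4]]    # similar elements together
heterogeneous = [[1, 4], [1, 4], [2, 3], [2, 3]]  # each cluster a mini-population
elements = [y for c in homogeneous for y in c]

# Every design below observes 4 elements.
v_srs = var_total_srs(elements, n=4)
v_homog = var_total_cluster(homogeneous, n_clusters=2)
v_heter = var_total_cluster(heterogeneous, n_clusters=2)
print(v_srs, v_homog, v_heter)
```

In this toy case the homogeneous clusters give a larger variance than an SRS of the same number of elements (roughly 26.7 vs. 11.4), while the heterogeneous clusters give zero variance because every cluster total is the same.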
Example for single-stage cluster sampling w/ equal prob (CSE1)
Dorm has NI = 100 suites (clusters)
Each suite has Ni = 4 students (4 elements in cluster i, i = 1, 2, …, NI)
  Note that there are N = NI x Ni = 100 x 4 = 400 students in the dorm
Take SRS of nI = 5 suites (clusters)
Ask each student living in each of the 5 suites
  How many nights per week do you eat dinner in the dining hall?
Will get observations from a sample of 20 students = 5 suites x 4 students/suite
Dorm example – 2
Student   Suite 6   Suite 21   Suite 28   Suite 54   Suite 89
…
Total     20        14         19         21         10
Individual responses (cell positions not recoverable): 3 6 2 4
Dorm example – 3
SRS of nI = 5 dorm rooms (suites)
Data on each cluster (all students in dorm room)
  ti = total number of dining hall dinners for dorm room i
  t2 = 14 dining hall dinners for the 4 students in sampled dorm room 2
Estimated total number of dining hall nights for the dorm students
  HT estimator of total = number of clusters in the population (NI) x sample mean of cluster totals
Notation
yij = response variable for SSU j in PSU i
  e.g., age of j-th resident in household i
  e.g., whether or not dorm resident j in room i owns a computer
Cluster-level population parameters (for cluster i)
Cluster size = Ni elements
Cluster population total: $t_i = \sum_{j=1}^{N_i} y_{ij}$
Note that we observe the cluster population total (or mean or variance) for each sample cluster in 1-stage cluster sampling
We will estimate cluster parameters in 2-stage cluster sampling
[Figure: population and 1-stage cluster sample]
Data from cluster samples
Work with element and cluster-level data
Element data set will have columns for
  Cluster id
  Element id within cluster
  Variable (y)
Will also summarize this data set to generate cluster parameters (1-stage) or estimates of cluster parameters (2-stage)
  Cluster total (or estimate)
  Cluster mean (or estimate)
  Cluster variance (or estimate)
1-stage cluster sample
Element data
i   j   yij
1   1   y11
1   2   y12
1   3   y13
1   4   y14
2   1   y21
2   2   y22
2   3   y23
3   1   y31
…
Cluster summary
i   ti
1   t1
2   t2
3   t3
…
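Going from the element data set to the cluster summary is a group-by on the cluster id. The sketch below uses only the Python standard library; the y values are hypothetical placeholders (the slides give none), and the variance uses divisor (size - 1), which is an assumption about the convention intended.

```python
from collections import defaultdict
from statistics import mean, variance

# Element data: (cluster id i, element id j, response y_ij).
# The y values are hypothetical, chosen only to make the example run.
rows = [
    (1, 1, 3), (1, 2, 5), (1, 3, 4), (1, 4, 6),
    (2, 1, 2), (2, 2, 7), (2, 3, 3),
    (3, 1, 5), (3, 2, 1),
]

# Group responses by cluster id.
by_cluster = defaultdict(list)
for i, j, y in rows:
    by_cluster[i].append(y)

# One summary record per cluster: size, total, mean, variance.
summary = {
    i: {
        "size": len(ys),
        "total": sum(ys),
        "mean": mean(ys),
        "variance": variance(ys),  # divisor is (size - 1)
    }
    for i, ys in by_cluster.items()
}

for i, s in summary.items():
    print(i, s)
```

In a 1-stage sample these summaries are the cluster parameters themselves, since every element of a sampled cluster is observed; in a 2-stage sample the same code would produce estimates computed from the subsampled elements.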
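CSE1 unbiased estimation under SI – total t
Estimator for population total using data collected from a 1-stage cluster sample
  SI of clusters (S = set of sampled clusters)
  $\hat{t} = \frac{N_I}{n_I} \sum_{i \in S} t_i$
Estimator of variance of $\hat{t}$
  $\hat{V}(\hat{t}) = N_I^2 \left(1 - \frac{n_I}{N_I}\right) \frac{s_t^2}{n_I}$, where $s_t^2 = \frac{1}{n_I - 1} \sum_{i \in S} \left(t_i - \frac{\hat{t}}{N_I}\right)^2$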
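Dorm example – 4
Estimated population total
  $\hat{t} = \frac{N_I}{n_I} \sum_{i \in S} t_i = \frac{100}{5}(20 + 14 + 19 + 21 + 10) = 20 \times 84 = 1680$ dining hall dinners per week
Estimated variance
  $s_t^2 = \frac{1}{n_I - 1} \sum_{i \in S} \left(t_i - \frac{\hat{t}}{N_I}\right)^2 = \frac{86.8}{4} = 21.7$, with $\hat{t}/N_I = 84/5 = 16.8$
  $\hat{V}(\hat{t}) = N_I^2 \left(1 - \frac{n_I}{N_I}\right) \frac{s_t^2}{n_I} = 100^2 (0.95) \frac{21.7}{5} = 41{,}230$, so $SE(\hat{t}) \approx 203$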
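The dorm calculation can be reproduced with a short script. This is a sketch assuming an SRS of clusters without replacement and the usual finite population correction; the function name `cluster_total_estimate` is hypothetical, and the inputs are the five suite totals from the dorm example.

```python
import math

def cluster_total_estimate(cluster_totals, n_pop_clusters):
    """Estimate the population total from a 1-stage SRS of clusters.

    cluster_totals : totals t_i for the sampled clusters
    n_pop_clusters : N_I, number of clusters in the population
    Returns (t_hat, estimated variance of t_hat, standard error).
    """
    n = len(cluster_totals)                      # n_I
    mean_t = sum(cluster_totals) / n             # sample mean of cluster totals
    t_hat = n_pop_clusters * mean_t              # N_I x mean of cluster totals
    s2_t = sum((t - mean_t) ** 2 for t in cluster_totals) / (n - 1)
    fpc = 1 - n / n_pop_clusters                 # finite population correction
    v_hat = n_pop_clusters ** 2 * fpc * s2_t / n
    return t_hat, v_hat, math.sqrt(v_hat)

# Suite totals from the dorm example; N_I = 100 suites.
totals = [20, 14, 19, 21, 10]
t_hat, v_hat, se = cluster_total_estimate(totals, n_pop_clusters=100)
print(round(t_hat, 2), round(v_hat, 2), round(se, 2))
```

The results agree with the hand calculation up to floating-point rounding: a total of about 1680 dinners, an estimated variance of about 41,230, and a standard error of about 203.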
Dorm example – 5
Inclusion probability for student j in dorm room i
  NI = 100 dorm rooms
  nI = 5 sample dorm rooms
  Take all 4 students in each sampled dorm room
  πij = nI / NI = 5/100 = 1/20 = 0.05
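Because every student in a sampled dorm room is observed, a student's inclusion probability equals the room's. The sketch below turns that probability into a sampling weight (1 / πij) and checks that the weighted sum matches the estimated total; the variable names are hypothetical, and the room totals are those from the dorm example.

```python
N_I = 100          # dorm rooms in the population
n_I = 5            # dorm rooms in the sample
students_per_room = 4

# All students in a sampled room are taken, so pi_ij = n_I / N_I.
pi = n_I / N_I
weight = 1 / pi    # each sampled student represents this many students

# Weights over the 20 sampled students should add up to the population size.
n_sampled_students = n_I * students_per_room
sum_of_weights = weight * n_sampled_students

# Weighted sum of responses; room totals already add up the y_ij per room.
room_totals = [20, 14, 19, 21, 10]
t_hat = weight * sum(room_totals)

print(pi, weight, sum_of_weights, t_hat)
```

Each sampled student carries a weight of 20, the weights add up to the 400 students in the dorm, and the weighted sum reproduces the estimated total of 1680.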