
1 1 STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography Surveys for households and individuals

2 2 Survey sampling: 4 major topics 1.Traditional design-based statistical inference 7 weeks 2.Likelihood considerations 1 week 3.Model-based statistical inference 3 weeks 4.Missing data - nonresponse 2 weeks

3 3 Statistical demography Mortality Life expectancy Population projections 2 weeks

4 4 Course goals Give students knowledge about: –planning surveys in social sciences –major sampling designs –basic concepts and the most important estimation methods in traditional applied survey sampling –Likelihood principle and its consequences for survey sampling –Use of modeling in sampling –Treatment of nonresponse –A basic knowledge of demography

5 5 But first: Basic concepts in sampling Population (Target population): The universe of all units of interest for a certain study Denoted, with N being the size of the population: U = {1, 2,...., N} All units can be identified and labeled Ex: Political poll – All adults eligible to vote Ex: Employment/Unemployment in Norway– All persons in Norway, age 15 or more Ex: Consumer expenditure : Unit = household Sample: A subset of the population, to be observed. The sample should be ”representative” of the population

6 6 Sampling design: The sample is a probability sample if all units in the sample have been chosen with certain probabilities, and such that each unit in the population has a positive probability of being chosen to the sample We shall only be concerned with probability sampling Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen to the sample. The probability distribution for SRS on all subsets of U is an example of a sampling design: The probability plan for selecting a sample s from the population:
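As a minimal illustration (not from the slides), an SRS and the corresponding sample mean can be drawn in R with sample(); the population vector y below is made up purely for the sketch:

N=5000
n=100
y=rgamma(N,shape=2,scale=10)   # stand-in population values, illustration only
s=sample(1:N,n)                # SRS without replacement: every subset of size n equally likely
ybar=mean(y[s])                # each unit has inclusion probability n/N
ybar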

7 7 Basic statistical problem: Estimation A typical survey has many variables of interest Aim of a sample is to obtain information regarding totals or averages of these variables for the whole population Examples : Unemployment in Norway– Want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway:

8 8 In general, variable of interest: y with y i equal to the value of y for unit i in the population, and the total is denoted The typical problem is to estimate t or t/N Sometimes, of interest also to estimate ratios of totals: Example- estimating the rate of unemployment: Unemployment rate:

9 9 Sources of error in sample surveys 1.Target population U vs Frame population U F Access to the population is thru a list of units – a register U F. U and U F may not be the same: Three possible errors in U F : –Undercoverage: Some units in U are not in U F –Overcoverage: Some units in U F are not in U –Duplicate listings: A unit in U is listed more than once in U F U F is sometimes called the sampling frame

10 10 2.Nonresponse - missing data Some persons cannot be contacted Some refuse to participate in the survey Some may be ill and incapable of responding In postal surveys: Can be as much as 70% nonresponse In telephone surveys: 50% nonresponse is not uncommon Possible consequences: –Bias in the sample, not representative of the population –Estimation becomes more inaccurate Remedies: –imputation, weighting

11 11 3.Measurement error – the correct value of y i is not measured –In interviewer surveys: incorrect marking; interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use; misunderstanding of the question; do not remember correctly.

12 12 4.Sampling «error» –The error (uncertainty, tolerance) caused by observing a sample instead of the whole population –To assess this error- margin of error: measure sample to sample variation –Design approach deals with calculating sampling errors for different sampling designs –One such measure: 95% confidence interval: If we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t

13 13 The first 3 errors: nonsampling errors –Can be much larger than the sampling error In this course: –Sampling error –nonresponse bias –Shall assume that the frame population is identical to the target population –No measurement error

14 14 Summary of basic concepts Population, target population unit sample sampling design estimation –estimator –measure of bias –measure of variance –confidence interval

15 15 survey errors: –register/frame population –measurement error –nonresponse –sampling error

16 16 Example – Psychiatric Morbidity Survey 1993 from Great Britain Aim: Provide information about prevalence of psychiatric problems among adults in GB as well as their associated social disabilities and use of services Target population: Adults aged 16-64 living in private households Sample: Thru several stages: 18,000 addresses were chosen and 1 adult in each household was chosen 200 interviewers, each visiting 90 households

17 17 Result of the sampling process
Sample of addresses: 18,000
Vacant premises: 927
Institutions/business premises: 573
Demolished: 499
Second home/holiday flat: 236
Private household addresses: 15,765
Extra households found: 669
Total private households: 16,434
Households with no one 16-64: 3,704
Eligible households: 12,730
Nonresponse: 2,622
Sample: 10,108 households with responding adults aged 16-64

18 18 Why sampling? reduces costs for acceptable level of accuracy (money, manpower, processing time...) may free up resources to reduce nonsampling error and collect more information from each person in the sample –ex: 400 interviewers at $5 per interview: lower sampling error; 200 interviewers at $10 per interview: lower nonsampling error much quicker results

19 19 When is a sample representative? Balance on gender and age: –proportion of women in sample ≈ proportion in population –proportions of age groups in sample ≈ proportions in population An ideal representative sample: –A miniature version of the population: –implying that every unit in the sample represents the characteristics of a known number of units in the population Appropriate probability sampling ensures a representative sample ”on the average”

20 20 Alternative approaches for statistical inference based on survey sampling Design-based: –No modeling, only stochastic element is the sample s with known distribution Model-based: The values y i are assumed to be values of random variables Y i : –Two stochastic elements: Y = (Y 1, …,Y N ) and s –Assumes a parametric distribution for Y –Example : suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Y i on x i.

21 21 Statistical principles of inference imply that the model-based approach is the most sound and valid approach Start with learning the design-based approach since it is the most applied approach to survey sampling used by national statistical institutes and most research institutes for social sciences. –Is the easy way out: Do not need to model. All statisticians working with survey sampling in practice need to know this approach

22 22 Design-based statistical inference Can also be viewed as a distribution-free nonparametric approach The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U={1,..., N} No explicit statistical modeling is done for the variable y. All y i ’s are considered fixed but unknown Focus on sampling error Sets the sample survey theory apart from usual statistical analysis The traditional approach, started by Neyman in 1934

23 23 Estimation theory-simple random sample Estimation of the population mean of a variable y: A natural estimator - the sample mean: Desirable properties: SRS of size n: Each sample s of size n has Can be performed in principle by drawing one unit at time at random without replacement

24 24 The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE): Some results for SRS:

25 25 usually unimportant in social surveys: n =10,000 and N = 5,000,000: 1- f = 0.998 n =1000 and N = 400,000: 1- f = 0.9975 n =1000 and N = 5,000,000: 1-f = 0.9998 effect of changing n much more important than effect of changing n/N

26 26 The estimated variance Usually we report the standard error of the estimate: Confidence intervals for m is based on the Central Limit Theorem:

27 Example – Student performance in California schools Academic Performance Index (API) for all California schools Based on standardized testing of students Data from all schools with at least 100 students Unit in population = school (Elementary/Middle/High) Full population consists of N = 6194 observations Concentrate on the variable: y = api00 = API in 2000 Mean(y) = 664.7 with min(y) =346 and max(y) =969 Data set in R: apipop and y= apipop$api00 27

28 Histogram of y population with fitted normal density 28

29 Histogram for sample mean and fitted normal density. y = api scores from 2000. Sample size n = 10, based on 10000 simulations 29
R-code:
>b=10000
>N=6194
>n=10
>ybar=numeric(b)
>for (k in 1:b){
+s=sample(1:N,n)
+ybar[k]=mean(y[s])
+}
>hist(ybar,seq(min(ybar)-5,max(ybar)+5,5),prob=TRUE)
>x=seq(mean(ybar)-4*sqrt(var(ybar)),mean(ybar)+4*sqrt(var(ybar)),0.05)
>z=dnorm(x,mean(ybar),sqrt(var(ybar)))
>lines(x,z)

30 Histogram and fitted normal density api scores. Sample size n =10, based on 10000 simulations 30

31 31 y = api00 for 6194 California schools. 10000 simulations of SRS. Confidence level of the approximate 95% CI:
n      Conf. level
10     0.915
30     0.940
50     0.943
100    0.947
1000   0.949
2000   0.951

32 32 For one sample of size n = 100
R-code:
>s=sample(1:6194,100)
>ybar=mean(y[s])
>se=sqrt(var(y[s])*(6194-100)/(6194*100))
>ybar
[1] 654.47
>var(y[s])
[1] 16179.28
>se
[1] 12.61668

33 33 The coefficient of variation for the estimate: A measure of the relative variability of an estimate. It does not depend on the unit of measurement. More stable over repeated surveys, can be used for planning, for example determining sample size More meaningful when estimating proportions Absolute value of sampling error is not informative when not related to value of the estimate For example, SE =2 is small if estimate is 1000, but very large if estimate is 3

34 34 Estimation of a population proportion p with a certain characteristic A p = (number of units in the population with A)/N Let y i = 1 if unit i has characteristic A, 0 otherwise Then p is the population mean of the y i ’s. Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as

35 35 So the unbiased estimate of the variance of the estimator:

36 36 Examples A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by: Confidence interval requires normal approximation. Can use the guideline from binomial distribution, when N-n is large:

37 37 In this example: n = 1000 and N = 4,000,000 Ex: Psychiatric Morbidity Survey 1993 from Great Britain p = proportion with psychiatric problems n = 9792 (partial nonresponse on this question: 316) N ≈ 40,000,000
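A small R sketch of the poll example (slide 36). The SE and CI values are computed here, not taken from the slides; the variance estimate uses p(1-p)/(n-1) with the finite population correction, which for n = 1000 is essentially p(1-p)/n:

n=1000; N=4000000
phat=280/n                               # 0.28
se=sqrt((1-n/N)*phat*(1-phat)/(n-1))     # approx 0.0142
CI=phat+qnorm(c(0.025,0.975))*se         # approx (0.252, 0.308)
round(c(phat,se,CI),4)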

38 38 General probability sampling Sampling design: p(s) - known probability of selection for each subset s of the population U Actually: The sampling design is the probability distribution p(.) over all subsets of U Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0. The inclusion probability:

39 39 Illustration U = {1,2,3,4} Sample of size 2; 6 possible samples Sampling design: p({1,2}) = ½, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8 The inclusion probabilities:
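A short R check of these inclusion probabilities, computed directly from the design above:

samples=list(c(1,2),c(2,3),c(3,4),c(1,4))
p=c(1/2,1/4,1/8,1/8)        # p(s) for the four samples with positive probability
pi=sapply(1:4,function(i) sum(p[sapply(samples,function(s) i %in% s)]))
pi                          # 0.625 0.750 0.375 0.250
sum(pi)                     # 2, the fixed sample size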

40 40 Some results

41 41 Estimation theory probability sampling in general Problem: Estimate a population quantity for the variable y For the sake of illustration: The population total

42 42 CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases Because, typically we have that

43 43 Some peculiarities in the estimation theory Example: N=3, n=2, simple random sample

44 44 For this set of values of the y i ’s:

45 45 Let y be the population vector of the y-values. This example shows that is not uniformly best ( minimum variance for all y) among linear design-unbiased estimators Example shows that the ”usual” basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models In fact, we have the following much stronger result: Theorem: Let p(. ) be any sampling design. Assume each y i can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t

46 46 Proof: This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible

47 47 Determining sample size The sample size has a decisive effect on the cost of the survey How large n should be depends on the purpose for doing the survey In a poll for determining voting preference, n = 1000 is typically enough In the quarterly labor force survey in Norway, n = 24000 Mainly three factors to consider: 1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest 2. Homogeneity of the population. Needs smaller samples if little variation in the population 3. Estimation for subgroups, domains, of the population

48 48 It is often factor 3 that puts the highest demand on the survey If we want to estimate totals for domains of the population we should take a stratified sample A sample from each domain A stratified random sample: From each domain a simple random sample

49 49 Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p Let n be the sample size of this stratum, and assume that n/N is negligible Desired accuracy for this stratum: 95% CI for p should be The accuracy requirement:

50 50 The estimate is unknown in the planning phase Use the conservative size 384 or a planning value p 0 with n = 1536 p 0 (1- p 0 ) F.ex.: With p 0 = 0.2: n = 246 In general with accuracy requirement d, 95% CI
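A small R sketch of this calculation, assuming the accuracy requirement behind the slide's numbers is a 95% CI of half-width d = 0.05, so that n = 1.96^2 p 0 (1-p 0 )/d^2 ≈ 1536 p 0 (1-p 0 ):

nreq=function(p0,d) round(1.96^2*p0*(1-p0)/d^2)
nreq(0.5,0.05)   # conservative choice p0=0.5: 384
nreq(0.2,0.05)   # planning value p0=0.2: 246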

51 51 With e = 0.1, then we require approximately that

52 52 Example: Monthly unemployment rate Important to detect changes in unemployment rates from month to month. Planning value p 0 = 0.05

53 53 Two basic estimators: Ratio estimator Horvitz-Thompson estimator Ratio estimator in simple random samples H-T estimator for unequal probability sampling: The inclusion probabilities are unequal The goal is to estimate a population total t for a variable y

54 54 Ratio estimator Suppose we have known auxiliary information for the whole population: Ex: age, gender, education, employment status The ratio estimator for the y-total t:

55 55 We can express the ratio estimator on the following form: It adjusts the usual “sample mean estimator” in the cases where the x-values in the sample are too small or too large. Reasonable if there is a positive correlation between x and y Example: University of 4000 students, SRS of 400 Estimate the total number t of women that is planning a career in teaching, t=Np, p is the proportion y i = 1 if student i is a woman planning to be a teacher, t is the y-total

56 56 Results: 84 out of 240 women in the sample plan to be a teacher HOWEVER: It was noticed that the university has 2700 women (67.5%) while in the sample we had 60% women. A better estimate that corrects for the underrepresentation of women is obtained by the ratio estimate using the auxiliary x = 1 if student is a woman
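The numbers on this slide give the following comparison (computed here for illustration):

N=4000; n=400
tx=2700               # known number of women in the population (x-total)
sumy=84; sumx=240     # sample counts
N*sumy/n              # sample mean based estimate of t: 840
(sumy/sumx)*tx        # ratio estimate of t: 945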

57 57 In business surveys it is very common to use a ratio estimator. Ex: y i = amount spent on health insurance by business i x i = number of employees in business i We shall now do a comparison between the ratio estimator and the sample mean based estimator. We need to derive expectation and variance for the ratio estimator

58 58 First: Must define the population covariance The population correlation coefficient:

59 59

60 60 It follows that Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x- sample mean is small Certainly true for large n

61 61

62 62 Note: The ratio estimator is very precise when the population points (y i, x i ) lie close around a straight line thru the origin with slope R. The regression model generates the ratio estimator

63 63 The ratio estimator is more accurate if Rx i predicts y i better than m y does

64 64 Estimated variance for the ratio estimator

65 65 For large n, N-n: Approximate normality holds and an approximate 95% confidence interval is given by

66 R-computing of estimate, variance estimate and confidence interval
>y=apipop$api00
>x=apipop$col.grad  #col.grad = percent of parents with college degree
#calculating the ratio estimator:
>s=c(20,2000,3900,5000)
>N=6194
>n=4
>r=mean(y[s])/mean(x[s])
>#ratio estimate of t/N:
>muhatr=r*mean(x)
>muhatr
[1] 542.3055
#variance estimate
>ssqr=(1/(n-1))*sum((y[s]-r*x[s])^2)
>varestr=(mean(x)/mean(x[s]))^2*(1-n/N)*ssqr/n
>ser=sqrt(varestr)
>ser
[1] 63.85705
#confidence interval:
>CI=muhatr+qnorm(c(0.025,0.975))*ser
>CI
[1] 417.1479 667.4630

67 67 y = api00 for 6194 California schools. 10000 simulations of SRS. Confidence level of approximate 95% CI:
n      Conf. level
10     0.927
30     0.946
50     0.946
100    0.947
1000   0.947
2000   0.948

68 68 R-code for simulations to estimate true confidence level of 95% CI, based on the ratio estimator
>simtratio=function(b,n,N)
+{
+muhatr=numeric(b)
+se=numeric(b)
+r=numeric(b)       # initialize result vectors
+ssqr=numeric(b)
+for (k in 1:b){
+s=sample(1:N,n)
+r[k]=mean(y[s])/mean(x[s])
+muhatr[k]=r[k]*mean(x)
+ssqr[k]=(1/(n-1))*sum((y[s]-r[k]*x[s])^2)
+se[k]=sqrt((mean(x)/mean(x[s]))^2*(1-n/N)*ssqr[k]/n)
+}
+sum(mean(y)<muhatr+1.96*se)-sum(mean(y)<muhatr-1.96*se)
+}

69 69 Unequal probability sampling Example: Psychiatric Morbidity Survey: Selected individuals from households Inclusion probabilities:

70 70 Horvitz-Thompson estimator- unequal probability sampling Let’s try and use Bias is large if inclusion probabilities tend to increase or decrease systematically with y i

71 71 Use weighting to correct for bias:
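A minimal sketch of this weighting: divide each sampled y-value by its inclusion probability and sum. The y-values below are made up; the inclusion probabilities are those of the design on slide 39:

httotal=function(y,pi,s) sum(y[s]/pi[s])
y=c(10,20,30,40)
pi=c(5/8,3/4,3/8,1/4)
httotal(y,pi,s=c(2,3))   # 20/(3/4)+30/(3/8) = 106.67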

72 72 Horvitz-Thompson estimator is widely used f.ex., in official statistics

73 73 Note that the variance is small if we determine the inclusion probabilities such that Of course, we do not know the value of y i when planning the survey, use known auxiliary x i and choose

74 74 Example: Population of 3 elephants, to be shipped. Needs an estimate for the total weight Weighing an elephant is no simple matter. Owner wants to estimate the total weight by weighing just one elephant. Knows from earlier: Elephant 2 has a weight y 2 close to the average weight. Wants to use this elephant and use 3y 2 as an estimate However: To get an unbiased estimator, all inclusion probabilities must be positive.

75 75 Sampling design: The weights: 1,2, 4 tons, total = 7 tons H-T estimator: Hopeless! Always far from true total of 7 Can not be used, even though

76 76 Problem: The planned estimator, even though not a SRS: Possible values: 3, 6, 12

77 77

78 78 Variance estimate for H-T estimator Assume the size of the sample is determined in advance to be n.

79 79 Can always compute the variance estimate!! Since, necessarily p ij > 0 for all i,j in the sample s But: If not all p ij > 0, should not use this estimate! It can give very incorrect estimates The variance estimate can be negative, but for most sampling designs it is always positive

80 80 A modified H-T estimator Consider first estimating the population mean An obvious choice: Alternative: Estimate N as well, whether N is known or not

81 81

82 82 If sample size varies then the “ratio” estimator performs better than the H-T estimator, the ratio is more stable than the numerator Example:

83 83 H-T estimator varies because n varies, while the modified H-T is perfectly stable

84 84 Review of Advantages of Probability Sampling Objective basis for inference Permits unbiased or approximately unbiased estimation Permits estimation of sampling errors of estimators –Use central limit theorem for confidence interval –Can choose n to reduce SE or CV for estimator

85 85 Outstanding issues in design-based inference Estimation for subpopulations, domains Choice of sampling design – –discuss several different sampling designs –appropriate estimators More on use of auxiliary information to improve estimates More on variance estimation

86 86 Estimation for domains Domain (subpopulation): a subset of the population of interest Ex: Population = all adults aged 16-64 Examples of domains: –Women –Adults aged 35-39 –Men aged 25-29 –Women of a certain ethnic group –Adults living in a certain city Partition population U into D disjoint domains U 1,…,U d,..., U D of sizes N 1,…,N d,…,N D

87 87 Estimating domain means Simple random sample from the population e.g., proportion of divorced women with psychiatric problems. Note: n d is a random variable

88 88 The estimator is a ratio estimator:

89 89

90 90 Can then treat s d as a SRS from U d Whatever size of n is, conditional on n d, s d is a SRS from U d – conditional inference Example: Psychiatric Morbidity Survey 1993. Proportions with psychiatric problems:
Domain d          n d     SE
Women             4933    0.18
Divorced women    314     0.29
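A sketch of this conditional, domain-level calculation in R (hypothetical inputs: y is the 0/1 indicator of psychiatric problems and d the 0/1 domain indicator, both for the units in the sample; the finite population correction is ignored):

domainprop=function(y,d){
 nd=sum(d)                        # random domain sample size
 phat=mean(y[d==1])               # domain sample proportion
 se=sqrt(phat*(1-phat)/(nd-1))    # SE treating s_d as an SRS from U_d
 c(nd=nd,estimate=phat,SE=se)
}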

91 91 Estimating domain totals N d is known: Use N d unknown, must be estimated

92 92 Stratified sampling Basic idea: Partition the population U into H subpopulations, called strata. N h = size of stratum h, known Draw a separate sample from each stratum, s h of size n h from stratum h, independently between the strata In social surveys: Stratify by geographic regions, age groups, gender Ex – business survey. Canadian survey of employment. Establishments stratified by –Standard Industrial Classification – 16 industry divisions –Size – number of employees, 4 groups: 0-19, 20-49, 50-199, 200+ –Province – 12 provinces Total number of strata: 16x4x12=768

93 93 Reasons for stratification 1.Strata form domains of interest for which separate estimates of given precision is required, e.g. strata = geographical regions 2.To “spread” the sample over the whole population. Easier to get a representative sample 3.To get more accurate estimates of population totals, reduce sampling variance 4.Can use different modes of data collection in different strata, e.g. telephone versus home interviews

94 94 Stratified simple random sampling The most common stratified sampling design SRS from each stratum Notation:

95 95 t h = y-total for stratum h: Consider estimation of t h : Assuming no auxiliary information in addition to the “stratifying variables” The stratified estimator of t:

96 96 A weighted average of the sample stratum means. Properties of the stratified estimator follows from properties of SRS estimators. Notation:

97 97 Estimated variance is obtained by estimating the stratum variance with the stratum sample variance Approximate 95% confidence interval if n and N-n are large:

98 98 Estimating population proportion in stratified simple random sampling p h : proportion in stratum h with a certain characteristic A p is the population mean: p = t/N Stratum mean estimator: Stratified estimator of the total t = number of units in the population with characteristic A:

99 99 Estimated variance:

100 100 Allocation of the sample units Important to determine the sizes of the stratum samples, given the total sample size n and given the strata partitioning –how to allocate the sample units to the different strata Proportional allocation –A representative sample should mirror the population –Strata proportions: W h =N h /N –Strata sample proportions should be the same: n h /n = W h –Proportional allocation:
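A one-line R version of proportional allocation; with the API strata used later (slide 105) it reproduces n 1 = 221, n 2 = 51, n 3 = 38:

propalloc=function(n,Nh) round(n*Nh/sum(Nh))
propalloc(310,c(4421,1018,755))   # 221 51 38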

101 101 The stratified estimator under proportional allocation The equally weighted sample mean (sample is self-weighting: every unit in the sample represents the same number of units in the population, N/n)

102 102 Variance and estimated variance under proportional allocation

103 103 The estimator in simple random sample: Under proportional allocation: but the variances are different:

104 104 Total variance = variance within strata + variance between strata Implications: 1.No matter what the stratification scheme is: Proportional allocation gives more accurate estimates of population total than SRS 2.Choose strata with little variability, so the stratum variances are small. Then the strata means will vary more, the between-strata variance becomes larger, and the precision of the estimates increases compared to SRS

105 105 Constructing stratification and drawing stratified sample in R Use API in California schools as example with schooltype as stratifier. 3 strata: Elementary, middle and high schools. Stratum1: Elementary schools, N 1 =4421 Stratum 2: Middle schools, N 2 = 1018 Stratum 3: High schools, N 3 = 755 5% stratified sample with proportional allocation: n 1 = 221 n 2 = 51 n 3 = 38 n = 310

106 106 R-code: making strata
>x=apipop$stype
# To make a stratified variable from schooltype:
>make123 = function(x)
+{
+ x=as.factor(x)
+ levels_x = levels(x)
+ x=as.numeric(x)
+ attr(x,"levels") = levels_x
+ x
+}
> strata=make123(x)
> y=apipop$api00
> tapply(y,strata,mean)
        1        2        3
 672.0627 633.7947 655.7230
# 1=E, 2=H, 3=M. Will change stratum 2 and 3

107 107
> x1=as.numeric(strata<1.5)
> x2=as.numeric(strata<2.5)-x1
> x3=as.numeric(strata>2.5)
> stratum=x1+2*x3+3*x2
> tapply(y,stratum,mean)
        1        2        3
 672.0627 655.7230 633.7947
> # stratified random sample with proportional allocation
> N1=4421
> N2=1018
> N3=755
> n1=221
> n2=51
> n3=38
> s1=sample(N1,n1)
> s2=sample(N2,n2)
> s3=sample(N3,n3)

108 108
> y1=y[stratum==1]
> y2=y[stratum==2]
> y3=y[stratum==3]
> y1s=y1[s1]
> y2s=y2[s2]
> y3s=y3[s3]
> t_hat1=N1*mean(y1[s1])
> t_hat2=N2*mean(y2[s2])
> t_hat3=N3*mean(y3[s3])
> t_hat=t_hat1+t_hat2+t_hat3
> muhat=t_hat/6194
> muhat
[1] 661.8897
> mean(y1s)
[1] 671.1493
> mean(y2s)
[1] 652.6078
> mean(y3s)
[1] 620.1842

109 109
> varest1=N1^2*var(y1s)*(N1-n1)/(N1*n1)
> varest2=N2^2*var(y2s)*(N2-n2)/(N2*n2)
> varest3=N3^2*var(y3s)*(N3-n3)/(N3*n3)
> se=sqrt(varest1+varest2+varest3)
> se
[1] 44915.56
> semean=se/6194
> semean
[1] 7.251463
> CI=muhat+qnorm(c(0.025,0.975))*semean
> CI
[1] 647.6771 676.1023
#CI = (647.7, 676.1)

110 110 Suppose we regard the sample as a SRS
> z=c(y1s,y2s,y3s)
> mean(z)
[1] 661.8516
> var(z)
[1] 17345.13
> sesrs=sqrt(var(z)*(6194-310)/(6194*310))
> sesrs
[1] 7.290523
Compared to 7.25 for the stratified SE. Note: the estimate is the same, 661.9, since we have proportional allocation

111 111 Optimal allocation If the only concern is to estimate the population total t: Choose n h such that the variance of the stratified estimator is minimum Solution depends on the unknown stratum variances If the stratum variances are approximately equal, proportional allocation minimizes the variance of the stratified estimator

112 112 Result follows since the sample sizes must add up to n

113 113 Called Neyman allocation (Neyman, 1934) Should sample heavily in strata if –The stratum accounts for a large part of the population –The stratum variance is large If the stratum variances are equal, this is proportional allocation Problem, of course: Stratum variances are unknown –Take a small preliminary sample (pilot) –The variance of the stratified estimator is not very sensitive to deviations from the optimal allocation. Need just crude approximations of the stratum variances

114 114 Optimal allocation when considering the cost of a survey C represents the total cost of the survey, fixed – our budget c 0 : overhead cost, like maintaining an office c h : cost of taking an observation in stratum h –Home interviews: traveling cost +interview –Telephone or postal surveys: c h is the same for all strata – In some strata: telephone, in others home interviews Minimize the variance of the stratified estimator for a given total cost C

115 115 Solution:

116 116 We can express the optimal sample sizes in relation to n In particular, if c h = c for all h:
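A sketch of this cost-based allocation, with n h proportional to N h S h /sqrt(c h ); equal costs reduce it to Neyman allocation. The stratum standard deviations Sh below are hypothetical:

optalloc=function(n,Nh,Sh,ch=rep(1,length(Nh))){
 w=Nh*Sh/sqrt(ch)          # allocation weights
 round(n*w/sum(w))
}
optalloc(310,Nh=c(4421,1018,755),Sh=c(110,130,120))   # equal costs: Neyman allocation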

117 117 Other issues with optimal allocation Many survey variables Each variable leads to a different optimal solution –Choose one or two key variables –Use proportional allocation as a compromise If n h > N h, let n h =N h and use optimal allocation for the remaining strata If n h =1, can not estimate variance. Force n h =2 or collapse strata for variance estimation Number of strata: For a given n often best to increase number of strata as much as possible. Depends on available information

118 118 Sometimes the main interest is in precision of the estimates for stratum totals and less interest in the precision of the estimate for the population total Need to decide n h to achieve desired accuracy for estimate of t h, discussed earlier –If we decide to do proportional allocation, it can mean in small strata (small N h ) the sample size n h must be increased

119 119 Poststratification Stratification reduces the uncertainty of the estimator compared to SRS In many cases one wants to stratify according to variables that are not known or used in sampling Can then stratify after the data have been collected Hence, the term poststratification The estimator is then the usual stratified estimator according to the poststratification If we take a SRS and N-n and n are large, the estimator behaves like the stratified estimator with proportional allocation

120 120 Poststratification to reduce nonresponse bias Poststratification is mostly used to correct for nonresponse Choose strata with different response rates Poststratification amounts to assuming that the response sample in poststratum h is representative for the nonresponse group in the sample from poststratum h

121 121 Systematic sampling Idea: Order the population and select every kth unit Procedure: U = {1,…,N} and N=nk + c, c < n 1.Select a random integer r between 1 and k, with equal probability 2.Select the sample s r by the systematic rule s r = {i: i = r + (j-1)k: j= 1, …, n r } where the actual sample size n r takes values [N/k] or [N/k] +1 k : sampling interval = [N/n] Very easy to implement: Visit every 10th house or interview every 50th name in the telephone book

122 122 k distinct samples each selected with probability 1/k Unlike in SRS, many subsets of U have zero probability Examples: 1) N =20, n=4. Then k=5 and c=0. Suppose we select r =1. Then the sample is {1,6,11,16} 5 possible distinct samples. In SRS: 4845 distinct samples 2) N= 149, n = 12. Then k = 12, c=5. Suppose r = 3. s 3 = {3,15,27,39,51,63,75,87,99,111,123,135,147} and sample size is 13 3) N=20, n=8. Then k=2 and c = 4. Sample size is n r =10 4) N= 100 000, n = 1500. Then k = 66, c=1000 and c/k =15.15 with [c/k]=15. n r = 1515 or 1516

123 123 Estimation of the population total These estimators are approximately the same:

124 124 Advantage of systematic sampling: Can be implemented even where no population frame exists E.g. sample every 10 th person admitted to a hospital, every 100 th tourist arriving at LA airport.

125 125 The variance is small if Or, equivalently, if the values within the possible samples s r are very different; the samples are heterogeneous Problem: The variance cannot be estimated properly because we have only one observation of t(s r )

126 126 Systematic sampling as Implicit Stratification In practice: Very often when using systematic sampling (common design in national statistical institutes): The population is ordered such that the first k units constitute a homogeneous “stratum”, the second k units another “stratum”, etc.
Implicit strata          Units
1                        1,2,…,k
2                        k+1,…,2k
:                        :
n (= N/k assumed)        (n-1)k+1,…,nk
Systematic sampling selects 1 unit from each stratum at random

127 127 Systematic sampling vs SRS Systematic sampling is more efficient if the study variable is homogeneous within the implicit strata –Ex: households ordered according to house numbers within neighbourhoods and study variable related to income Households in the same neighbourhood are usually homogeneous with respect to socio-economic variables If population is in random order (all N! permutations are equally likely): systematic sampling is similar to SRS Systematic sampling can be very bad if y has periodic variation relative to k: –Approximately: y 1 = y k+1, y 2 = y k+2, etc

128 128 Variance estimation No direct estimate, impossible to obtain unbiased estimate If population is in random order: can use the variance estimate from SRS as an approximation Develop a conservative variance estimator by collapsing the “implicit strata”, overestimating the variance The most promising approach may be: Under a statistical model, estimate the expected value of the design variance Typically, systematic sampling is used in the second stage of two-stage sampling (to be discussed later), may not be necessary to estimate this variance then.

129 129 Cluster sampling and multistage sampling Sampling designs so far: Direct sampling of the units in a single stage of sampling For economical and practical reasons it may be necessary to modify these sampling designs –There exists no population frame (register: list of all units in the population), and it is impossible or very costly to produce such a register –The population units are scattered over a wide area, and a direct sample will also be widely scattered. In case of personal interviews, the traveling costs would be very high and it would not be possible to visit the whole sample

130 130 Modified sampling can be done by 1.Selecting the sample indirectly in groups, called clusters; cluster sampling –Population is grouped into clusters –Sample is obtained by selecting a sample of clusters and observing all units within the clusters –Ex: In Labor Force Surveys: Clusters = Households, units = persons 2.Selecting the sample in several stages; multistage sampling

131 131 3.In two-stage sampling: Population is grouped into primary sampling units (PSU) Stage 1: A sample of PSUs Stage 2: For each PSU in the sample at stage 1, we take a sample of population units, now also called secondary sampling units (SSU) Ex: PSUs are often geographical regions

132 132 Examples 1.Cluster sampling. Want a sample of high school students in a certain area, to investigate smoking and alcohol use. If a list of high school classes is available, we can then select a sample of high school classes and give the questionnaire to every student in the selected classes; cluster sampling with high school class being the clusters 2.Two-stage cluster sampling. If a list of classes is not available, we can first select high schools, then classes and finally all students in the selected classes. Then we have a 2-stage cluster sample. 1.PSU = high school 2.SSU = classes 3.Units = students

133 133 Psychiatric Morbidity Survey is a 4-stage sample –Population: adults aged 16-64 living in private households in Great Britain –PSUs = postal sectors –SSUs = addresses –3SUs = households –Units = individuals Sampling process: 1)200 PSUs selected 2)90 SSUs selected within each sampled PSU (interviewer workload) 3)All households selected per SSU 4)1 adult selected per household

134 134 Cluster sampling Number of clusters in the population : N Number of units in cluster i: M i

135 135 Simple random cluster sampling Ratio-to-size estimator Use auxiliary information: Size of the sampled clusters Approximately unbiased with approximate variance

136 136 Note that this ratio estimator is in fact the usual sample mean based estimator with respect to the y- variable And corresponding estimator of the population mean of y is Can be used also if M is unknown
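A minimal numerical sketch of this ratio-to-size estimator: divide the sum of the sampled cluster totals by the sum of the sampled cluster sizes. The cluster totals and sizes below are made up:

ti=c(120,80,200,150)    # y-totals of the sampled clusters
Mi=c(10,8,15,12)        # sizes of the sampled clusters
muhat=sum(ti)/sum(Mi)   # estimated population mean of y
muhat                   # multiply by M (if known) to estimate the total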

137 137 Estimator’s variance is highly influenced by how the clusters are constructed. Note: The opposite in stratified sampling Typically, clusters are formed by “nearby units” like households, schools, hospitals because of economical and practical reasons, with little variation within the clusters: Simple random cluster sampling will lead to much less precise estimates compared to SRS, but this is offset by big cost reductions Sometimes SRS is not possible; information only known for clusters

138 138 Design Effects A design effect (deff) compares efficiency of two design-estimation strategies (sampling design and estimator) for same sample size Now: Compare Strategy 1:simple random cluster sampling with ratio estimator Strategy 2: SRS, of same sample size m, with usual sample mean estimator In terms of estimating population mean: The design effect of simple random cluster sampling, SCS, is then Estimated deff:

139 139 Selecting a systematic sample in R Want to sample rainfall every 10th day over a year.
> N=365
> k=10
> start=sample(k,1)
> start
[1] 2
> s=seq(start,N,k)
> s
 [1]   2  12  22  32  42  52  62  72  82  92 102 112 122 132 142 152 162 172 182
[20] 192 202 212 222 232 242 252 262 272 282 292 302 312 322 332 342 352 362

140 140 Two-stage sampling Basic justification: With homogeneous clusters and a given budget, it is inefficient to survey all units in the cluster – can instead select more clusters Population partitioned into N primary sampling units (PSU) Stage 1: Select a sample s I of PSUs Stage 2: For each selected PSU i in s I : Select a sample s i of units (secondary sampling units, SSU) The cluster totals t i must be estimated from the sample

141 141 General two-stage sampling plan:

142 142 Suggested estimator for population total t : Unbiased estimator
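A sketch for the special case with SRS at both stages (not the general formula on the slide): estimate each sampled cluster total by M i times the stage-2 sample mean, then scale by N/n:

twostagetotal=function(N,Mi,ybari){
 n=length(Mi)           # number of sampled PSUs
 (N/n)*sum(Mi*ybari)    # Mi*ybari estimates the PSU total t_i
}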

143 143 1.The first component expresses the sampling uncertainty on stage 1, since we are selecting a sample of PSU’s. It is the variance of the HT-estimator with t i as observations 2.The second component is stage 2 variance and tells us how well we are able to estimate each t i in the whole population 3.The second component is often negligible because of little variability within the clusters

144 144 A special case: Clusters of equal size and SRS on stage 1 and stage 2 Self-weighting sample: equal inclusion probabilities for all units in the population

145 145 Unequal cluster sizes. PPS – SRS sampling In social surveys: good reasons to have equal inclusion probabilities (self-weighting sample) for all units in the population (similar representation to all domains) Stage 1: Select PSUs with probability proportional to size M i Stage 2: SRS (or systematic sample) of SSUs Such that sample is self-weighting m i = m/n = equal sample sizes in all selected PSUs

146 146 Remarks Usually one interviewer for each selected PSU First stage sampling is often stratified PPS With self-weighting PPS-SRS: –equal workload for each interviewer –Total sample size m is fixed

147 147 II. Likelihood in statistical inference and survey sampling Problems with design-based inference Likelihood principle, conditionality principle and sufficiency principle Fundamental equivalence Likelihood and likelihood principle in survey sampling

148 148 Traditional approach Design-based inference Population (Target population): The universe of all units of interest for a certain study: U = {1,2, …, N} –All units can be identified and labeled –Variable of interest y with population values –Typical problem: Estimate total t or population mean t/N Sample: A subset s of the population, to be observed Sampling design p(s) is known for all possible subsets; –The probability distribution of the stochastic sample

149 149 Problems with design-based inference Generally: Design-based inference is with respect to hypothetical replications of sampling for a fixed population vector y Variance estimates may fail to reflect information in a given sample If we want to measure how a certain estimation method does in quarterly or monthly surveys, then y will vary from quarter to quarter or month to month –need to assume that y is a realization of a random vector Use: Likelihood and likelihood principle as guideline on how to deal with these issues

150 150 Problem with design-based variance measure Illustration 1 a) N +1 possible samples: {1}, {2},…,{N}, {1,2,…N} b)Sampling design: p({i}) =1/2N, for i = 1,..,N ; p({1,2,…N})= 1/2 d)Assume we select the “sample” {1,2,…,N}. Then we claim that the “precision” of the resulting sample (known to be without error) is

151 151 Problem with design-based variance measure Illustration 2 Both experts select the same sample, compute the same estimate, but give different measures of precision…

152 152 The likelihood principle, LP general model LP: The likelihood function contains all information about the unknown parameters More precisely: Two proportional likelihood functions for q, from the same or different experiments, should give identically the same statistical inference The likelihood function, with data x: l is quite a different animal than f !! Measures the likelihood of different q values in light of the data x

153 153 Maximum likelihood estimation satisfies LP, using the curvature of the likelihood as a measure of precision (Fisher) LP is controversial, but hard to argue against because of the fundamental result by Birnbaum, 1962: LP follows from sufficiency (SP) and conditionality principles (CP) that ”no one” disagrees with. SP: Statistical inference should be based on sufficient statistics CP: If you have 2 possible experiments and choose one at random, the inference should depend only on the chosen experiment

154 154 Illustration of CP A choice is to be made between a census or taking a sample of size 1. Each with probability ½. Census is chosen Unconditional approach:

155 155 The Horvitz-Thompson estimator: Conditional approach: p i = 1 and HT estimate is t

156 156 LP, SP and CP Likelihood principle: This includes the case where E 1 = E 2 and x 1 and x 2 are two different observations from the same experiment

157 157 Sufficiency principle: Let T be a sufficient statistic for q in the experiment E. Assume T(x 1 ) = T(x 2 ). Then I(E, x 1 ) = I(E, x 2 ). Conditionality principle:

158 158

159 159

160 160 Consequences for statistical analysis Statistical analysis, given the observed data: The sample space is irrelevant The usual criteria like confidence levels and P-values do not necessarily measure reliability for the actual inference given the observed data Frequentistic measures evaluate methods –not necessarily relevant criteria for the observed data

161 161 Illustration- Bernoulli trials

162 162 The likelihood functions: Proportional likelihoods: LP: Inference about q should be identical in the two cases Frequentistic analyses give different results: because different sample spaces: (0,1,..,12) and (0,1,...)
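The slide's likelihood formulas are not in this transcript; assuming the standard setup of 3 successes in 12 Bernoulli trials (binomial, fixed n = 12) versus sampling until the 3rd success, which occurred on trial 12 (negative binomial), a quick R check confirms the proportionality:

theta=seq(0.01,0.99,0.01)
likbin=dbinom(3,size=12,prob=theta)   # choose(12,3)*theta^3*(1-theta)^9
liknb=dnbinom(9,size=3,prob=theta)    # choose(11,2)*theta^3*(1-theta)^9, 9 failures before 3rd success
range(likbin/liknb)                   # constant ratio 4: proportional likelihoods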

163 163 Frequentistic vs. likelihood Frequentistic approach: Statistical methods are evaluated pre-experimental, over the sample space LP evaluates statistical methods post-experimental, given the data History and discussion after Birnbaum, 1962: An overview in ”Breakthroughs in Statistics, 1890-1989, Springer 1991”

164 164 Likelihood function in design-based inference Unknown parameter: Data: Likelihood function = Probability of the data, considered as a function of the parameters Sampling design: p(s) Likelihood function: All possible y are equally likely !!

165 165 Likelihood principle, LP: The likelihood function contains all information about the unknown parameters According to LP: –The design-model is such that the data contains no information about the unobserved part of y, y unobs –One has to assume in advance that there is a relation between the data and y unobs : As a consequence of LP: Necessary to assume a model – The sampling design is irrelevant for statistical inference, because two sampling designs leading to the same s will have proportional likelihoods

166 166 Let p 0 and p 1 be two sampling designs. Assume we get the same sample s in either case. Then the data x are the same and W x is the same for both experiments. The likelihood function for sampling design p i, i = 0,1:

167 167 Same inference under the two different designs. This is in direct opposition to usual design-based inference, where the only stochastic evaluation is thru the sampling design, for example the Horvitz-Thompson estimator Concepts like design unbiasedness and design variance are irrelevant according to LP when it comes to do the actual statistical analysis. Note: LP is not concerned about method performance, but the statistical analysis after the data have been observed This does not mean the sampling design is not important. It is important to assure we get a good representative sample. But once the sample is collected the sampling design should not play a role in the inference phase, according to LP

168 168 Model-based inference Assumes a model for the y vector Conditioning on the actual sample Use modeling to combine information Problem: dependence on model –Introduces a subjective element – almost impossible to model all variables in a survey Design approach is “objective” in a perfect world of no nonsampling errors

169 169 III. Model-based inference in survey sampling Model-based approach. Also called the prediction approach –Assumes a model for the y vector –Use modeling to construct estimator –Ex: ratio estimator Model-based inference –Inference is based on the assumed model –Treating the sample s as fixed, conditioning on the actual sample Best linear unbiased predictors Variance estimation for different variance measures

170 170 Model-based approach We can decompose the total t as follows: Treat the sample s as fixed Two stochastic elements: [Model-assisted approach: use the distribution assumption of Y to construct estimator, and evaluate according to distribution of s, given the realized vector y]

171 171 The unobserved z is a realized value of the random variable Z, so the problem is actually to predict the value z of Z. Can be done by predicting each unobserved y i : The prediction approach, the prediction based estimator

172 172 Remarks: 1. Any estimator can be expressed on the “prediction form: 2. Can then use this form to see if the estimator makes any sense

173 173 Ex 1. Ex.2 Reasonable sampling design when y and x are positively correlated

174 174 Three common models I.A model for business surveys, the ratio model: assume the existence of an auxiliary variable x for all units in the population.

175 175 II. A model for social surveys, simple linear regression: III. Common mean model: Ex: x i is a measure of the “size” of unit i, and y i tends to increase with increasing x i. In business surveys, the regression goes thru the origin in many cases

176 176 Model-based estimators (predictors) 1. Predictor: 2. Model parameters: q 4.Model variance of model-unbiased predictor is the variance of the prediction error, also called the prediction variance 5.From now on, skip s in the notation: all expectations and variances are given the selected sample s, for example

177 177 Prediction variance as a variance measure for the actual observed sample N +1 possible samples: {1}, {2},…,{N}, {1,2,…N} Assume we select the “sample” {1,2,…,N}. Prediction variance: Illustration 1, slide 151 Illustration 2, slide 152: Exactly the same prediction variance for the two sampling designs

178 178 6.Definition: Linear predictor:

179 179 Suggested Predictor:

180 180

181 181 This is the least squares estimate based on

182 182 We shall show that

183 183 The prediction variance of model-unbiased predictor: To minimize the prediction variance is equivalent to minimizing Giving us

184 184 The prediction variance of the BLU predictor: A variance estimate is obtained by using the model- unbiased estimator for s 2

185 185 The central limit theorem applies such that for large n, N-n we have that Approximate 95% confidence interval for the value t of T: Also called a 95% prediction interval for the random variable T

186 186 Three special cases: 1) v(x) = x, the ratio model, 2) v(x)= x 2 and 3) x i =1 for all i, the common mean model 1.v(x) = x the usual ratio estimator

187 187 2.v(x) =x 2

188 188 When the sampling fraction f is small or when the x i values vary little, these two estimators are approximately the same. In the latter case: Also model-unbiased

189 189 3.x i =1 This is also the usual, design-based variance formula under SRS

190 190 We see that the variance estimate is given by Exactly the same as in the design-approach, but the interpretation is different

191 191 Simple Linear regression model BLU predictor:

192 192

193 193 We shall now show that this predictor is BLU Hence, any predictor can be expressed on this form and the predictor is linear if and only if b is linear in the Y i ’s

194 194 Prediction variance: So we need to minimize the prediction variance with respect to the c i ’s under (1) and (2)

195 195

196 196

197 197 The prediction variance is given by

198 198

199 199

200 200 Anticipated variance (method variance) We want a variance measure that tells us about the expected uncertainty in repeated surveys 3. This is called the anticipated variance. 4. It can be regarded as a variance measure that describes how the estimation method is doing in repeated surveys

201 201 And the anticipated MSE becomes the expected design- variance, also called the anticipated design variance

202 202 Example: Simple linear regression and simple random sample

203 203 Let us now study the BLU predictor.( It can be shown that it is approximately design-unbiased )

204 204

205 205 Remarks From a design-based approach, the sample mean based estimator is unbiased, while the linear regression estimator is not Considering only the design-bias, we might choose the sample mean based estimator The linear regression estimator would only be selected over the sample mean based estimator because it has smaller anticipated variance Hence, difficult to see design-unbiasedness as a means to choose among estimators

206 206 Robust variance estimation The model assumed is really a “working model” Especially, the variance assumption may be misspecified and it is not always easy to detect this kind of model failure – like constant variance –variance proportional to size measure x i Standard least squares variance estimates are sensitive to misspecification of the variance assumption Concerned with robust variance estimators

207 207 Variance estimation for the ratio estimator Working model: Under this working model, the unbiased estimator of the prediction variance of the ratio estimator is

208 208 This variance estimator is non-robust to misspecification of the variance model. Suppose the true model has Ratio estimator is still model-unbiased but prediction variance is now

209 209 Moreover,

210 210 Robust variance estimator for the ratio estimator

211 211 Suggests we may use: Leading to the robust variance estimator: Almost the same as the design variance estimator in SRS:

212 212 Can we do better? Require estimator to be exactly unbiased under ratio model, v(x) = x:

213 213 So a robust variance estimator that is exactly unbiased under the working model, v(x) = x: The prediction variance when v(x) = x:

214 214 General approach to robust variance estimation 1.Find robust estimators of Var(Y i ) that do not depend on model assumptions about the variance 4.Estimate only the leading term in the prediction variance, typically dominating, or estimate the second term from the more general model

215 215 Reference to robust variance estimation: Valliant, Dorfman and Royall (2000): Finite Population Sampling and Inference. A Prediction Approach, ch. 5

216 216 Model-assisted approach Design-based approach Use modeling to improve the basic HT- estimator. Assume the population values y are realized values of random Y Assume the existence of auxiliary variables, known for all units in the population Basic idea:

217 217 Final estimator, the regression estimator: Alternative expression:

218 218 Simple random sample

219 219 In general with this “ratio model”, in order to get approximately design-unbiased estimators:

220 220 Reference: Sarndal, Swensson and Wretman : Model Assisted Survey Sampling (1992, ch. 6), Springer Regression estimator is approximately unbiased Variance and variance estimation Variance estimation:

221 221 Approximate 95% CI, for large n, N-n: Remark: In SSW (1992,ch.6), an alternative variance estimator is mentioned that may be preferable in many cases

222 222 Common mean model The ratio model with x i =1. This is the modified H-T estimator (slides 80,81) Typically much better than the H-T estimator when different

223 223 Alternatively,

224 224 1.The model-assisted regression estimator often has the form 2.The prediction approach makes it clear: no need to estimate the observed y i Remarks: 3.Any estimator can be expressed on the “prediction form”: 4.Can then use this form to see if the estimator makes any sense

