STK 4600: Statistical methods for social sciences.

Survey sampling and statistical demography
Surveys for households and individuals
Anders Holmberg and Li-Chun Zhang (based on original notes by Jan F. Bjørnstad)

Survey sampling: 4 major topics
Traditional design-based statistical inference: 5 weeks
Likelihood considerations: 1 week
Model-based statistical inference: 2 weeks
Missing data (nonresponse): 1 week

Statistical demography
Mortality, life expectancy and population projections: 1 week

Course goals
Knowledge about:
planning surveys in social sciences
major sampling designs
basic concepts and the most important estimation methods in traditional applied survey sampling
the likelihood principle and its consequences for survey sampling
use of modeling in sampling
treatment of nonresponse
a basic knowledge of demography

What is a Survey?
The need for statistical information seems endless in modern society. One important mode of data collection is the sample survey. In many countries, a central statistical office is mandated by law to provide statistical information about the state of the nation, and surveys are an important part of this activity.

For example, in Canada, the 1971 Statistics Act mandates Statistics Canada to,
”… collect, compile, analyze, abstract, and publish statistical information relating to commercial, industrial, financial, social, economic, and general activities and condition of the people.”

What is a survey? (Dalenius 7 points)
1. A survey concerns a set of objects comprising a population. Examples:
   A finite set of objects like individuals, businesses or farms.
   Events occurring at specified time intervals, like crimes and accidents.
   Processes in the environment, like land use or the occurrences of wildlife species in an area.
2. The population under study has one or more measurable properties.
3. The goal is to describe the population by one or more parameters defined in terms of the measurable properties.
4. To get observational access to the population, a frame is needed.
5. A sample of objects is selected from the frame in accordance with a sampling design that specifies a probability mechanism and a sample size.
6. Observations are made on the sample in accordance with a measurement process.
7. Based on the measurements, an estimation process is applied to compute estimates of the parameters when making inference from the sample to the population.

Example: Labor force surveys
Population
Domains of interest
Variables
Population characteristics of interest
Sample

What is a Survey? (ASA) The word is most often used to describe a method of gathering information from a sample of individuals. The sample is usually just a fraction of the population being studied. Data can be collected in many ways – including telephone, by mail, by web or in person. The size of the sample depends on the purpose of the study.

The sample is scientifically chosen so that each unit in the population will have a measurable chance (>0) of selection. Information is collected by means of standardized procedures. The individual respondents should never be identified in the result. The results should be presented in completely anonymous summaries.

ASA (cont.) How large must the sample size be?
What are some common survey methods? What survey questions do you ask? What about confidentiality and integrity?

How to plan a survey
The first step is to lay out the objectives of the investigation: what do we want to know?
Define the target population.
Determine the mode of administration.
Develop the questionnaire.
Design the sampling approach.

Phases of a Survey
General Problem; Statistical Problem; Population; Variables; Tabulation Plan; Frame; Sample; Method of Measurement; Measurement Instrument; Data Collection; Coding, Data Entry; Editing, Updating; Quality, Documentation; Estimation/Tabulation; Analysis; Publication

How to plan a survey, cont.
How to plan a survey questionnaire? How to get good coverage? How to choose a random sample? How to plan in quality? How to schedule? How to budget?

How to collect survey data?
Mail surveys, telephone surveys, internet, interviewing, mixed mode … CATI, CAPI. Failure to follow up non-respondents can ruin an otherwise well-designed survey. Murphy’s Law: “If anything can go wrong it will” … but “If you didn’t check on it, it did”.

Margin of error
An estimate from a survey is unlikely to exactly equal the true population quantity of interest. The “margin of error” is a common summary of sampling error that quantifies uncertainty about a survey result. Both the sampling error and the non-sampling error in a survey affect how far the estimate can be from the true value.

Summary
Unfortunately, there are no absolute criteria that dictate the best choice of mode, questionnaire design, data collection protocol, and so on in each situation. Rather, survey design is guided by past experience, theory, and good advice on the advantages and disadvantages of alternative design choices, so that we can make intelligent decisions for each situation we encounter. The aim of a good design is to use practical and reliable processes whose outcomes are reasonably predictable.

Basic concepts in sampling
Population (target population): The universe of all units of interest for a certain study.
Denoted, with N being the size of the population: U = {1, 2, ..., N}
All units can be identified and labeled.
Ex: Political poll – All adults eligible to vote
Ex: Employment/Unemployment in Norway – All persons in Norway, age 15 or more
Ex: Consumer expenditure: Unit = household
Sample: A subset of the population, to be observed. The sample should be ”representative” of the population.

Sampling design: The sample is a probability sample if all units in the sample have been chosen with known probabilities, and such that each unit in the population has a positive probability of being chosen to the sample.
We shall only be concerned with probability sampling.
Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen to the sample.
The probability distribution for SRS on all subsets of U is an example of a sampling design, the probability plan for selecting a sample s from the population: $p(s) = 1/\binom{N}{n}$ if s has n units, and $p(s) = 0$ otherwise.
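The SRS design described above can be checked by brute force on a tiny population. The following is a minimal sketch, assuming a hypothetical population of N = 5 units and samples of size n = 2 (these sizes are illustrative only): it lists every possible sample, assigns each the same probability, and recovers the inclusion probability n/N for every unit.

```r
# Hypothetical tiny population: N = 5 units, SRS of size n = 2
N <- 5
n <- 2

# Each column of 'samples' is one possible sample s (a subset of size n)
samples <- combn(N, n)

# SRS design: every subset of size n has the same probability 1 / choose(N, n)
p_s <- rep(1 / choose(N, n), ncol(samples))

# Inclusion probability of unit i: sum of p(s) over all samples containing i
incl <- sapply(1:N, function(i) {
  contains_i <- apply(samples, 2, function(s) i %in% s)
  sum(p_s[contains_i])
})

print(incl)   # every entry should equal n / N
```

Enumeration is only feasible for very small N, but it makes the point that the design is a probability distribution over subsets, not over single units.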

Basic statistical problem: Estimation
A typical survey has many variables of interest. The aim of a sample is to obtain information regarding totals or averages of these variables for the whole population.
Example: Unemployment in Norway. We want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway, let $y_i = 1$ if person i is unemployed and $y_i = 0$ otherwise, so that $t = \sum_{i=1}^{N} y_i$.

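In general, the variable of interest is y, with $y_i$ equal to the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^{N} y_i$. The typical problem is to estimate t or t/N.
Sometimes it is also of interest to estimate ratios of totals. Example: estimating the rate of unemployment. With $x_i = 1$ if person i is in the labor force and $x_i = 0$ otherwise, the unemployment rate is $t_y/t_x = \sum_{i=1}^{N} y_i \big/ \sum_{i=1}^{N} x_i$.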

Sampling a finite population
[Figure: the target population U, the frame population UF, the sample s and the subset r]

Rules of Association
One-to-One: Example?
Many-to-One: Example?
One-to-Many: Example?
Many-to-Many: Example?

Important properties of a frame
The frame must be virtually complete: it must provide observational access to “almost all” objects in the target population. What degree of coverage is sufficient may be a matter of judgment. The frame must serve to yield a sample of objects, which can be unambiguously identified. The frame must be such that it is possible to determine how the units in the frame are associated with objects in the population. The statistician must know the exact chance that the sampled object had of being selected.

Desirable properties of a frame
The frame should be simple to use. The frame should contain “auxiliary information” to be used in the estimation process. The frame should be reasonably stable over time. Moreover, it should be easy and inexpensive to update the frame.

Sources of error in sample surveys
Coverage errors: target population U vs frame population UF
Access to the population is through a list of units, a register UF. U and UF may not be the same. Three possible errors in UF:
Undercoverage: some units in U are not in UF
Overcoverage: some units in UF are not in U
Duplicate listings: a unit in U is listed more than once in UF
UF is sometimes called the sampling frame.

Nonresponse - missing data
Some persons cannot be contacted.
Some refuse to participate in the survey.
Some may be ill and incapable of responding.
In postal surveys there can be as much as 70% nonresponse.
In telephone surveys 50% nonresponse is not uncommon.
Possible consequences: bias in the sample, which is then not representative of the population; estimation becomes more inaccurate.
Remedies: imputation, weighting.

Measurement error – the correct value of yi is not measured
In interviewer surveys: incorrect marking; interviewer effect (people may say what they think the interviewer wants to hear, e.g. underreporting of alcohol use and tobacco use); misunderstanding of the question; respondents do not remember correctly.

Sampling «error»
The error (uncertainty, tolerance) caused by observing a sample instead of the whole population.
To assess this error (margin of error): measure the sample-to-sample variation.
The design approach deals with calculating sampling errors for different sampling designs.
One such measure is the 95% confidence interval: if we draw repeated samples, then approximately 95% of the calculated confidence intervals for a total t will actually include t.

The first 3 errors: nonsampling errors
Can be much larger than the sampling error.
In this course we consider: sampling error and nonresponse bias.
We shall assume that the frame population is identical to the target population, and that there is no measurement error.

Summary of basic concepts
Population, target population; unit; sample; sampling design; estimation; estimator; measure of bias; measure of variance; confidence interval

Survey errors: register/frame population; measurement error; nonresponse; sampling error

Example – Psychiatric Morbidity Survey 1993 from Great Britain
Aim: Provide information about the prevalence of psychiatric problems among adults in GB, as well as their associated social disabilities and use of services.
Target population: Adults aged 16-64 living in private households.
Sample: Through several stages: 18,000 addresses were chosen and 1 adult in each household was chosen. 200 interviewers, each visiting 90 households.

Result of the sampling process
Sample of addresses: 18,000
  Vacant premises
  Institutions/business premises
  Demolished
  Second home/holiday flat
Private household addresses: 15,765
  Extra households found
Total private households: 16,434
  Households with no one ,704
Eligible households: ,730
  Nonresponse: ,622
Sample: ,108 households with responding adults aged 16-64

Why sampling?
It reduces costs for an acceptable level of accuracy (money, manpower, processing time...).
It may free up resources to reduce nonsampling error and collect more information from each person in the sample.
Ex: 400 interviewers at $5 per interview: lower sampling error. 200 interviewers at $10 per interview: lower nonsampling error.
It gives much quicker results.

When is sample representative ?
Balance on gender and age: the proportion of women in the sample is close to the proportion in the population, and the proportions of age groups in the sample are close to the proportions in the population.
An ideal representative sample: a miniature version of the population, implying that every unit in the sample represents the characteristics of a known number of units in the population.
Appropriate probability sampling ensures a representative sample ”on the average”.

Alternative approaches for statistical inference based on survey sampling
Design-based: No modeling, only stochastic element is the sample s with known distribution Model-based: The values yi are assumed to be values of random variables Yi: Two stochastic elements: Y = (Y1, …,YN) and s Assumes a parametric distribution for Y Example : suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Yi on xi.

Statistical principles of inference imply that the model-based approach is the most sound and valid approach.
We start with the design-based approach since it is the most applied approach to survey sampling, used by national statistical institutes and most research institutes for social sciences. It is also the easy way out: there is no need to model. All statisticians working with survey sampling in practice need to know this approach.

Design-based statistical inference
Can also be viewed as a distribution-free nonparametric approach The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U={1, ..., N} No explicit statistical modeling is done for the variable y. All yi’s are considered fixed but unknown Focus on sampling error Sets the sample survey theory apart from usual statistical analysis The traditional approach, started by Neyman in 1934

Estimation theory-simple random sample
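SRS of size n: each sample s of size n has probability $p(s) = 1/\binom{N}{n}$. It can in principle be performed by drawing one unit at a time at random without replacement.
Estimation of the population mean of a variable y: $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$
A natural estimator is the sample mean: $\bar{y}_s = \frac{1}{n}\sum_{i \in s} y_i$
Desirable properties: the estimator should be unbiased and have small sampling variance.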

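The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE).
Some results for SRS: $E(\bar{y}_s) = \mu$ and $\mathrm{Var}(\bar{y}_s) = \frac{\sigma^2}{n}(1-f)$, where $f = n/N$ is the sampling fraction and $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i-\mu)^2$ is the population variance.

The finite population correction 1 - f is usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 - f = 0.998
n = 1000 and N = 400,000: 1 - f = 0.9975
n = 1000 and N = 5,000,000: 1 - f = 0.9998
The effect of changing n is much more important than the effect of changing n/N.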

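The estimated variance
$\hat{V}(\bar{y}_s) = \frac{s^2}{n}(1-f)$, where $s^2 = \frac{1}{n-1}\sum_{i \in s}(y_i-\bar{y}_s)^2$ is the sample variance.
Usually we report the standard error of the estimate: $SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}$
Confidence intervals for $\mu$ are based on the Central Limit Theorem: an approximate 95% CI is $\bar{y}_s \pm 1.96\,SE(\bar{y}_s)$.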
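As a minimal sketch of the SRS calculation: the standard error is the square root of the sample variance divided by n, times the finite population correction 1 - n/N, and the interval is the estimate plus or minus a normal quantile times the SE. The function name srs_mean_ci and the sample values below are hypothetical, chosen only for illustration; N = 6194 is borrowed from the California schools example.

```r
# Estimate, standard error and approximate CI for a population mean under SRS.
# y_s: observed sample values, N: population size (both supplied by the user)
srs_mean_ci <- function(y_s, N, level = 0.95) {
  n <- length(y_s)
  ybar <- mean(y_s)
  se <- sqrt(var(y_s) / n * (1 - n / N))   # includes finite population correction
  z <- qnorm(1 - (1 - level) / 2)          # about 1.96 for level = 0.95
  c(estimate = ybar, se = se, lower = ybar - z * se, upper = ybar + z * se)
}

# Hypothetical sample values (not taken from the real data set)
y_s <- c(520, 610, 700, 655, 480, 730, 590, 640)
res <- srs_mean_ci(y_s, N = 6194)
print(res)
```

With a sample this small the normal approximation is rough, which is in line with the simulated confidence levels below 95% reported for small n.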

Example – Student performance in California schools
Academic Performance Index (API) for all California schools Based on standardized testing of students Data from all schools with at least 100 students Unit in population = school (Elementary/Middle/High) Full population consists of N = 6194 observations Concentrate on the variable: y = api00 = API in 2000 Mean(y) = with min(y) =346 and max(y) =969 Data set in R: apipop and y= apipop$api00

Histogram of y population with fitted normal density

Histogram for sample mean and fitted normal density
y = api scores. Sample size n = 10, based on 10000 simulations.
R-code:
    b = 10000                  # number of simulated samples
    N = 6194
    n = 10
    ybar = numeric(b)
    for (k in 1:b) {
      s = sample(1:N, n)       # SRS without replacement
      ybar[k] = mean(y[s])
    }
    hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
    x = seq(mean(ybar) - 4*sqrt(var(ybar)), mean(ybar) + 4*sqrt(var(ybar)), 0.05)
    z = dnorm(x, mean(ybar), sqrt(var(ybar)))
    lines(x, z)

Histogram and fitted normal density, api scores. Sample size n = 10, based on simulations.

y = api00 for 6194 California schools
10000 simulations of SRS. Confidence level of the approximate 95% CI:

n       Conf. level
10      0.915
30      0.940
50      0.943
100     0.947
1000    0.949
2000    0.951

For one sample of size n = 100
R-code:
    s = sample(1:6194, 100)
    ybar = mean(y[s])
    # SE with finite population correction: s^2 * (N - n) / (N * n)
    se = sqrt(var(y[s]) * (6194 - 100) / (6194 * 100))
    ybar
    var(y[s])
    se

The absolute value of the sampling error is not informative unless it is related to the value of the estimate
For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.
The coefficient of variation for the estimate: $CV(\hat\theta) = SE(\hat\theta)/\hat\theta$
A measure of the relative variability of an estimate.
It does not depend on the unit of measurement.
More stable over repeated surveys; can be used for planning, for example determining sample size.
More meaningful when estimating proportions.

Estimation of a population proportion p with a certain characteristic A
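p = (number of units in the population with A)/N
Let $y_i = 1$ if unit i has characteristic A, 0 otherwise. Then p is the population mean of the $y_i$'s.
Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as $\hat{p} = \bar{y}_s = X/n$.

So the unbiased estimate of the variance of the estimator is $\hat{V}(\hat{p}) = (1-f)\,\frac{\hat{p}(1-\hat{p})}{n-1}$, since $s^2 = \frac{n}{n-1}\hat{p}(1-\hat{p})$ for 0/1 data.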

Examples
A political poll: Suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is $\hat{p} = 280/1000 = 0.28$.
A confidence interval requires the normal approximation. When N - n is large we can use a guideline from the binomial distribution; a common rule of thumb is $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$.

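In this example: n = 1000 and N = 4,000,000, so $SE(\hat{p}) = \sqrt{(1-f)\,\hat{p}(1-\hat{p})/(n-1)} \approx 0.0142$ and the approximate 95% CI is $0.28 \pm 0.028$.
Ex: Psychiatric Morbidity Survey 1993 from Great Britain. p = proportion with psychiatric problems; n = 9792 (partial nonresponse on this question: 316); N ≈ 40,000,000.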
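The poll calculation can be reproduced in a few lines. This is a sketch using the numbers stated in the example (n = 1000, N = 4,000,000, 280 Labor voters); the variable names are illustrative, and the variance estimate is the usual SRS one with n - 1 in the denominator and the finite population correction.

```r
# Political poll example: SRS of n voters out of N, x say "Labor"
n <- 1000
N <- 4000000
x <- 280

p_hat <- x / n                                   # estimated proportion
se <- sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se       # approximate 95% CI

print(round(c(p_hat = p_hat, se = se, lower = ci[1], upper = ci[2]), 4))
```

The finite population correction changes almost nothing here, since n/N is only 0.00025.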

General probability sampling
Sampling design: p(s), the known probability of selection for each subset s of the population U.
Actually, the sampling design is the probability distribution p(.) over all subsets of U.
Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
The first-order inclusion probability: $\pi_i = P(i \in s) = \sum_{s \ni i} p(s)$
The second-order selection or inclusion probability: $\pi_{ij} = P(i \in s \text{ and } j \in s) = \sum_{s \ni i,j} p(s)$

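Notation
Let $\theta$ denote any characteristic of a population (such as t, $\mu$, or p), and let $\hat\theta(S)$ (or $\hat\theta$ for short) denote any estimator of $\theta$ calculated from the sample S.
The distribution of $\hat\theta$ over all possible samples is called the sampling distribution of $\hat\theta$.
The expected value of $\hat\theta$ is $E(\hat\theta) = \sum_s p(s)\,\hat\theta(s)$.
The sampling bias of $\hat\theta$ is $B(\hat\theta) = E(\hat\theta) - \theta$.
The variance of $\hat\theta$ is $\mathrm{Var}(\hat\theta) = \sum_s p(s)\,[\hat\theta(s) - E(\hat\theta)]^2$.
(After this: run Example 2.1 with a Bernoulli design.)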

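Illustration
U = {1,2,3,4}. Samples of size 2: 6 possible samples.
Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8; the remaining samples {1,3} and {2,4} have probability 0.
The inclusion probabilities: $\pi_1 = 1/2 + 1/8 = 5/8$, $\pi_2 = 1/2 + 1/4 = 3/4$, $\pi_3 = 1/4 + 1/8 = 3/8$, $\pi_4 = 1/8 + 1/8 = 1/4$.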
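The inclusion probabilities for this four-unit illustration can be computed directly from the design by summing p(s) over the samples that contain a unit (or a pair of units). The sketch below uses only the four samples with positive probability given above; the variable names are illustrative.

```r
# Design from the illustration: samples with positive probability and their p(s)
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p_s <- c(1/2, 1/4, 1/8, 1/8)
N <- 4

# First-order inclusion probabilities: pi_i = sum of p(s) over samples containing i
pi_i <- sapply(1:N, function(i) {
  sum(p_s[sapply(samples, function(s) i %in% s)])
})

# Second-order inclusion probabilities: pi_ij = sum of p(s) over samples
# containing both i and j (each sample here is exactly one pair)
pi_ij <- matrix(0, N, N)
for (k in seq_along(samples)) {
  s <- samples[[k]]
  pi_ij[s[1], s[2]] <- pi_ij[s[1], s[2]] + p_s[k]
  pi_ij[s[2], s[1]] <- pi_ij[s[2], s[1]] + p_s[k]
}

print(pi_i)        # 0.625 0.750 0.375 0.250
print(sum(pi_i))   # equals the fixed sample size 2
print(pi_ij)
```

Note that the pairs (1,3) and (2,4) have second-order inclusion probability 0 under this design, which matters for variance estimation.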

Some results

Estimation theory probability sampling in general
Problem: Estimate a population quantity for the variable y.
For the sake of illustration: the population total $t = \sum_{i=1}^{N} y_i$

CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases Because, typically we have that

Determining sample size
The sample size has a decisive effect on the cost of the survey.
How large n should be depends on the purpose of the survey.
In a poll for determining voting preference, n = 1000 is typically enough.
In the quarterly labor force survey in Norway, n = 24000.
Mainly three factors to consider:
1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest.
2. Homogeneity of the population. Smaller samples are needed if there is little variation in the population.
3. Estimation for subgroups (domains) of the population.

It is often factor 3 that puts the highest demand on the survey
If we want to estimate totals for domains of the population we should take a stratified sample A sample from each domain A stratified random sample: From each domain a simple random sample

Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion $\hat{p}$ from the stratum to estimate p.
Let n be the sample size of this stratum, and assume that n/N is negligible.
Desired accuracy for this stratum: the 95% CI for p should be $\hat{p} \pm 0.05$.
The accuracy requirement: $1.96\sqrt{\hat{p}(1-\hat{p})/n} \le 0.05$, i.e. $n \ge 1536\,\hat{p}(1-\hat{p})$.

The estimate $\hat{p}$ is unknown in the planning phase.
Use the conservative size n = 384 (obtained with $\hat{p} = 0.5$) or a planning value p0 with n = 1536 p0(1 - p0). For example, with p0 = 0.2: n = 246.
In general, with accuracy requirement d, i.e. 95% CI $\hat{p} \pm d$: $n \ge 1.96^2\,p_0(1-p_0)/d^2$.
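A small helper makes the sample size rule easy to explore. This is a sketch under the assumptions of the slide (n/N negligible, normal approximation, 95% level); the function name n_for_proportion is hypothetical. It returns the unrounded value of z^2 p0 (1 - p0) / d^2 so that the figures 384 and 246 quoted above can be recognised.

```r
# Required sample size so that the 95% CI for a proportion is p_hat +/- d,
# using planning value p0 (p0 = 0.5 is the conservative choice)
n_for_proportion <- function(d, p0 = 0.5, z = 1.96) {
  z^2 * p0 * (1 - p0) / d^2
}

n_conservative <- n_for_proportion(d = 0.05)            # about 384
n_planning     <- n_for_proportion(d = 0.05, p0 = 0.2)  # about 246

print(c(conservative = n_conservative, planning = n_planning))
```

In practice one would round up to the next integer; halving d multiplies the required n by four.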

With e = 0.1, then we require approximately that

Example: Monthly unemployment rate
Important to detect changes in unemployment rates from month to month planning value p0 = 0.05

Generally, if we want to estimate the population mean $\mu$:
n depends on how large the y-variation $\sigma$ is in the population. If we use the coefficient of variation as the measure of accuracy and the sample mean as the estimate: $CV(\bar{y}_s) \approx \frac{\sigma/\mu}{\sqrt{n}}$, so $n \approx (\sigma/\mu)^2 / CV^2$.
The table shows how n varies with $\sigma/\mu$ for a given requirement on CV (N is so large that the finite population correction can be ignored):

CV       σ/μ = 0.1    σ/μ = 0.25    σ/μ = 0.5
0.025    16           100           400
0.05     4            25            100
0.10     1            7             25
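The table entries follow from requiring (σ/μ)/sqrt(n) to be at most the target CV, ignoring the finite population correction, which gives n = (σ/μ)^2 / CV^2 rounded up. A minimal sketch that regenerates the table (variable names are illustrative; the small rounding step guards against floating-point noise before taking the ceiling):

```r
# Target coefficients of variation (rows) and relative population sd sigma/mu (columns)
cv_target <- c(0.025, 0.05, 0.10)
rel_sd <- c(0.1, 0.25, 0.5)

# n = (sigma/mu)^2 / CV^2, rounded up to the next integer
n_table <- outer(cv_target, rel_sd, function(cv, r) ceiling(round((r / cv)^2, 6)))
dimnames(n_table) <- list(CV = cv_target, rel_sd = rel_sd)

print(n_table)
```

Halving the target CV quadruples the required sample size, as does doubling σ/μ.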

Estimation theory We are interested in different kinds of parameters in a population U of fixed size N The parameters may be totals, Mean values, Proportions, (note that a proportion is just a special case of the mean value) A ratio between two totals, (note that the mean is just a special case of a ratio)

Two basic estimators: Horvitz-Thompson estimator and ratio estimator
Ratio estimator in simple random samples.
H-T estimator for unequal probability sampling: the inclusion probabilities are unequal.
The goal is to estimate a population total t for a variable y.

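Now suppose that our parameter of interest is a population total, $t = \sum_{i=1}^{N} y_i$.
An unbiased estimator of this parameter is $\hat{t}_{HT} = \sum_{i \in s} y_i/\pi_i$. This is called the Horvitz-Thompson (HT) or the $\pi$ estimator of the total t.
The HT estimator of the population mean is $\hat{t}_{HT}/N$.
Under SRS, $\pi_i = n/N$ for all i, so $\hat{t}_{HT} = N\bar{y}_s$ and the estimator of the mean is the sample mean $\bar{y}_s$.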
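A short sketch of the HT estimator as a weighted sum, where each sampled value is divided by its inclusion probability. The function name ht_total, the sample values and the unequal inclusion probabilities are all hypothetical. The SRS case confirms that the estimator reduces to N times the sample mean.

```r
# Horvitz-Thompson estimator of a total: sum of y_i / pi_i over the sample
ht_total <- function(y_s, pi_s) {
  stopifnot(length(y_s) == length(pi_s), all(pi_s > 0), all(pi_s <= 1))
  sum(y_s / pi_s)
}

# Hypothetical sample of n = 4 values from a population of N = 100
N <- 100
y_s <- c(3, 5, 2, 6)
n <- length(y_s)

# Under SRS every unit has pi_i = n / N, so HT equals N * (sample mean)
t_srs <- ht_total(y_s, rep(n / N, n))
t_expansion <- N * mean(y_s)

# Hypothetical unequal inclusion probabilities for the same four units
pi_unequal <- c(0.02, 0.05, 0.01, 0.08)
t_unequal <- ht_total(y_s, pi_unequal)

print(c(srs = t_srs, expansion = t_expansion, unequal = t_unequal))
```

Each weight 1/π_i can be read as the number of population units that the sampled unit represents.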

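The variance of the HT estimator
The variance of the estimated total under SRS without replacement is $\mathrm{Var}(\hat{t}_{HT}) = N^2(1-f)\,\frac{\sigma^2}{n}$, where $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i-\mu)^2$ and $f = n/N$.

With SRS the variance is estimated by $\hat{V}(\hat{t}_{HT}) = N^2(1-f)\,\frac{s^2}{n}$, where $s^2 = \frac{1}{n-1}\sum_{i \in s}(y_i-\bar{y}_s)^2$.
The variance of the sample mean is obtained by dividing by $N^2$: $\mathrm{Var}(\bar{y}_s) = (1-f)\,\frac{\sigma^2}{n}$, which is estimated by $(1-f)\,\frac{s^2}{n}$.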

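Ratio estimator
Suppose we have known auxiliary information x for the whole population. Ex: age, gender, education, employment status.
The ratio estimator for the y-total t: $\hat{t}_R = t_x\,\frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i} = t_x\,\frac{\bar{y}_s}{\bar{x}_s}$, where $t_x = \sum_{i=1}^{N} x_i$ is the known x-total.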

We can express the ratio estimator in the following form: $\hat{t}_R = N\bar{y}_s \cdot \frac{\bar{x}}{\bar{x}_s}$, where $\bar{x} = t_x/N$ is the population mean of x.
It adjusts the usual “sample mean estimator” in the cases where the x-values in the sample are too small or too large. This is reasonable if there is a positive correlation between x and y.
Example: University of 4000 students, SRS of 400. Estimate the total number t of women who are planning a career in teaching: t = Np, where p is the proportion. Let $y_i = 1$ if student i is a woman planning to be a teacher; t is the y-total.

Results: 84 out of 240 women in the sample plan to be a teacher.
HOWEVER: It was noticed that the university has 2700 women (67.5%), while in the sample we had 60% women. A better estimate that corrects for the underrepresentation of women is obtained by the ratio estimate using the auxiliary variable x = 1 if the student is a woman.
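The two competing estimates in the university example can be worked out from the numbers given in the text (N = 4000, n = 400, 240 women in the sample of whom 84 plan to teach, 2700 women in the population). The sketch below computes the sample-mean-based estimate and the ratio estimate, which is the known x-total times the ratio of sample totals; variable names are illustrative.

```r
# University example: y = 1 if a woman planning to teach, x = 1 if a woman
N <- 4000      # students in the population
n <- 400       # SRS size
sum_y_s <- 84  # women in the sample planning to teach
sum_x_s <- 240 # women in the sample
t_x <- 2700    # known number of women in the population

# Sample-mean-based (expansion) estimate: N * ybar_s
t_hat_mean <- N * sum_y_s / n

# Ratio estimate: t_x * (sum of y in sample) / (sum of x in sample)
t_hat_ratio <- t_x * sum_y_s / sum_x_s

print(c(expansion = t_hat_mean, ratio = t_hat_ratio))   # 840 and 945
```

The ratio estimate is larger because the sample contains proportionally fewer women (60%) than the population (67.5%), and the adjustment factor is 0.675/0.60 = 1.125.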

In business surveys it is very common to use a ratio estimator.
Ex: yi = amount spent on health insurance by business i xi = number of employees in business i We shall now do a comparison between the ratio estimator and the sample mean based estimator. We need to derive expectation and variance for the ratio estimator

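First: we must define the population covariance $\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)$
The population correlation coefficient: $\rho = \frac{\sigma_{xy}}{\sigma_x\,\sigma_y}$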

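It follows that, with $\hat{R} = \bar{y}_s/\bar{x}_s$, $\frac{|\mathrm{Bias}(\hat{R})|}{\sqrt{\mathrm{Var}(\hat{R})}} \le CV(\bar{x}_s)$.
Hence, in SRS, if the coefficient of variation of the x-sample mean is small, the absolute bias of the ratio estimator is small relative to the true SE of the estimator. This is certainly true for large n.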

Note: The ratio estimator is very precise when the population points (yi , xi) lie close around a straight line thru the origin with slope R. The ratio estimator is a special case of a Generalised regression estimator (GREG) which covers more (yi , xi) situations, including multivariate and is usually better. However the ratio estimator is computationally attractive.

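Estimated variance for the ratio estimator
For the ratio estimate of the mean, $\hat\mu_R = r\,\bar{x}$ with $r = \bar{y}_s/\bar{x}_s$: $\hat{V}(\hat\mu_R) = \left(\frac{\bar{x}}{\bar{x}_s}\right)^2 (1-f)\,\frac{s_r^2}{n}$, where $s_r^2 = \frac{1}{n-1}\sum_{i \in s}(y_i - r x_i)^2$. For the total $\hat{t}_R = N\hat\mu_R$, multiply by $N^2$.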

For large n and N - n: approximate normality holds, and an approximate 95% confidence interval is given by $\hat{t}_R \pm 1.96\sqrt{\hat{V}(\hat{t}_R)}$.

R-computing of estimate, variance estimate and confidence interval
    y = apipop$api00
    x = apipop$col.grad    # col.grad = percent of parents with college degree
    # calculating the ratio estimator:
    s = c(20, 2000, 3900, 5000)
    N = 6194
    n = 4
    r = mean(y[s]) / mean(x[s])
    # ratio estimate of t/N:
    muhatr = r * mean(x)
    muhatr
    # variance estimate:
    ssqr = (1/(n-1)) * sum((y[s] - r*x[s])^2)
    varestr = (mean(x)/mean(x[s]))^2 * (1 - n/N) * ssqr / n
    ser = sqrt(varestr)
    ser
    # confidence interval:
    CI = muhatr + qnorm(c(0.025, 0.975)) * ser
    CI

y = api00 for 6194 California schools
10000 simulations of SRS. Confidence level of the approximate 95% CI based on the ratio estimator:

n       Conf. level
10      0.927
30      0.946
50
100     0.947
1000
2000    0.948

R-code for simulations to estimate true confidence level of 95% CI, based on the ratio estimator
    simtratio = function(b, n, N) {
      # x and y are the population vectors defined earlier
      r = numeric(b)
      ssqr = numeric(b)
      muhatr = numeric(b)
      se = numeric(b)
      for (k in 1:b) {
        s = sample(1:N, n)
        r[k] = mean(y[s]) / mean(x[s])
        muhatr[k] = r[k] * mean(x)
        ssqr[k] = (1/(n-1)) * sum((y[s] - r[k]*x[s])^2)
        se[k] = sqrt((mean(x)/mean(x[s]))^2 * (1 - n/N) * ssqr[k] / n)
      }
      # share of the b intervals that cover the true mean
      (sum(mean(y) < muhatr + 1.96*se) - sum(mean(y) < muhatr - 1.96*se)) / b
    }

Unequal probability sampling
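Inclusion probabilities: $\pi_i = P(i \in s)$ may differ between units.
Example: Psychiatric Morbidity Survey. One adult was selected from each sampled household, so an individual's inclusion probability depends on the number of eligible adults in the household.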

Horvitz-Thompson estimator- unequal probability sampling
Let's try to use the unweighted sample mean $\bar{y}_s$ (or $N\bar{y}_s$ for the total). The bias is large if the inclusion probabilities tend to increase or decrease systematically with $y_i$.

Use weighting to correct for bias: $\hat{t}_{HT} = \sum_{i \in s} y_i/\pi_i$

The Horvitz-Thompson estimator is widely used, for example in official statistics.

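Note that the variance is small if we determine the inclusion probabilities such that $\pi_i$ is approximately proportional to $y_i$, so that $y_i/\pi_i$ is nearly constant.
Of course, we do not know the value of $y_i$ when planning the survey. Instead, use a known auxiliary variable $x_i$ and choose $\pi_i \propto x_i$, e.g. $\pi_i = n x_i / t_x$.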
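A minimal sketch of choosing inclusion probabilities proportional to a known size variable, assuming a fixed sample size n and the scaling π_i = n x_i / Σx (so that the probabilities sum to n). The function name pps_inclusion and the x-values are hypothetical. The last lines illustrate why this helps: if y were exactly proportional to x, every term y_i/π_i would be the same.

```r
# Inclusion probabilities proportional to a known auxiliary (size) variable x
pps_inclusion <- function(x, n) {
  stopifnot(all(x > 0))
  pi_i <- n * x / sum(x)
  if (any(pi_i > 1)) {
    warning("some pi_i exceed 1; very large units need separate treatment")
  }
  pi_i
}

# Hypothetical size measures for a population of four units, sample size n = 2
x <- c(10, 20, 30, 40)
pi_i <- pps_inclusion(x, n = 2)
print(pi_i)        # 0.2 0.4 0.6 0.8
print(sum(pi_i))   # equals n

# If y is exactly proportional to x, all y_i / pi_i coincide,
# so the HT estimate does not vary from sample to sample
y <- 3 * x
print(y / pi_i)
```

In real surveys y is only roughly proportional to x, so the variance is reduced rather than eliminated.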

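Variance estimate for H-T estimator
Assume the size of the sample is determined in advance to be n. Then $\mathrm{Var}(\hat{t}_{HT}) = \sum_{i<j}(\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$, with the sum over all pairs in U, and it is estimated by $\hat{V}(\hat{t}_{HT}) = \sum_{i<j,\; i,j \in s}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$.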

Can always compute the variance estimate!!
This is because, necessarily, $\pi_{ij} > 0$ for all i, j in the sample s. But if not all $\pi_{ij} > 0$ in the population, this estimate should not be used: it can give very incorrect estimates. The variance estimate can be negative, but for most sampling designs it is always positive.
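As a sanity check on a pairwise variance estimate for fixed-size designs, the sketch below implements the Sen-Yates-Grundy form, which sums (π_i π_j - π_ij)/π_ij times the squared difference of y_i/π_i and y_j/π_j over pairs in the sample. This form is an assumption about which estimator is meant here. Under SRS, where π_i = n/N and π_ij = n(n-1)/(N(N-1)), it should reproduce the familiar N^2 (1 - n/N) s^2 / n. The function name syg_var and the data are hypothetical.

```r
# Sen-Yates-Grundy variance estimate for the HT total (fixed sample size).
# y_s: sample values, pi_s: first-order inclusion probabilities of sampled units,
# pi_mat: matrix of second-order inclusion probabilities for sampled pairs
syg_var <- function(y_s, pi_s, pi_mat) {
  n <- length(y_s)
  v <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      v <- v + (pi_s[i] * pi_s[j] - pi_mat[i, j]) / pi_mat[i, j] *
        (y_s[i] / pi_s[i] - y_s[j] / pi_s[j])^2
    }
  }
  v
}

# Hypothetical SRS of n = 5 from N = 50
N <- 50
y_s <- c(12, 7, 9, 15, 10)
n <- length(y_s)
pi_s <- rep(n / N, n)
pi_mat <- matrix(n * (n - 1) / (N * (N - 1)), n, n)

v_syg <- syg_var(y_s, pi_s, pi_mat)
v_srs <- N^2 * (1 - n / N) * var(y_s) / n   # textbook SRS formula

print(c(syg = v_syg, srs = v_srs))          # the two should agree
```

The division by π_ij is why the estimate breaks down for designs where some pairs can never be sampled together.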
