1 STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography Surveys for households and individuals.

Presentation transcript:

1 STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography Surveys for households and individuals

2 Survey sampling: 4 major topics
1. Traditional design-based statistical inference (7 weeks)
2. Likelihood considerations (1 week)
3. Model-based statistical inference (3 weeks)
4. Missing data and nonresponse (2 weeks)

3 Statistical demography (2 weeks): mortality, life expectancy, population projections

4 Course goals
Give students knowledge about:
– planning surveys in social sciences
– major sampling designs
– basic concepts and the most important estimation methods in traditional applied survey sampling
– the likelihood principle and its consequences for survey sampling
– use of modeling in sampling
– treatment of nonresponse
– basic demography

5 But first: Basic concepts in sampling
Population (target population): the universe of all units of interest for a certain study. With N being the size of the population, it is denoted U = {1, 2, ..., N}. All units can be identified and labeled.
Ex: Political poll – all adults eligible to vote
Ex: Employment/unemployment in Norway – all persons in Norway aged 15 or more
Ex: Consumer expenditure – unit = household
Sample: a subset of the population, to be observed. The sample should be "representative" of the population.

6 Sampling design
The sample is a probability sample if all units in the sample have been chosen with certain probabilities, and such that each unit in the population has a positive probability of being chosen for the sample. We shall only be concerned with probability sampling.
Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen for the sample.
The probability distribution for SRS on all subsets of U is an example of a sampling design: the probability plan p(s) for selecting a sample s from the population.
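The SRS design can be written out explicitly. The slide's own equation is not in the transcript, so the block below is a reconstruction in standard notation; $|s|$ (the number of units in s) is notation assumed here.

```latex
p(s) =
\begin{cases}
  1 \big/ \binom{N}{n} & \text{if } |s| = n, \\[4pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
P(i \in s) = \binom{N-1}{n-1} \Big/ \binom{N}{n} = \frac{n}{N}.
```

The second identity counts the samples of size n that contain unit i, which gives the inclusion probability n/N stated on the slide.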

7 Basic statistical problem: Estimation
A typical survey has many variables of interest. The aim of a sample is to obtain information regarding totals or averages of these variables for the whole population.
Example: Unemployment in Norway – we want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway, let $y_i = 1$ if person i is unemployed and $y_i = 0$ otherwise, so that t is the sum of the $y_i$.

8 In general, the variable of interest is y, with $y_i$ equal to the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^{N} y_i$. The typical problem is to estimate t or t/N. Sometimes it is also of interest to estimate ratios of totals. Example: estimating the rate of unemployment.
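One way to write the unemployment rate as a ratio of two totals is sketched below. The formula on the slide is missing from the transcript, so the labor-force indicator $x_i$ and the symbols $t_y$, $t_x$ and $R$ are assumptions made for illustration.

```latex
y_i = \begin{cases} 1 & \text{if person } i \text{ is unemployed} \\ 0 & \text{otherwise} \end{cases}
\qquad
x_i = \begin{cases} 1 & \text{if person } i \text{ is in the labor force} \\ 0 & \text{otherwise} \end{cases}

t_y = \sum_{i=1}^{N} y_i, \qquad
t_x = \sum_{i=1}^{N} x_i, \qquad
R = \frac{t_y}{t_x}.
```

Both totals are unknown and must be estimated from the same sample, which is why ratios need separate treatment from single totals.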

9 Sources of error in sample surveys
1. Target population U vs frame population $U_F$
Access to the population is through a list of units, a register $U_F$. U and $U_F$ may not be the same. Three possible errors in $U_F$:
– Undercoverage: some units in U are not in $U_F$
– Overcoverage: some units in $U_F$ are not in U
– Duplicate listings: a unit in U is listed more than once in $U_F$
$U_F$ is sometimes called the sampling frame.

10 2. Nonresponse – missing data
Some persons cannot be contacted. Some refuse to participate in the survey. Some may be ill and incapable of responding.
In postal surveys there can be as much as 70% nonresponse. In telephone surveys 50% nonresponse is not uncommon.
Possible consequences:
– Bias in the sample, which is then not representative of the population
– Estimation becomes more inaccurate
Remedies:
– imputation, weighting

11 3. Measurement error – the correct value of $y_i$ is not measured
In interviewer surveys:
– incorrect marking
– interviewer effect: people may say what they think the interviewer wants to hear, e.g. underreporting of alcohol use and tobacco use
– misunderstanding of the question, or not remembering correctly

12 4. Sampling «error»
– The error (uncertainty, tolerance) caused by observing a sample instead of the whole population
– To assess this error (the margin of error): measure sample-to-sample variation
– The design approach deals with calculating sampling errors for different sampling designs
– One such measure is the 95% confidence interval: if we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t

13 The first 3 errors are nonsampling errors
– They can be much larger than the sampling error
In this course:
– Sampling error
– Nonresponse bias
– We shall assume that the frame population is identical to the target population
– No measurement error

14 Summary of basic concepts
Population, target population; unit; sample; sampling design; estimation:
– estimator
– measure of bias
– measure of variance
– confidence interval

15 Survey errors:
– register/frame population
– measurement error
– nonresponse
– sampling error

16 Example – Psychiatric Morbidity Survey 1993 from Great Britain
Aim: Provide information about the prevalence of psychiatric problems among adults in GB, as well as their associated social disabilities and use of services.
Target population: Adults aged 16-64 living in private households.
Sample: Through several stages: 18,000 addresses were chosen and 1 adult in each household was chosen. 200 interviewers, each visiting 90 households.

17 Result of the sampling process
Sample of addresses: 18,000
Vacant premises: 927
Institutions/business premises: 573
Demolished: 499
Second home/holiday flat: 236
Private household addresses: 15,765
Extra households found: 669
Total private households: 16,434
Households with no one aged 16-64: 3,704
Eligible households: 12,730
Nonresponse: 2,622
Sample: 10,108 households with responding adults aged 16-64

18 Why sampling?
– reduces costs for an acceptable level of accuracy (money, manpower, processing time, ...)
– may free up resources to reduce nonsampling error and collect more information from each person in the sample
  ex: 400 interviewers at $5 per interview: lower sampling error; 200 interviewers at $10 per interview: lower nonsampling error
– much quicker results

19 When is a sample representative?
Balance on gender and age:
– proportion of women in the sample ≈ proportion in the population
– proportions of age groups in the sample ≈ proportions in the population
An ideal representative sample:
– a miniature version of the population
– implying that every unit in the sample represents the characteristics of a known number of units in the population
Appropriate probability sampling ensures a representative sample "on the average".

20 Alternative approaches for statistical inference based on survey sampling
Design-based:
– No modeling; the only stochastic element is the sample s, with known distribution
Model-based: the values $y_i$ are assumed to be values of random variables $Y_i$:
– Two stochastic elements: $Y = (Y_1, \ldots, Y_N)$ and s
– Assumes a parametric distribution for Y
– Example: suppose we have an auxiliary variable x, which could be age, gender or education. A typical model is a regression of $Y_i$ on $x_i$.

21 Statistical principles of inference imply that the model-based approach is the most sound and valid approach.
We start with the design-based approach, since it is the most applied approach to survey sampling, used by national statistical institutes and most research institutes for social sciences.
– It is the easy way out: there is no need to model.
All statisticians working with survey sampling in practice need to know this approach.

22 Design-based statistical inference
Can also be viewed as a distribution-free nonparametric approach.
The only stochastic element: the sample s, with distribution p(s) for all subsets s of the population U = {1, ..., N}.
No explicit statistical modeling is done for the variable y. All $y_i$'s are considered fixed but unknown.
Focus on sampling error.
Sets sample survey theory apart from usual statistical analysis.
The traditional approach, started by Neyman in 1934.

23 Estimation theory – simple random sample
Estimation of the population mean of a variable y: $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$
A natural estimator – the sample mean: $\bar{y}_s = \frac{1}{n}\sum_{i \in s} y_i$
Desirable properties:
SRS of size n: each sample s of size n has $p(s) = 1/\binom{N}{n}$. It can be performed in principle by drawing one unit at a time at random without replacement.

24 The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE): Some results for SRS:
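The formulas on this slide are missing from the transcript. The standard design-based results for the sample mean $\bar{y}_s$ under SRS are sketched below as a likely reconstruction; the symbols $\mu$ (population mean), $\sigma^2$ (population variance with divisor N − 1) and $f$ (sampling fraction) are notation assumed here.

```latex
E(\bar{y}_s) = \mu, \qquad
\mathrm{Var}(\bar{y}_s) = \frac{\sigma^2}{n}\,(1 - f), \qquad
f = \frac{n}{N}, \qquad
\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \mu)^2 .
```

The factor $1 - f$ is the finite population correction: it shrinks the variance toward 0 as the sample approaches a full census.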

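25 The finite population correction 1 − f, with sampling fraction f = n/N, is usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 − f = 0.998
n = 1000 and N = 400,000: 1 − f = 0.9975
n = 1000 and N = 5,000,000: 1 − f = 0.9998
The effect of changing n is much more important than the effect of changing n/N.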

26 The estimated variance. Usually we report the standard error of the estimate. Confidence intervals for $\mu$ are based on the Central Limit Theorem.
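The equations for this slide did not survive in the transcript. A reconstruction in standard notation follows, consistent with the standard-error expression used in the R code on slide 32; $s^2$ (sample variance) and the factor 1.96 (normal quantile for an approximate 95% interval) are assumptions about what the slide showed.

```latex
\hat{V}(\bar{y}_s) = (1 - f)\,\frac{s^2}{n}, \qquad
s^2 = \frac{1}{n-1}\sum_{i \in s} (y_i - \bar{y}_s)^2, \qquad
\mathrm{SE}(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}

\text{Approximate 95\% CI for } \mu: \quad
\bar{y}_s \pm 1.96 \cdot \mathrm{SE}(\bar{y}_s)
```

Since $(1 - f)/n = (N - n)/(Nn)$, the standard error can equally be computed as $\sqrt{s^2 (N - n)/(Nn)}$.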

27 Example – Student performance in California schools
Academic Performance Index (API) for all California schools, based on standardized testing of students.
Data from all schools with at least 100 students.
Unit in population = school (Elementary/Middle/High).
Full population consists of N = 6194 observations.
Concentrate on the variable y = api00 = API in 2000.
Mean(y) = with min(y) = 346 and max(y) = 969
Data set in R: apipop and y = apipop$api00

28 Figure: histogram of the population values of y with fitted normal density

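29 Figure: histogram for the sample mean and fitted normal density. y = API scores from 2000. Sample size n = 10, based on 10000 simulations.
R-code:
library(survey); data(api)   # assumes the survey package is installed; provides apipop
y <- apipop$api00
b <- 10000                   # number of simulated samples
N <- 6194                    # population size
n <- 10                      # sample size
ybar <- numeric(b)
for (k in 1:b) {
  s <- sample(1:N, n)        # SRS without replacement
  ybar[k] <- mean(y[s])
}
# histogram of the simulated sample means, bin width 5
hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
# overlay the normal density with the same mean and standard deviation
x <- seq(mean(ybar) - 4 * sd(ybar), mean(ybar) + 4 * sd(ybar), 0.05)
z <- dnorm(x, mean(ybar), sd(ybar))
lines(x, z)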

30 Figure: histogram and fitted normal density, API scores. Sample size n = 10, based on simulations

31 y = api00 for 6194 California schools. Simulations of SRS: confidence level of the approximate 95% CI, by sample size n (table columns: n, Conf. level)
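A simulation of this kind can be sketched as follows. This is a minimal illustration, not the code behind the slide's table: the helper `ci_coverage` is hypothetical, and the population `y_pop` is synthetic (an assumption, so that the sketch runs without the API data). With the `survey` package installed, `y_pop` could be replaced by `apipop$api00`.

```r
# Sketch: empirical coverage of the approximate 95% CI under SRS.
# ASSUMPTION: y_pop is a synthetic population, not the API data.
set.seed(1)
N <- 6194
y_pop <- 346 + 623 * rbeta(N, 2, 2)   # hypothetical scores between 346 and 969

# Hypothetical helper: share of b simulated SRS samples of size n
# whose interval ybar +/- 1.96 * se contains the true population mean.
ci_coverage <- function(y, n, b = 2000) {
  N <- length(y)
  mu <- mean(y)
  covered <- logical(b)
  for (k in seq_len(b)) {
    s <- sample(N, n)                           # SRS without replacement
    ybar <- mean(y[s])
    se <- sqrt(var(y[s]) * (N - n) / (N * n))   # sqrt((1 - f) * s^2 / n)
    covered[k] <- abs(ybar - mu) <= 1.96 * se
  }
  mean(covered)
}

sizes <- c(10, 30, 100)
coverage <- sapply(sizes, function(n) ci_coverage(y_pop, n))
print(data.frame(n = sizes, conf_level = round(coverage, 3)))
```

For small n the actual coverage typically falls somewhat below the nominal 95%, because the normal quantile 1.96 ignores the extra uncertainty from estimating the variance.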

32 For one sample of size n = 100
R-code:
s <- sample(1:6194, 100)                              # SRS of 100 schools
ybar <- mean(y[s])                                    # estimate of the population mean
se <- sqrt(var(y[s]) * (6194 - 100) / (6194 * 100))   # sqrt((1 - f) * s^2 / n)
ybar
var(y[s])
se

33 The coefficient of variation (CV) for the estimate
– A measure of the relative variability of an estimate. It does not depend on the unit of measurement.
– More stable over repeated surveys; can be used for planning, for example determining sample size.
– More meaningful when estimating proportions.
The absolute value of the sampling error is not informative when not related to the value of the estimate. For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.
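The defining formula is missing from the transcript. The usual definition is given below as a reconstruction, with $\hat{\theta}$ standing for a generic estimate (notation assumed here), followed by the slide's two numerical cases.

```latex
\mathrm{CV}(\hat{\theta}) = \frac{\mathrm{SE}(\hat{\theta})}{\hat{\theta}}

\mathrm{SE} = 2,\ \hat{\theta} = 1000: \quad \mathrm{CV} = \frac{2}{1000} = 0.2\%
\qquad
\mathrm{SE} = 2,\ \hat{\theta} = 3: \quad \mathrm{CV} = \frac{2}{3} \approx 67\%
```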

34 Estimation of a population proportion p with a certain characteristic A
p = (number of units in the population with A)/N
Let $y_i = 1$ if unit i has characteristic A, 0 otherwise. Then p is the population mean of the $y_i$'s.
Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as $\hat{p} = \bar{y}_s = X/n$.

35 So the unbiased estimate of the variance of the estimator:
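The formula itself is missing from the transcript. It follows from the general SRS variance estimate $(1 - f)\,s^2/n$ applied to 0/1 values, as the short derivation below shows; $\hat{p} = X/n$ is the sample proportion and $f = n/N$ the sampling fraction.

```latex
\sum_{i \in s} (y_i - \hat{p})^2
  = \sum_{i \in s} y_i^2 - n\hat{p}^2
  = n\hat{p} - n\hat{p}^2
  = n\hat{p}(1 - \hat{p})
  \qquad (\text{since } y_i^2 = y_i)

s^2 = \frac{n}{n-1}\,\hat{p}(1 - \hat{p}),
\qquad
\hat{V}(\hat{p}) = (1 - f)\,\frac{s^2}{n} = (1 - f)\,\frac{\hat{p}(1 - \hat{p})}{n - 1}.
```

Note the divisor n − 1 rather than n, which is what makes the estimate unbiased under SRS.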

36 Examples
A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by $\hat{p} = 280/1000 = 0.28$.
The confidence interval requires a normal approximation. Can use the guideline from the binomial distribution, when N − n is large:

37 In this example: n = 1000 and N = 4,000,000
Ex: Psychiatric Morbidity Survey 1993 from Great Britain. p = proportion with psychiatric problems. n = 9792 (partial nonresponse on this question: 316), N ≈ 40,000,000
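The interval for the poll example is not shown in the transcript. A minimal sketch of the calculation, assuming the variance estimate $(1 - f)\,\hat{p}(1 - \hat{p})/(n - 1)$ and the normal quantile 1.96, could look like this (variable names are illustrative):

```r
# Poll example: n = 1000 sampled voters, X = 280 Labor, N = 4,000,000.
n <- 1000
N <- 4e6
X <- 280

p_hat <- X / n                                        # estimated proportion
f <- n / N                                            # sampling fraction
se <- sqrt((1 - f) * p_hat * (1 - p_hat) / (n - 1))   # standard error of p_hat
ci <- p_hat + c(-1, 1) * 1.96 * se                    # approximate 95% CI

print(round(c(p_hat = p_hat, se = se, lower = ci[1], upper = ci[2]), 4))
```

With f = 0.00025 the finite population correction has almost no effect here, in line with the remark on slide 25.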

38 General probability sampling
Sampling design: p(s), the known probability of selection for each subset s of the population U. Actually, the sampling design is the probability distribution p(·) over all subsets of U. Typically, for most s, p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
The inclusion probability of unit i is the probability that i is in the sample: $\pi_i = P(i \in s) = \sum_{s:\, i \in s} p(s)$

39 Illustration
U = {1, 2, 3, 4}. Sample of size 2; 6 possible samples.
Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8
The inclusion probabilities:
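The values are missing from the transcript, but they can be computed directly from the design. The sketch below assumes the usual definition, that the inclusion probability of unit i is the sum of p(s) over all samples s containing i; the names `samples`, `p` and `incl` are illustrative.

```r
# Design from the illustration: four samples with positive probability.
# The remaining two samples of size 2, {1,3} and {2,4}, have p(s) = 0.
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p <- c(1/2, 1/4, 1/8, 1/8)
stopifnot(isTRUE(all.equal(sum(p), 1)))   # a design must sum to 1

# Inclusion probability of unit i: sum of p(s) over samples containing i.
incl <- sapply(1:4, function(i) {
  contains_i <- sapply(samples, function(s) i %in% s)
  sum(p[contains_i])
})
names(incl) <- 1:4

print(incl)        # 5/8, 3/4, 3/8, 1/4
print(sum(incl))   # equals the fixed sample size n = 2
```

Although every sample has size 2, the units have unequal inclusion probabilities, so this design is not SRS.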

40 Some results

41 Estimation theory: probability sampling in general
Problem: Estimate a population quantity for the variable y. For the sake of illustration: the population total

42 CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases Because, typically we have that

43 Some peculiarities in the estimation theory Example: N=3, n=2, simple random sample

44 For this set of values of the $y_i$'s:

45 Let y be the population vector of the y-values. This example shows that the usual estimator is not uniformly best (minimum variance for all y) among linear design-unbiased estimators. The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models. In fact, we have the following much stronger result:
Theorem: Let p(·) be any sampling design. Assume each $y_i$ can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t.

46 Proof: This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible