The perils of non-probability sampling


The perils of non-probability sampling
Jelke Bethlehem, Leiden University, The Netherlands
Inference from Non-Probability Samples, 16-17 March 2017

Overview
  The rise of survey sampling.
  The fundamental principles of probability sampling.
  Case 1: Nonresponse in surveys based on random sampling.
  Case 2: Self-selection in online surveys.
  Some examples of self-selection.
  A random sample + nonresponse, or a self-selection sample.
  Conclusion.

Surveys, polls, …
Different terms for the same type of research:
  Investigation of a group of people (the population).
  Measurement by means of asking questions.
  Observation of only a small selection from the population (the sample).
  Generalisation of the outcomes from the sample to the population.
Does this work?
  Yes, if it is a good survey.
  No, if it is a bad survey.
What is a good survey?
Diagram: Population, Selection, Sample, Conclusions, Generalisation.

The rise of sample surveys
Until 1895: complete enumeration (censuses)
  New France (Canada): 1666, Jean Talon (intendant), 3215 people.
  Sweden: 1748, Denmark: 1769.
  Netherlands: 1795, new system of electoral constituencies.
  Standardized questionnaires.
  Legal obligation to participate.
No sampling
  It was considered not proper to replace people by computations.
  Sampling is discrimination.
  No reliable conclusions possible based on sample data.
  Data are required on all people.

The rise of sample surveys
Developments
  1895: Anders Kiaer proposes his ‘Representative Method’.
    A kind of quota sampling.
    Create a miniature of the population.
    Accuracy of estimates cannot be computed.
  1906: Arthur Bowley shows the importance of random sampling.
    Probability theory can be applied.
    Estimates have a normal distribution.
    Variances can be computed.
  1934: Jerzy Neyman introduces the confidence interval.
    He also shows that quota sampling does not work.

The rise of sample surveys
The fundamental principles of survey sampling
  Samples must be selected by means of probability sampling.
  Every person must have a positive probability of selection.
  All selection probabilities must be known.
Consequences
  It is always possible to construct an unbiased (valid) estimator.
  Estimators often have an (approximately) normal distribution.
  Accuracy of estimators can be computed (confidence intervals).
Warning
  For other forms of sampling (e.g. quota sampling or self-selection), it is not clear how reliable and accurate the outcomes are.
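The claim that known, positive selection probabilities always allow an unbiased estimator can be illustrated with a small sketch. It assumes the Horvitz-Thompson estimator (named later in this talk) and Poisson sampling as the design; the slide specifies neither, and the population and probabilities below are invented for illustration.

```python
import random

def horvitz_thompson_mean(values, inclusion_probs, population_size):
    """Estimate the population mean by weighting each sampled value with 1/probability."""
    return sum(y / p for y, p in zip(values, inclusion_probs)) / population_size

def draw_poisson_sample(population, inclusion_probs, rng):
    """Include each unit independently with its own known, positive probability."""
    return [(y, p) for y, p in zip(population, inclusion_probs) if rng.random() < p]

rng = random.Random(1)
# Hypothetical population: 1 = votes for the party, 0 = does not (true mean 0.4).
population = [1] * 400 + [0] * 600
# Hypothetical unequal but known probabilities: party voters are drawn twice as often.
probs = [0.2] * 400 + [0.1] * 600

ht_estimates, naive_estimates = [], []
for _ in range(1000):
    sample = draw_poisson_sample(population, probs, rng)
    ys = [y for y, _ in sample]
    ps = [p for _, p in sample]
    ht_estimates.append(horvitz_thompson_mean(ys, ps, len(population)))
    naive_estimates.append(sum(ys) / len(ys))  # ignores the selection probabilities

mean_ht = sum(ht_estimates) / len(ht_estimates)
mean_naive = sum(naive_estimates) / len(naive_estimates)
print(f"true mean 0.400, Horvitz-Thompson {mean_ht:.3f}, unweighted {mean_naive:.3f}")
```

Averaged over many samples, the weighted estimate should sit close to the true value, while the unweighted sample mean is pulled towards the over-sampled group. Without known probabilities there is nothing to weight by, which is the point of the warning above.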

Bad example: the presidential elections in the U.S. in 1936
The Literary Digest poll
  Sample: lists of car owners and telephone directories.
  Sample size: 2,400,000.
  Prediction: Alf Landon (Republican) will win.
The George Gallup poll
  Quota sample (gender, age, socioeconomic class, region).
  Sample size: 50,000.
  Hundreds of interviewers spread over the country.
  Prediction: Franklin Roosevelt (Democrat) will win.

Bad example: the presidential elections in the U.S. in 1936
Result
  Prediction of Literary Digest was wrong.
  Prediction of Gallup was right.
Conclusion
  Sample of Literary Digest was not representative.
  Sample contained too many middle and upper class people.
  They tend to vote Republican.
  A large sample does not help if it lacks representativity.

Candidate        Prediction by Literary Digest  Prediction by Gallup  Election result
Roosevelt (Dem)  43%                            56%                   61%
Landon (Rep)     57%                            44%                   37%

Bad example: the presidential elections in the U.S. in 1948
The Gallup Poll
  Thomas Dewey (Republican) vs Harry Truman (Democrat).
  Prediction: Dewey will win.
  Newspapers did not wait for final results.
  Harry Truman turned out to be the winner.

Example: the presidential elections in the U.S. in 1948
The result

Candidate     Prediction by Gallup  Election result
Truman (Dem)  44%                   50%
Dewey (Rep)   50%                   45%

Causes
  Quota sampling instead of probability sampling.
  Over-representation of Republicans in quota samples.
  Poll stopped too early (two weeks before the elections).
Consequence
  Gallup replaced quota sampling by random sampling.

Random sampling really works …
Simulation: election survey in the town of Rhinewood
  Population of 30,000 voters.
  Estimate percentage voting for the New Internet Party (NIP).
  True population percentage: 39.5%.
  Repeat sample selection a large number of times.
Figure panels: samples of size 500; samples of size 2,000.
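The Rhinewood simulation can be sketched as follows. The population size, the true percentage and the two sample sizes come from the slide; the number of repetitions (500), the seed and the use of simple random sampling without replacement are assumptions made here.

```python
import random
import statistics

POPULATION_SIZE = 30_000
TRUE_PERCENTAGE = 39.5
nip_voters = round(POPULATION_SIZE * TRUE_PERCENTAGE / 100)
# 1 = votes for the New Internet Party, 0 = votes for something else.
population = [1] * nip_voters + [0] * (POPULATION_SIZE - nip_voters)

def simulate(sample_size, repetitions, rng):
    """Return the estimated NIP percentage for each repeated simple random sample."""
    return [100 * sum(rng.sample(population, sample_size)) / sample_size
            for _ in range(repetitions)]

rng = random.Random(2017)
results = {n: simulate(n, 500, rng) for n in (500, 2000)}
for n, estimates in results.items():
    print(f"n={n}: mean estimate {statistics.mean(estimates):.1f}%, "
          f"standard deviation {statistics.stdev(estimates):.2f}")
```

Both sample sizes should give estimates centred on the true percentage; the larger samples only narrow the spread around it.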

Problems
Now, let’s have a look at the current situation in general. There are a number of issues.
Increased costs
  Interviewer-assisted surveys (CAPI, CATI) are becoming too expensive.
  Can we change to online surveys without sacrificing quality?
Sampling issues
  There are no proper sampling frames for online surveys.
  It becomes difficult to select a sample for a telephone survey.
Increasing nonresponse problems
  Response rates < 10% for telephone surveys (RDD).
  Response rates < 40% for online surveys.
Do the principles of probability sampling still apply?

The nonresponse problem
Nonresponse in surveys
  Persons who are selected in the sample (and who belong to the target population) do not provide the requested information.
Main causes of nonresponse
  Non-contact.
  Refusal.
  Not-able.
Consequences of nonresponse
  Estimators are less precise due to fewer observations.
  Representativity is affected, because nonresponse is selective.
  Therefore, estimators can be biased.

The nonresponse problem
Simulation: election survey in the town of Rhinewood
  Estimate percentage voting for the New Internet Party (NIP).
  True population percentage: 39.5%.
  People with internet respond more than those without.
  Repeat sample selection a large number of times.
Figure panels: samples of size 2,000 with full response; samples of size 2,000 with 61% response.
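A minimal sketch of this nonresponse simulation follows. Only the 39.5% true percentage, the 61% response and the sample size of 2,000 come from the slide. The internet share, the NIP support within each group and the two response probabilities are hypothetical values, chosen so that the totals match those figures.

```python
import random
import statistics

rng = random.Random(61)

# Hypothetical population of 30,000 voters as (votes_nip, response_probability).
# Assumed: 18,000 persons with internet (55% NIP, respond with probability 0.75)
# and 12,000 without internet (16.25% NIP, respond with probability 0.40).
population = ([(1, 0.75)] * 9_900 + [(0, 0.75)] * 8_100
              + [(1, 0.40)] * 1_950 + [(0, 0.40)] * 10_050)

true_pct = 100 * sum(y for y, _ in population) / len(population)

def one_survey(sample_size):
    """Draw a simple random sample, then keep only the persons who respond."""
    sample = rng.sample(population, sample_size)
    responses = [y for y, rho in sample if rng.random() < rho]
    estimate = 100 * sum(responses) / len(responses)
    return estimate, len(responses) / sample_size

runs = [one_survey(2000) for _ in range(300)]
mean_estimate = statistics.mean(est for est, _ in runs)
mean_response_rate = statistics.mean(rate for _, rate in runs)
print(f"true {true_pct:.1f}%, mean estimate {mean_estimate:.1f}%, "
      f"mean response rate {mean_response_rate:.0%}")
```

Under these assumptions the estimates cluster tightly, but around a value several points above the true percentage, because the group that responds more also supports the NIP more.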

The nonresponse problem
Modelling nonresponse
  Each person k has an unknown probability ρk to respond.
  Simple random sample.
  The response percentage (the percentage among the respondents) is a biased estimator of the population percentage.
  The bias of the estimator is approximately equal to
$$B(\bar{y}_R) \approx \frac{R_{\rho Y} \, S_\rho \, S_Y}{\bar{\rho}}$$
  where $S_Y$ is the standard deviation of the target variable.
Bias depends on
  Correlation $R_{\rho Y}$ between target variable and response probabilities.
  The standard deviation $S_\rho$ of the response probabilities.
  Mean response probability $\bar{\rho}$.

The nonresponse bias
The bias due to nonresponse:
$$B(\bar{y}_R) \approx \frac{R_{\rho Y} \, S_\rho \, S_Y}{\bar{\rho}}$$
  $R_{\rho Y}$: correlation between response probabilities and survey variable.
  $S_\rho$: variation of response probabilities.
  $\bar{\rho}$: response rate.
  $S_Y$: standard deviation of the survey variable.
Note: Increasing the sample size does not help to reduce the bias.
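As a numerical check, the sketch below evaluates the bias in two ways: through the correlation, the two standard deviations and the mean response probability, and directly as the expected respondent mean minus the population mean. The four population groups are hypothetical. The sketch also assumes the usual approximation that the expected respondent mean is the response-probability-weighted mean.

```python
import math

# Hypothetical groups: (number of persons, y = votes NIP (1/0), response probability rho).
groups = [(9_900, 1, 0.75), (8_100, 0, 0.75), (1_950, 1, 0.40), (10_050, 0, 0.40)]

N = sum(n for n, _, _ in groups)
mean_y = sum(n * y for n, y, _ in groups) / N
mean_rho = sum(n * rho for n, _, rho in groups) / N

# Population standard deviations and correlation (denominator N throughout).
s_y = math.sqrt(sum(n * (y - mean_y) ** 2 for n, y, _ in groups) / N)
s_rho = math.sqrt(sum(n * (rho - mean_rho) ** 2 for n, _, rho in groups) / N)
covariance = sum(n * (y - mean_y) * (rho - mean_rho) for n, y, rho in groups) / N
r_rho_y = covariance / (s_rho * s_y)

# Bias from correlation, standard deviations and mean response probability.
bias_formula = r_rho_y * s_rho * s_y / mean_rho

# Bias computed directly: expected respondent mean minus population mean.
expected_respondent_mean = (sum(n * y * rho for n, y, rho in groups)
                            / sum(n * rho for n, _, rho in groups))
bias_direct = expected_respondent_mean - mean_y

print(f"bias via formula {bias_formula:.4f}, direct {bias_direct:.4f}")
```

The sample size appears nowhere in either computation, which is consistent with the note that a larger sample does not reduce the bias.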

Solving the nonresponse problem
Solution 1: reduction of the nonresponse
  Reduce nonresponse in the field as much as possible.
  For example: more training for interviewers, more reminders, use of incentives.
  This is expensive and time-consuming.
  There will always remain nonresponse.
Solution 2: correction for the effects of nonresponse
  This is the usual approach.
  Attempt to reduce the bias of estimators.
  Apply a correction technique: adjustment weighting.

Solving the nonresponse problem
What is adjustment weighting?
  Assignment of weights to responding persons.
  Use of weighted values to compute estimates.
  People in over-represented groups get weight smaller than 1.
  People in under-represented groups get weight larger than 1.
Ingredients: auxiliary variables
  Individual values must be measured in the survey.
  Distribution in population must be available.
Weighting is only effective if
  There is strong correlation with target variables of the survey.
  There is strong correlation with response behaviour in the survey.
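The weighting idea can be made concrete with a small sketch of post-stratification on a single auxiliary variable. Internet access is used as that variable purely for illustration, and the respondent counts and population shares are hypothetical.

```python
from collections import Counter

def poststratification_weights(respondent_strata, population_shares):
    """Weight per stratum = population share / share among the respondents."""
    counts = Counter(respondent_strata)
    n = len(respondent_strata)
    return {stratum: population_shares[stratum] / (count / n)
            for stratum, count in counts.items()}

def weighted_mean(values, strata, weights):
    """Mean of the values with each respondent weighted by its stratum weight."""
    total_weight = sum(weights[s] for s in strata)
    return sum(weights[s] * y for y, s in zip(values, strata)) / total_weight

# Hypothetical response: 900 persons with internet (495 vote NIP)
# and 320 persons without internet (52 vote NIP).
strata = ["internet"] * 900 + ["no internet"] * 320
votes = [1] * 495 + [0] * 405 + [1] * 52 + [0] * 268
# Hypothetical known population distribution of the auxiliary variable.
population_shares = {"internet": 0.60, "no internet": 0.40}

weights = poststratification_weights(strata, population_shares)
unweighted = sum(votes) / len(votes)
weighted = weighted_mean(votes, strata, weights)
print(weights)
print(f"unweighted {unweighted:.3f}, weighted {weighted:.3f}")
```

The over-represented internet group receives a weight below 1 and the under-represented group a weight above 1. The correction works here only because the auxiliary variable is related both to voting and to response, which are the two conditions listed above.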

Solving the nonresponse problem
Weighting techniques (1)
  Make the response representative with respect to auxiliary variables.
  Post-stratification (simple weighting).
  Generalized regression estimation (linear weighting).
  Raking ratio estimation (multiplicative weighting).
Weighting techniques (2)
  Based on estimated response probabilities (response propensities).
  Adapt Horvitz-Thompson estimator by including estimated response probabilities.
  Post-stratification based on estimated response probabilities.
Problem
  Insufficient effective auxiliary variables.
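Raking ratio estimation is useful when only the marginal population distributions of the auxiliary variables are known, not their cross-classification. The sketch below uses iterative proportional fitting on a two-way table. The two auxiliary variables (internet access by age group), the respondent counts and the target margins are all hypothetical.

```python
def rake(cell_counts, row_targets, col_targets, iterations=100):
    """Iterative proportional fitting: rescale rows, then columns, until margins match."""
    cells = [list(row) for row in cell_counts]
    for _ in range(iterations):
        for i, target in enumerate(row_targets):
            factor = target / sum(cells[i])
            cells[i] = [c * factor for c in cells[i]]
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in cells)
            for row in cells:
                row[j] *= factor
    return cells

# Hypothetical respondent counts: rows = internet / no internet, columns = young / old.
observed = [[500.0, 400.0], [80.0, 240.0]]
# Hypothetical population margins, scaled to the 1,220 respondents:
# 60% / 40% for internet access and 45% / 55% for age.
row_targets = [732.0, 488.0]
col_targets = [549.0, 671.0]

fitted = rake(observed, row_targets, col_targets)
# The adjustment weight of a respondent is the fitted count over the observed count.
weights = [[f / o for f, o in zip(fitted_row, observed_row)]
           for fitted_row, observed_row in zip(fitted, observed)]
for row in weights:
    print([round(w, 3) for w in row])
```

Each pass multiplies the cells by a row factor and a column factor, hence the name multiplicative weighting. A fixed number of iterations is used here for brevity; a tolerance-based stopping rule would be the more careful choice.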

Online surveys
Online survey
  Rapidly became very popular.
  Simple access to large group of potential respondents.
  Fast data collection. One can do a survey in a day.
  Cheap: no interviewers, no printing costs, no mailing costs.
  Attractive: use of video, audio, pictures, and animation.
  Everyone can do it!
Problem
  How to select a sample for an online survey?

Online surveys
Selection of a random sample for an online survey
  A sampling frame is required for a probability sample.
  There is often no sampling frame containing e-mail addresses.
  So it is not possible to send an e-mail with a link to the questionnaire website.
Alternative 1: survey with different recruitment mode
  Draw a random sample from a population register, or an address list, and send a letter (with a link) to each selected address.
  Draw a random sample of telephone numbers, call the selected people, and give them a link.
  Disadvantages: more cumbersome, not so fast, more expensive.

Online surveys
Alternative 2: set up an online panel
  Recruit members with a random sample.
  Recruitment mode: mail or CATI.
  Only the setup is time-consuming and expensive.
  Surveys are based on a random sample from the panel.
Bad alternative
  Rely on self-selection (opt-in) of respondents.
  Self-selection sampling = non-probability sampling.

Online surveys, self-selection problems
No probability sampling is applied.
  Participants are people that happen to see the invitation, have internet, and spontaneously decide to participate.
  It is a cheap and fast way to collect a lot of data.
  However, the sample is not representative.
Problems
  Also people outside the target population of the survey can respond.
  Often people can respond more than once (on the same, or on a different computer).
  Groups of people may attempt to manipulate the outcomes of the web survey.

Online surveys, self-selection problems
Example 1 of self-selection
  2005 Book of the Year Award.
  High profile literary prize in the Netherlands.
  A self-selection survey was carried out to select the best book.
  One could vote for one of the nominated books, or suggest another book.
  Number of participants: 92,000.
  72% voted for a non-nominated book: the new Bible translation.
  Result of a campaign by Bible societies, a Christian TV-channel, and a Christian newspaper.

Online surveys, self-selection problems
Example 2 of self-selection
  Local elections in Amsterdam in 2014.
  Debate between local party leaders.
  Online poll: who was the best?
  Two campaign teams discovered one could vote more than once.
  They voted all night.
  Results:

Party  Votes
D66    3,890
SP     3,816
PvdA   1,121
GL       852
VVD      214

  The poll was cancelled.

Online surveys, self-selection problems
Example 3 of self-selection
  Two major Dutch election polls.
  Both based on self-selection online panels.
  Estimates for largest party (right-wing, populist PVV).
  Systematic difference of 8-10 seats (on a total of 150 seats).

Online surveys, self-selection problems
Example 4: The UK Polling Disaster
  General election of 7 May 2015 in the United Kingdom.
  There were many polls (telephone and online).
  All predicted a neck-and-neck race between the Conservative Party and the Labour Party, likely leading to a ‘hung parliament’.
  They were all wrong: the Conservatives got a comfortable lead of 6.5 percentage points over Labour.
Figure: Sky News, 4 May 2015.

Online surveys, self-selection problems
Example 4: The UK Polling Disaster
  Difference between Conservatives and Labour.
  Difference in election: 6.5%

Poll           Sample
Populus         3,917
YouGov         10,307
Survation       4,088
PanelBase       3,019
Opinium         2,916
TNS             1,185
BMG             1,009
Ipsos MORI      1,186
ComRes          1,007
ICM             2,023
Lord Ashcroft   3,028

  Mode of each poll: web panel or telephone.
  Predicted differences shown: 0%, -2%, 1%, -1% (the per-poll values are not recoverable here).

Online surveys, self-selection problems
Example 4: The UK Polling Disaster
  Variable: the difference between Conservatives and Labour.
  There are significant differences between the polls and the election result.

Online surveys, self-selection problems
Example 4: The UK Polling Disaster
  There was no ‘Shy Tory Factor’.
  There was no ‘Late Swing’.
  ‘Herding’ could not be excluded completely.
Conclusions
  The web surveys were not representative because they were based on self-selection web panels.
  The telephone surveys were not representative because they suffered from very low response rates (20%).
  The weighting adjustment techniques used were not effective. They were not able to reduce the bias.

Online surveys, self-selection problems
Example 5: Shopping Sundays in a municipality
  Urban town Alphen (70,000 people).
  Seven rural villages: Aarlanderveen, Benthuizen, Boskoop, Hazerswoude-Dorp, Hazerswoude-Rijndijk, Koudekerk, Zwammerdam (together 30,000 people).
Photos: Alphen, Benthuizen.

Online surveys, self-selection problems
Example 5: Shopping Sundays in a municipality
  Should the shops be open on Sunday?
  Liberal parties in favour, Christian parties opposed.
Three surveys at the same time
  Face-to-face interviews by members of political parties in shopping centres on one Saturday afternoon. 754 interviews.
  Citizen panel, based on random sample from population register. 857 interviews. Response: 54%.
  Self-selection web survey, to give everyone the possibility to express an opinion. 1550 interviews.
Note
  Appeal by churches to their members to vote.

Online surveys, self-selection problems
Example 5: Shopping Sundays in a municipality
  Distribution of the response over town and villages.
  Small Christian villages are over-represented.

Online surveys, self-selection problems
Example 5: Shopping Sundays in a municipality
  Results of the surveys.
  Large differences between surveys! Which one is correct?

Self-selection problems
Self-selection bias
  Respondents are people that have internet, happen to see the invitation, and spontaneously decide to participate.
  Each person k has unknown probability πk to participate.
  The bias of the estimator is approximately equal to
$$B(\bar{y}) \approx \frac{R_{\pi Y} \, S_\pi \, S_Y}{\bar{\pi}}$$
  where $S_Y$ is the standard deviation of the target variable.
Bias depends on
  Correlation $R_{\pi Y}$ between target variable and participation probabilities.
  The standard deviation $S_\pi$ of the participation probabilities.
  Mean participation probability $\bar{\pi}$.

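Self-selection problems
Question
  Isn’t a random sample with a low response rate just as bad as a self-selection sample?
  The expressions for the bias look similar.
  However, the response probabilities are much larger than the participation probabilities.
Worst case: maximum absolute bias
  Random sample + nonresponse: $|B_{\max}| = S_Y \sqrt{1/\bar{\rho} - 1}$
  Self-selection sample: $|B_{\max}| = S_Y \sqrt{1/\bar{\pi} - 1}$
  Here $S_Y$ is the standard deviation of the target variable, $\bar{\rho}$ the mean response probability and $\bar{\pi}$ the mean participation probability.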

Self-selection problems
Examples
  CAPI survey, random sample, response rate = 60%.
  Telephone survey, random sample (RDD), response rate = 10%.
  Self-selection survey. Population = 12 million, response = 120,000, participation rate = 1%.
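The three examples can be compared with a short calculation. It assumes the worst-case bound that follows from the bias expression when the correlation equals 1 and the probabilities vary as much as possible, namely a maximum absolute bias of S_Y times sqrt(1/p - 1), with p the mean response or participation probability. The function name and the scenario labels are illustrative.

```python
import math

def max_abs_bias_factor(mean_probability):
    """Worst-case absolute bias in units of S_Y: sqrt(1/p - 1) for mean probability p."""
    if not 0 < mean_probability <= 1:
        raise ValueError("mean probability must lie in (0, 1]")
    return math.sqrt(1 / mean_probability - 1)

scenarios = {
    "CAPI survey, response rate 60%": 0.60,
    "Telephone survey (RDD), response rate 10%": 0.10,
    "Self-selection survey, 120,000 of 12 million": 120_000 / 12_000_000,
}

for label, probability in scenarios.items():
    factor = max_abs_bias_factor(probability)
    print(f"{label}: maximum absolute bias = {factor:.2f} x S_Y")
```

Under this bound the factors are about 0.82, 3.00 and 9.95. The large self-selection sample therefore has a worst case more than three times that of the telephone survey with 10% response, despite having 120,000 respondents.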

Conclusions
Random sampling or self-selection?
  The worst case bias of a large self-selection sample is much larger than that of a random sample with a very low response rate.
  Use random sampling.
  Do not throw out the baby with the bath water.
But
  Improve the quality of web panels.
  Work on better weighting adjustment techniques.