New Techniques and Technologies for Statistics 2017  Estimation of Response Propensities and Indicators of Representative Response Using Population-Level.

Presentation transcript:

New Techniques and Technologies for Statistics 2017
Estimation of Response Propensities and Indicators of Representative Response Using Population-Level Information
Annamaria Bianchi, Natalie Shlomo, Barry Schouten, Damiao Da Silva and Chris Skinner

Contents
- Introduction
- Population Based Response Propensities
- Population Based R-indicators
- Evaluation Study
- Real Application
- Discussion

Introduction
- Indirect measures of nonresponse bias supplement the response rate.
- These measures come at a time of increased interest in adapting data collection: the level of effort targeted at different subgroups is varied over time, possibly through a change of strategy, according to patterns of response.
- Taxonomy of measures: indicators that include only observed auxiliary variables, and indicators that also include observed survey variables, which may or may not account for non-response weighting.
- Indicators that use observed auxiliary variables are R-indicators (Schouten, Cobben and Bethlehem 2009; Schouten, Shlomo and Skinner 2011) and balance indicators (Särndal 2011; Lundquist and Särndal 2013).
- R-indicators presume that auxiliary variables are available through linked data from sample frames, registers, etc. Such data are not always available, especially to users outside of NSIs.
- We develop R-indicators that are based on population statistics and that can be computed without any knowledge about the non-respondents.

Introduction
- R-indicators and their statistical properties (Shlomo, Skinner and Schouten, 2012) relate to the case where linked sample-level auxiliary information is available for non-respondents.
- For R-indicators based on population statistics, we propose a new method for estimating response propensities that does not need auxiliary information for non-respondents: population-based response propensities.
- Auxiliary information for population-based response propensities is obtained from population tables and population counts.
- We distinguish two settings:
  (1) the auxiliary information is known for all sample units, respondents and non-respondents (sample-based auxiliary information);
  (2) the auxiliary information is known only at the aggregate level, i.e. the population total and/or the population cross-products (population-based auxiliary information).

Population Based Response Propensities
- Response propensities: [equation not preserved in the transcript], defined in terms of the auxiliary variables; missing at random is assumed to hold (Little and Rubin, 2002).
- Generally, response propensities are modelled by a generalized linear model, e.g. logistic regression.
- In the population-based setting, it is convenient to consider the identity link function.
- The identity link function is a good approximation to the logistic link function when response rates are mid-range, between 30% and 70%, which is typical of the response rates obtained in national and other surveys.
- The identity link function also forms the basis for other representativeness indicators, such as the imbalance and distance indicators proposed by Särndal (2011).
- The true response propensities are assumed to satisfy the linear probability model [equation not preserved in the transcript] and are estimated by weighted least squares [equation not preserved in the transcript], where d_i is the design weight.
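The weighted least squares step under the identity link can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_based_propensities`, the toy data and the single binary auxiliary variable are all assumptions. It uses the standard weighted least squares solution, in which the weighted cross-products are summed over the whole sample and the weighted auxiliary vectors are summed over respondents only.

```python
import numpy as np

def sample_based_propensities(X, responded, d):
    """Linear-probability response propensities by weighted least squares.

    X         : (n, p) auxiliary variables for ALL sample units (with intercept)
    responded : (n,) 0/1 response indicator
    d         : (n,) design weights d_i
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(responded, dtype=float)
    d = np.asarray(d, dtype=float)
    # Weighted cross-products over the full sample: sum_s d_i x_i x_i^T
    A = X.T @ (d[:, None] * X)
    # Weighted sum over respondents only: sum_r d_i x_i
    b = X.T @ (d * r)
    beta = np.linalg.solve(A, b)
    return X @ beta

# Hypothetical toy sample: one binary auxiliary variable, equal design weights.
rng = np.random.default_rng(0)
n = 2000
older = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), older])
responded = rng.random(n) < (0.4 + 0.2 * older)
d = np.full(n, 50.0)

rho_hat = sample_based_propensities(X, responded, d)
print(rho_hat[older == 0][0], rho_hat[older == 1][0])
```

With an intercept and a single binary variable the model is saturated, so the fitted propensities coincide with the weighted response rates in the two groups. Note that this estimator needs the auxiliary variables for non-respondents, which is exactly the requirement the population-based approach removes.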

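Population Based R-indicators
- Replace sums and/or cross-products with population based estimates: [equations not preserved in the transcript].
- Population based R-indicator: [equation not preserved in the transcript]. It takes values on the interval: [interval not preserved in the transcript] (we linearize for ease of bias computation).
- Population based CV: [equation not preserved in the transcript].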
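A minimal sketch of the idea, assuming the population cross-products and totals of the auxiliary variables are known, is given below. It is not the authors' exact estimator: the function name and toy numbers are hypothetical, the R-indicator is taken in its usual form R = 1 - 2 S(rho) with CV = S(rho) / mean(rho), and the variance uses a 1/N divisor for simplicity. Under the linear model the population mean and second moment of the propensities follow from the coefficients and the population tables alone, so no unit-level data on non-respondents are needed.

```python
import numpy as np

def population_based_r_indicator(X_resp, d_resp, T_xx, T_x, N):
    """R-indicator and CV from respondent data plus population tables.

    X_resp : (n_r, p) auxiliary variables of respondents only (with intercept)
    d_resp : (n_r,) design weights of respondents
    T_xx   : (p, p) population cross-products, sum over U of x_i x_i^T
    T_x    : (p,) population totals, sum over U of x_i
    N      : population size
    """
    X_resp = np.asarray(X_resp, dtype=float)
    d_resp = np.asarray(d_resp, dtype=float)
    # Numerator uses respondents only: sum_r d_i x_i
    t_r = X_resp.T @ d_resp
    # Denominator uses the fixed population cross-products (assumption: known)
    beta = np.linalg.solve(T_xx, t_r)
    rho_resp = X_resp @ beta
    # Population moments of x_i^T beta implied by the population tables
    mean_rho = (T_x @ beta) / N
    second_moment = (beta @ T_xx @ beta) / N
    var_rho = max(second_moment - mean_rho ** 2, 0.0)  # 1/N divisor (simplification)
    s_rho = np.sqrt(var_rho)
    return rho_resp, 1.0 - 2.0 * s_rho, s_rho / mean_rho

# Hypothetical toy case: a census (d_i = 1) of N = 1000 units, 400 with x = 1.
# 300 of the 600 units with x = 0 respond, and 320 of the 400 units with x = 1.
N = 1000
T_xx = np.array([[1000.0, 400.0], [400.0, 400.0]])
T_x = np.array([1000.0, 400.0])
X_resp = np.column_stack([np.ones(620), np.r_[np.zeros(300), np.ones(320)]])
d_resp = np.ones(620)

rho_resp, r_ind, cv = population_based_r_indicator(X_resp, d_resp, T_xx, T_x, N)
print(round(r_ind, 4), round(cv, 4))
```

In the toy case the fitted propensities are 0.5 and 0.8, matching the group response rates, which gives a quick sanity check of the algebra.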

R-indicators
- Estimated R-indicator, sample based: [equation not preserved in the transcript].
- Estimated R-indicator, population based: [equation not preserved in the transcript], according to T1 or T2 type of information.
- Estimated CV: [equation not preserved in the transcript].
- Empirical results show that population based R-indicators have standard errors and biases that increase with higher response rates. They ignore the sampling, which causes the sample covariances in the denominator of the estimated response propensities to vary along with the numerator. By 'plugging' in a fixed population covariance in the denominator, there is no variation arising from sampling.
- Proposal: [equation not preserved in the transcript], where λ should be an increasing function of the response rate and converge to 1 with higher response rates (estimated response propensities greater than 1, which arise from the linear link function under high response rates, will then be closer to 1).
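The proposed formula is not preserved in the transcript, so the sketch below uses one form that is consistent with the description and should be read as an assumption rather than as the authors' estimator: the denominator is a convex combination of the fixed population cross-products and the weighted cross-products of the respondents, controlled by λ. The function name and the toy data are hypothetical. With an intercept in the model, λ = 0 gives the plain population-based propensities and λ = 1 gives propensities equal to 1, so increasing λ pulls the estimates towards 1.

```python
import numpy as np

def composite_propensities(X_resp, d_resp, T_xx, lam):
    """Composite propensities for respondents (ASSUMED form, see lead-in).

    Denominator: (1 - lam) * population cross-products
                 + lam * weighted respondent cross-products.
    """
    X_resp = np.asarray(X_resp, dtype=float)
    d_resp = np.asarray(d_resp, dtype=float)
    t_r = X_resp.T @ d_resp                           # sum_r d_i x_i
    A_resp = X_resp.T @ (d_resp[:, None] * X_resp)    # sum_r d_i x_i x_i^T
    A = (1.0 - lam) * np.asarray(T_xx, dtype=float) + lam * A_resp
    return X_resp @ np.linalg.solve(A, t_r)

# Hypothetical toy data: 620 respondents out of a population of 1000.
T_xx = np.array([[1000.0, 400.0], [400.0, 400.0]])
X_resp = np.column_stack([np.ones(620), np.r_[np.zeros(300), np.ones(320)]])
d_resp = np.ones(620)

for lam in (0.0, 0.05, 1.0):
    rho = composite_propensities(X_resp, d_resp, T_xx, lam)
    print(lam, rho[0], rho[-1])
```

Under this assumed form, small values of λ change the propensities only slightly, while values near 1 compress them towards 1.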

R-indicators
In addition:
- Analytical expressions for bias correction under SRS and complex survey designs
- Variance estimation using resampling methods
- Estimation of an optimal λ_opt for the composite response propensity

Evaluation Study
- Based on the 1995 Census Sample of Israel, N = 322,411 households, where we defined probabilities of response and a response indicator (1 = response, 0 = non-response) for different overall response rates.
- Next, we ran a linear and a logistic regression model on the population, where the response variable is the {0,1} indicator, under the real model (Model 1) and a mis-specified model (Model 2).

                                       RR1      RR2      RR3
Overall response rate (%)              27.1     67.0     87.0
Population R-indicator (logistic)
  Model 1                              0.9031   0.9005   0.9063
  Model 2                              0.9103   0.9074   0.9137
Population R-indicator (linear)        0.9033   0.9006   0.9076   0.9104   0.9145
  (five of the six values survive in the transcript; their assignment to Model 1 and Model 2 is not preserved)

Evaluation Study
- Draw 500 samples under 3 sampling fractions: 1%, 2% and 4%
[Figures: 1% and 4% samples, Model 1 and RR1; 1% and 4% samples, Model 1 and RR3]
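The repeated-sampling design can be sketched as below. The population here is synthetic and much smaller than the census file used in the study, the response model and all variable names are hypothetical, and the indicator computed is the sample-based R-indicator under the identity link with a simplified weighted variance. The sketch only illustrates the mechanics: fix a response indicator on the population, draw 500 simple random samples at each sampling fraction, and summarise the bias and standard error of the estimates.

```python
import numpy as np

rng = np.random.default_rng(2017)

# Synthetic population (hypothetical; not the 1995 Census Sample of Israel).
N = 20_000
x1 = rng.integers(0, 2, N)
x2 = rng.integers(0, 2, N)
X_pop = np.column_stack([np.ones(N), x1, x2])
rho_true = 0.5 + 0.15 * x1 - 0.10 * x2
# Response indicator fixed once on the population, as in the evaluation study.
resp_pop = (rng.random(N) < rho_true).astype(float)

def r_indicator(X, r, d):
    """Sample-based R-indicator, identity link, weighted least squares."""
    A = X.T @ (d[:, None] * X)
    beta = np.linalg.solve(A, X.T @ (d * r))
    rho = X @ beta
    mean = np.average(rho, weights=d)
    var = np.average((rho - mean) ** 2, weights=d)  # simplified divisor
    return 1.0 - 2.0 * np.sqrt(var)

# Benchmark value computed on the whole population.
R_population = r_indicator(X_pop, resp_pop, np.ones(N))

results = {}
for fraction in (0.01, 0.02, 0.04):
    n = int(round(fraction * N))
    estimates = np.empty(500)
    for b in range(500):
        idx = rng.choice(N, size=n, replace=False)  # simple random sample
        d = np.full(n, N / n)                       # design weights N / n
        estimates[b] = r_indicator(X_pop[idx], resp_pop[idx], d)
    results[fraction] = {
        "bias": estimates.mean() - R_population,
        "se": estimates.std(ddof=1),
    }

for fraction, summary in results.items():
    print(f"{fraction:.0%}: bias={summary['bias']:+.4f}, se={summary['se']:.4f}")
```

The same loop could be reused for a population-based estimator by swapping the function passed the sample, which would allow the two approaches to be compared under identical samples.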

Dutch Health Survey
- 2002 HS data. The net sample size is 33,584 persons.
- We see differences in respondents vs sample and population based distributions. This will impact on the use of population estimates in the R-indicators, as seen in the estimation of λ_opt.

Distributions of the auxiliary variables (%). For rows marked *, some values are missing from the transcript and the surviving values cannot be assigned to columns with certainty.

Variable         Category      Respondents   Sample   Population
Age              20-24         7.5           7.9      8.1
                 25-29         7.3           8.2      8.9
                 30-34         9.9           10.2     10.9
                 35-39 *       10.8, 11
                 40-44 *       10.3, 10.4
                 45-49         9.7           9.4      9.6
                 50-54 *       9.5
                 55-59 *       8.8, 8
                 60-64         7.1           6.7      6.3
                 65-69         5.9           5.6      5.4
                 70-74 *       4.7, 4.6
                 75+           7.7           7.8      7.2
Gender           Male          48.9          49.8     49.2
                 Female        51.1          50.2     50.8
Marital status   Not married   23.7          26.8     26.9
                 Married       63.3          59.3     58.8
                 Widowed *     6.5
                 Divorced *    6.4, 7.6

Smoothing parameter λ_opt                   Type 1   Type 2
Population-based response propensities      0.043    0.038
Sample-based response propensities          0.076    0.095

Dutch Health Survey

Each complete row gives the unadjusted R-indicator with its 95% CI, followed by the bias-adjusted R-indicator with its 95% CI. For the composite rows the transcript preserves fewer values, and their assignment to columns is not recoverable; the surviving values are listed as they appear.

Estimator                                Unadjusted (95% CI)      Bias-adjusted (95% CI)
Sample-based                             0.899 (0.888, 0.909)     0.901 (0.890, 0.912)
Type 1 - original                        0.876 (0.860, 0.891)     0.879 (0.864, 0.895)
Type 1 - composite population-based      0.880 0.865 0.896
Type 1 - composite sample-based          0.883 0.868 0.898
Type 2 - original                        0.873 (0.858, 0.889)     0.877 (0.861, 0.894)
Type 2 - composite population-based      0.878 0.863 0.862 0.893
Type 2 - composite sample-based          0.881 0.866 0.897

Population-based R-indicators are lower than sample-based R-indicators as a result of the large differences between the respondent distributions of the auxiliary variables and the sample and population distributions.

Discussion
Caveats from this work:
- The survey measures have to be the same quantities as in the population information, i.e. the survey questions must have the same definitions and classifications as the population tables.
- It is best to avoid questions that are prone to measurement errors, such as questions that require a strong cognitive effort or that may lead to socially desirable answers.
- It is strongly recommended to use population statistics that are based on registrations or administrative data. The population-based R-indicators can be used with population statistics that are based on surveys, but these statistics may not reflect the true population distribution accurately, and one would draw erroneous conclusions about the representativeness of the response if the population estimates are biased.

Discussion
Caveats from this work:
- In settings where only population information is available, options to improve representativeness during data collection through adaptive survey designs are much more limited, because no individual auxiliary information is available for the non-respondents.
- In these settings, assessments of representativeness may still be useful in the design of advance and reminder letters, in interviewer training and in paradata collection.
Extensions that are relatively straightforward for future research:
- Consider hybrid settings where the R-indicator is based on both linked data and population tables.
- Develop the case where no aggregated population information is available and weighted survey estimates are used instead. This will impact on the bias and variance estimates for the population based R-indicators.

Thank you for your attention