New Techniques and Technologies for Statistics 2017: Estimation of Response Propensities and Indicators of Representative Response Using Population-Level Information

Presentation transcript:

New Techniques and Technologies for Statistics 2017
Estimation of Response Propensities and Indicators of Representative Response Using Population-Level Information
Annamaria Bianchi, Natalie Shlomo, Barry Schouten, Damiao Da Silva and Chris Skinner

Contents
- Introduction
- Population Based Response Propensities
- Population Based R-indicators
- Evaluation Study
- Real Application
- Discussion

Introduction
- Indirect measures of nonresponse bias supplement the response rate
- These measures come at a time of increased interest in adapting data collection: the level of effort targeted at different subgroups is varied over time, possibly through a change of strategy, according to patterns of response
- Taxonomy of measures: indicators that include only observed auxiliary variables, and indicators that also include observed survey variables, which may or may not account for non-response weighting
- Indicators that use observed auxiliary variables are R-indicators (Schouten, Cobben and Bethlehem 2009; Schouten, Shlomo and Skinner 2011) and balance indicators (Särndal 2011; Lundquist and Särndal 2013)
- R-indicators presume the availability of auxiliary variables through linked data from sample frames, registers, etc., which are not always available, especially to users outside of NSIs
- We develop R-indicators that are based on population statistics and that can be computed without any knowledge about the non-respondents

Introduction
- R-indicators and their statistical properties (Shlomo, Skinner and Schouten, 2012) relate to the case where linked sample-level auxiliary information is available for non-respondents
- For R-indicators based on population statistics, we propose a new method for estimating response propensities that does not need auxiliary information for non-respondents: population-based response propensities
- Auxiliary information for population-based response propensities is obtained from population tables and population counts
- Distinguish two settings: (1) the auxiliary information is known for all sample units, respondents and non-respondents (sample-based auxiliary information), and (2) the auxiliary information is known only at the aggregate level, i.e. the population total and/or the population cross-products (population-based auxiliary information)

Population Based Response Propensities
[The formulas on this slide were not captured in the transcript.]
- Response propensities are defined conditional on the auxiliary variables, assuming that missing at random holds given those variables (Little and Rubin, 2002)
- Generally, response propensities are modelled by a generalized linear model, e.g. logistic regression
- In the population-based setting, it is convenient to consider the identity link function
- The identity link function is a good approximation to the logistic link function when response rates are mid-range, between 30% and 70%, which is the typical response rate obtained in national and other surveys
- The identity link function also forms the basis for other representativeness indicators, such as the imbalance and distance indicators proposed by Särndal (2011)
- The true response propensities are assumed to satisfy the linear probability model and are estimated by weighted least squares, where d_i is the design weight
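The weighted least squares step under the identity link can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual estimator form beta = (sum over the sample of d_i x_i x_i')^-1 (sum over respondents of d_i x_i), and it assumes that the population-based variant simply swaps the weighted sample cross-products for known population cross-products. The data are synthetic, and every name (identity_link_propensities, X_pop, T_sample, T_pop) is hypothetical.

```python
import numpy as np

def identity_link_propensities(x_resp, d_resp, cross_products):
    """Solve cross_products @ beta = sum_r d_i x_i; return (rho_hat, beta)."""
    numerator = x_resp.T @ d_resp              # design-weighted sum of x over respondents
    beta = np.linalg.solve(cross_products, numerator)
    return x_resp @ beta, beta                 # fitted propensities for respondents

rng = np.random.default_rng(0)
N, n = 50_000, 5_000                           # synthetic population and SRS sample size

# Synthetic population: intercept plus one binary auxiliary variable (assumption).
z = rng.integers(0, 2, size=N)
X_pop = np.column_stack([np.ones(N), z]).astype(float)
true_rho = 0.4 + 0.2 * z                       # linear probability model

sample = rng.choice(N, size=n, replace=False)
d = np.full(n, N / n)                          # design weights under SRS
X_s = X_pop[sample]
responded = rng.random(n) < true_rho[sample]
X_r, d_r = X_s[responded], d[responded]

# Setting (1): auxiliary information known for the whole sample.
T_sample = (X_s * d[:, None]).T @ X_s
rho_sample, beta_sample = identity_link_propensities(X_r, d_r, T_sample)

# Setting (2): only population cross-products known (assumed Type 1 style input).
T_pop = X_pop.T @ X_pop
rho_pop, beta_pop = identity_link_propensities(X_r, d_r, T_pop)

print("sample-based beta:    ", beta_sample)
print("population-based beta:", beta_pop)
```

Note that nothing in the identity link constrains the fitted values to lie in [0, 1]; with high response rates some fitted propensities can exceed 1.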

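Population Based R-indicators
[The formulas on this slide were not captured in the transcript.]
- Replace sums and/or cross-products with population-based estimates (two alternative forms are given on the slide)
- Population-based R-indicator: defined from the population-based response propensities; the slide states the interval on which it takes values (we linearize for ease of bias computation)
- Population-based CV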
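For orientation, the standard population-level definitions from the R-indicator literature cited in the introduction can be written as below. This is a sketch under the assumption that the slide builds on these usual definitions; the notation (U for the population, N for its size, R_i for the response indicator, x_i for the auxiliary vector, S for the standard deviation of the propensities) is introduced here, and the population-based estimators themselves are not reconstructed.

```latex
\rho_i = P(R_i = 1 \mid x_i), \qquad
\bar{\rho} = \frac{1}{N} \sum_{i \in U} \rho_i, \qquad
S^2(\rho) = \frac{1}{N-1} \sum_{i \in U} \left( \rho_i - \bar{\rho} \right)^2

R(\rho) = 1 - 2\, S(\rho), \qquad
CV(\rho) = \frac{S(\rho)}{\bar{\rho}}
```

A value of R close to 1 corresponds to nearly equal response propensities across the population; larger variation in the propensities lowers R and raises the CV.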

R-indicators
[The formulas on this slide were not captured in the transcript.]
- Estimated R-indicator: a sample-based version and a population-based version, the latter according to Type 1 (T1) or Type 2 (T2) information, together with the estimated CV
- Empirical results show that population-based R-indicators have standard errors and biases that increase with higher response rates. They ignore the sampling, which in the sample-based case causes the sample covariances in the denominator of the estimated response propensities to vary along with the numerator. When a fixed population covariance is plugged into the denominator, the denominator has no variation arising from sampling
- Propose a composite response propensity with smoothing parameter λ, where λ should be an increasing function of the response rate and converge to 1 with higher response rates (estimated response propensities greater than 1, which arise from the linear link function under high response rates, will be closer to 1)
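As a point of reference, a sample-based R-indicator and CV can be computed from estimated propensities as sketched below. The sketch assumes the common design-weighted variance estimator with divisor N - 1; it does not implement the population-based or composite estimators of the slide, whose formulas are not available in the transcript. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def r_indicator_and_cv(rho_hat, d, N):
    """Design-weighted R-indicator (1 - 2S) and CV (S / mean) from propensities."""
    rho_bar = np.sum(d * rho_hat) / N                      # weighted mean propensity
    s2 = np.sum(d * (rho_hat - rho_bar) ** 2) / (N - 1)    # weighted variance
    s = np.sqrt(s2)
    return 1.0 - 2.0 * s, s / rho_bar

# Toy input: two equal-sized groups with propensities 0.5 and 0.7,
# sample of 100 units from a population of 1,000 (weights of 10).
rho_hat = np.array([0.5] * 50 + [0.7] * 50)
d = np.full(100, 10.0)
r, cv = r_indicator_and_cv(rho_hat, d, N=1000)
print(round(r, 3), round(cv, 3))
```

In this toy case the propensities have mean 0.6 and standard deviation of about 0.1, so the R-indicator is about 0.8 and the CV about 0.167.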

R-indicators
In addition:
- Analytical expressions for bias correction under SRS and complex survey designs
- Variance estimation using resampling methods
- Estimation of an optimal λ_opt for the composite response propensity

Evaluation Study
- Based on the 1995 Census Sample of Israel, N = 322,411 households, where we defined probabilities of response and a response indicator (1 = response, 0 = non-response) for different overall response rates
- Next, we ran linear and logistic regression models on the population, with the {0,1} indicator as the response variable, under the real model (Model 1) and a mis-specified model (Model 2)

                                      RR1      RR2      RR3
Overall response rate (%)             27.1     67.0     87.0
Population R-indicator (logistic)
  Model 1                             0.9031   0.9005   0.9063
  Model 2                             0.9103   0.9074   0.9137
Population R-indicator (linear)       0.9033   0.9006   0.9076   0.9104   0.9145

Only five of the six linear-model values survive in the transcript, listed here in source order; their assignment to Model 1 and Model 2 and to RR1 to RR3 is uncertain.
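The population-level comparison of linear and logistic fits can be illustrated on synthetic data as follows. This is a sketch only: the covariates, coefficients and population size are made up and do not reproduce the Israeli census study, the logistic model is fitted with a hand-written Newton-Raphson loop, and the names (fit_logistic, r_linear, r_logistic) are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Logistic regression by Newton-Raphson (no regularisation)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(1995)
N = 50_000                                      # synthetic population size

# Two binary auxiliary variables with an intercept (assumption).
a = rng.integers(0, 2, size=N)
b = rng.integers(0, 2, size=N)
X = np.column_stack([np.ones(N), a, b]).astype(float)

# Defined response probabilities and a fixed 0/1 response indicator.
true_rho = 1.0 / (1.0 + np.exp(-(0.5 + 0.4 * a - 0.3 * b)))
y = (rng.random(N) < true_rho).astype(float)

# Linear probability model (ordinary least squares on the population).
beta_lin = np.linalg.lstsq(X, y, rcond=None)[0]
rho_lin = X @ beta_lin

# Logistic model on the same population.
rho_log = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, y)))

# Population R-indicator: 1 - 2 * standard deviation of fitted propensities.
r_linear = 1.0 - 2.0 * np.std(rho_lin, ddof=1)
r_logistic = 1.0 - 2.0 * np.std(rho_log, ddof=1)
print("overall response rate:", y.mean())
print("R (linear):  ", r_linear)
print("R (logistic):", r_logistic)
```

With a mid-range response rate like the one simulated here, the two link functions give very similar R-indicators, in line with the pattern in the table above.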

Evaluation Study
- Draw 500 samples under 3 sampling fractions: 1%, 2% and 4%
[Figure panels: 1% and 4% samples, Model 1 and RR1; 1% and 4% samples, Model 1 and RR3]
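The repeated-sampling design can be sketched as a small Monte Carlo experiment. The example below is an assumption-laden illustration on a synthetic population: it uses simple random sampling, a sample-based R-indicator with identity-link propensities, and only the 1% and 4% fractions; the population, the propensities and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2017)
N = 20_000                                       # synthetic population size

# One auxiliary variable with three categories, coded as dummies (assumption).
z = rng.integers(0, 3, size=N)
X_pop = np.column_stack([np.ones(N), z == 1, z == 2]).astype(float)
rho = np.array([0.5, 0.6, 0.7])[z]
responded_pop = rng.random(N) < rho              # response indicator fixed once

def sample_r_indicator(idx):
    """Sample-based R-indicator for the sample with population indices idx."""
    n = idx.size
    d = np.full(n, N / n)                        # SRS design weights
    X_s, r_s = X_pop[idx], responded_pop[idx]
    T = (X_s * d[:, None]).T @ X_s               # weighted sample cross-products
    beta = np.linalg.solve(T, X_s[r_s].T @ d[r_s])
    rho_hat = X_s @ beta
    rho_bar = np.sum(d * rho_hat) / N
    s2 = np.sum(d * (rho_hat - rho_bar) ** 2) / (N - 1)
    return 1.0 - 2.0 * np.sqrt(s2)

results = {}
for fraction in (0.01, 0.04):
    n = round(fraction * N)
    draws = [sample_r_indicator(rng.choice(N, size=n, replace=False))
             for _ in range(500)]
    results[fraction] = np.array(draws)
    print(f"{fraction:.0%}: mean={results[fraction].mean():.4f}, "
          f"sd={results[fraction].std(ddof=1):.4f}")
```

Comparing the mean over replicates with the population value gives a Monte Carlo estimate of bias, and the standard deviation over replicates estimates the standard error at each sampling fraction.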

Dutch Health Survey
- 2002 HS data. The net sample size is 33,584 persons.
- We see differences between the respondent distribution and the sample and population based distributions. This will impact on the use of population estimates in the R-indicators, as seen in the estimation of λ_opt

Distributions of auxiliary variables (%):

Variable         Category       Respondents   Sample   Population
Age              20-24          7.5           7.9      8.1
                 25-29          7.3           8.2      8.9
                 30-34          9.9           10.2     10.9
                 35-39 *        10.8          11
                 40-44 *        10.3          10.4
                 45-49          9.7           9.4      9.6
                 50-54 *        9.5
                 55-59 *        8.8           8
                 60-64          7.1           6.7      6.3
                 65-69          5.9           5.6      5.4
                 70-74 *        4.7           4.6
                 75+            7.7           7.8      7.2
Gender           Male           48.9          49.8     49.2
                 Female         51.1          50.2     50.8
Marital status   Not married    23.7          26.8     26.9
                 Married        63.3          59.3     58.8
                 Widowed *      6.5
                 Divorced *     6.4           7.6

* Fewer than three values survive in the transcript for this row; they are listed in source order and their column assignment is uncertain.

Smoothing parameter λ_opt:

                                          Type 1   Type 2
Population-based response propensities    0.043    0.038
Sample-based response propensities        0.076    0.095

Dutch Health Survey

                      Unadjusted                    Bias-adjusted
Estimator             R-indicator   95% CI          R-indicator   95% CI
Sample-based          0.899         0.888, 0.909    0.901         0.890, 0.912
Type 1 - original     0.876         0.860, 0.891    0.879         0.864, 0.895
Type 2 - original     0.873         0.858, 0.889    0.877         0.861, 0.894

Composite estimators (values as they survive in the transcript, in source order; the split between unadjusted and bias-adjusted columns is not recoverable):

Type 1 - composite population-based    0.880   0.865   0.896
Type 1 - composite sample-based        0.883   0.868   0.898
Type 2 - composite population-based    0.878   0.863   0.862   0.893
Type 2 - composite sample-based        0.881   0.866   0.897

- Population-based R-indicators are lower than sample-based R-indicators, as a result of the large differences between the respondent distributions of the auxiliary variables and the sample and population distributions

Discussion
Caveats from this work:
- The survey measures have to be the same quantities as in the population information, i.e. the survey questions have the same definitions and classifications as the population tables
- It is best to avoid questions that are prone to measurement errors, such as questions that require a strong cognitive effort or that may lead to socially desirable answers
- It is strongly recommended to use population statistics that are based on registrations or administrative data. The population-based R-indicators can be used with population statistics that are based on surveys, but these statistics may not reflect the true population distribution accurately, and one would draw erroneous conclusions about the representativeness of the response if the population estimates are biased

Discussion
Caveats from this work:
- In settings where only population information is available, options to improve representativeness during data collection through adaptive survey designs are much more limited, because no individual auxiliary information is available for the non-respondents
- In these settings, assessments of representativeness may still be useful in the design of advance and reminder letters, in interviewer training and in paradata collection
Extensions that are relatively straightforward for future research:
- Consider hybrid settings where the R-indicator is based on both linked data and population tables
- Develop the case where no aggregated population information is available and weighted survey estimates are used instead. This will impact on the bias and variance estimates for the population-based R-indicators

Thank you for your attention