Analysis of Complex Survey Data Day 5, Special topics: Developing weights and imputing data.


Part 1: Imputation using HOT DECK

What is HOT DECK? This procedure performs the Cox-Iannacchione Weighted Sequential Hot Deck (WSHD) imputation described in Cox (1980) and Iannacchione (1982), a methodology based on a weighted sequential sample selection algorithm developed by Chromy (1979). The procedure supports multivariate imputation (several variables imputed at the same time) and multiple imputation (several imputed versions of the same variable).

Vocab
Donor – An item respondent selected to provide a value for a missing item on a nonrespondent record.
Imputation class – A user-defined group used in the imputation process. Classes fall into three categories: classes that consist of only item respondent records, classes that consist of only item nonrespondent records, and classes that contain both item respondents and item nonrespondents. For classes with both item respondents and item nonrespondents, imputation is performed and donors are selected for missing values. For classes with only item respondents or only item nonrespondents, imputation is not performed.

Vocab
Imputation variable – A user-defined variable that contains some missing values on the input data file. A missing value for this variable will be populated with a donor value.
Item nonrespondent – A record for which imputation is performed on missing data.
Item respondent – A record from which values can be selected for imputation of missing item nonrespondent data.
All records on the input data file are defined as either an item respondent or an item nonrespondent. Users define the set of item respondents; item nonrespondents are the remaining records on the input file.

Getting started
Prepare a dataset that includes:
– The variable(s) you want to impute
– The variable(s) that will inform the imputation
– ID and weighting variables
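For instance, a minimal SAS sketch (the library, dataset, and variable names are hypothetical, loosely following the example file used later in these slides):

* Hypothetical prep step: keep only the variables needed for imputation;
data impready;
  set mylib.survey (keep=recnos sampwt strata psu gender agegrp race age depscore);
run;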

How do you decide what variables to include to inform the imputation? IMPORTANT ASSUMPTION: imputation assumes that, for a given variable with missing data, the missing-data mechanism within each imputation class is ignorable (also known as missing at random). The validity of the imputed values depends on how good the measures you use to inform the imputation are. This decision is theoretical rather than statistical (but we can use statistics to inform our decision). Choose variables that are strongly related to the outcome of interest: you want the response to be as homogeneous as possible within groups and as heterogeneous as possible across groups.
– E.g., if you are imputing depression score, you definitely want sex and age. The others depend on what you think impacts depression score: BMI? Smoking? Race? Education? Income?
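As a quick statistical check on candidate class variables, one might compare the outcome across classes; a hypothetical SAS sketch (DEPSCORE and AGEGRP are assumed names):

* Hypothetical check: is depression score homogeneous within candidate classes?;
proc means data=impready mean std;
  class gender agegrp;
  var depscore;
run;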

How much missing is too much? There is no rule that says when there is just too much missing data to use a variable. Some people use a 10% rule. It likely depends on how important this variable is to your analysis. Just remember: the more missing data, the less valid the imputed variable will be.
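Before committing to imputation, it helps to quantify the missingness; a minimal SAS sketch (variable names are hypothetical):

* How much is actually missing in each candidate variable?;
proc means data=impready n nmiss;
  var depscore bmi educ income;
run;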

Getting into the details
Sorting matters. The way the input file is sorted WITHIN each imputation class (defined by the IMPBY statement) affects the imputation results. The assignment of a selection probability to a potential donor, or item respondent, depends both on the donor's weight and on the weights of nearby item nonrespondents. In other words, both the weights and the sort order of observations play a role in the selection of donors for imputation in the WSHD algorithm.

Getting into the details
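A hypothetical sketch of the full HOT DECK call, pulling the pieces above together. Only IMPBY is named in these slides; the remaining statement names (WEIGHT, IMPVAR, IDVAR, OUTPUT) and all variable names are assumptions to verify against the SUDAAN manual:

PROC SORT DATA=impready;
  BY gender agegrp age;          * sort order within the IMPBY classes matters;
RUN;

PROC HOTDECK DATA=impready;
  WEIGHT sampwt;                 * weights drive donor selection probabilities;
  IMPBY gender agegrp;           * imputation classes;
  IMPVAR depscore;               * variable with missing values to impute (assumed statement name);
  IDVAR recnos;                  * carried along for merging back later;
  OUTPUT / FILENAME=imputed REPLACE;
RUN;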

Lab: Imputation using HOT DECK

Part 2: WTADJUST

Why would you need to calculate new weights?
Nonresponse adjustment
– One of the key variables in your analysis has high levels of missingness, and you don't want to impute
– In this case, you can reestimate the sample weights taking into account factors associated with being missing on the key variable
Post-stratification weight adjustment
– You don't like the referent population used for calculating the sample weights in your data
– E.g., most complex surveys weight the sample to be representative of the U.S. based on the 2000 Census; that may not be desirable if your data were collected well after 2000
– Post-stratification adjustment may also be useful to users who seek to create standardized weights or non-probability based sample weights

PROC WTADJUST
Designed to compute nonresponse and post-stratification weight adjustments. Uses a model-based, calibration approach that is somewhat similar to what is done with PROC LOGISTIC; it is a generalization of the classical weighting class approach for producing weight adjustments.

PROC WTADJUST
Advantages of the model-based approach:
– It allows the user to include more main effects and lower-order interactions of variables in the weight adjustment process. This can reduce bias in estimates computed using the adjusted weights.
– It allows you to estimate the statistical significance of the variables used in the adjustment process.
– Unlike traditional methods, it can incorporate continuous variables.

PROC WTADJUST In fact, if all interaction terms are included in the weight adjustment model for a given set of categorical variables, the model-based approach is equivalent to the weighting class approach

The weight adjustment model

$\alpha_k = t_k \, a_k, \quad k \in D$

where
– $k$ is an index corresponding to each record in the domain of interest
– $D$ is the domain of interest (SUBPOPN statement)
– $\alpha_k$ is the final weight adjustment for each record $k$ in $D$; this is the key output variable from this procedure
– $t_k$ is a weight trimming factor that will be computed before the B-parameters of the exponential model (i.e., the parameters of $a_k$) are estimated
– $a_k$ is the nonresponse or post-stratification adjustment computed after the weight trimming step

The weight adjustment model

$a_k = \dfrac{\ell_k (u_k - c_k) + u_k (c_k - \ell_k) \exp(A_k \mathbf{x}_k' \boldsymbol{\beta})}{(u_k - c_k) + (c_k - \ell_k) \exp(A_k \mathbf{x}_k' \boldsymbol{\beta})}$

where
– $\ell_k$ is the lower bound imposed on the adjustment
– $u_k$ is the upper bound imposed on the adjustment
– $c_k$ is the centering constant
– $A_k = (u_k - \ell_k) / [(u_k - c_k)(c_k - \ell_k)]$ is a constant used to control the behavior of $a_k$ as the upper and lower bounds get closer to the centering constant
– $\mathbf{x}_k$ is a vector of explanatory variables
– $\boldsymbol{\beta}$ are the model parameters that will be estimated within the procedure
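Assuming the generalized exponential form written above (the slide's own formula appeared as an image, so this form is a reconstruction), a quick check shows how the bounds and centering constant act:

$$\lim_{\mathbf{x}_k'\boldsymbol{\beta}\,\to\,-\infty} a_k = \ell_k, \qquad \lim_{\mathbf{x}_k'\boldsymbol{\beta}\,\to\,+\infty} a_k = u_k, \qquad a_k\big|_{\mathbf{x}_k'\boldsymbol{\beta}=0} = c_k$$

so every adjustment stays between the lower and upper bounds, and a record with a zero linear predictor receives exactly the centering value.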

The weight adjustment model
– $w_k$ is the input weight for record $k$ (whatever is on the WEIGHT statement)
– The dependent variable in the modeling procedure: for nonresponse adjustments, this variable should be set to one for records corresponding to eligible respondents and to zero for records corresponding to eligible nonrespondents. For post-stratification adjustments, this variable should be set to one for all records that should receive a post-stratification adjustment (if that's everyone, just use the option "_ONE_").

Weight Trimming
Reducing the variance in your weights will reduce the variance in your estimates (which is good!). So, you might want to 'trim' the weights to be within certain bounds. For example, the 99-year-old daily cocaine user might have a really extreme weight. We might want to rein that person in to have a weight that's similar to a 60-year-old daily cocaine user.

How do you decide the bounds on the weight trimming factor? There are many ways to do this. One relatively simple approach is to partition the sample into small subpopulations (e.g., by strata or by levels of some covariate of interest). Within each of the subpopulations, compute the interquartile range (IQR) of the input weights, and set the trimming bounds as a function of the IQR, as sketched below.
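One possible rule, assuming bounds at the median plus or minus three IQRs (the multiplier of 3 is an illustrative choice, not from the slides):

* Illustrative IQR-based trimming bounds, computed within strata;
proc univariate data=impready noprint;
  class strata;
  var sampwt;
  output out=bounds median=med q1=q1 q3=q3;
run;

data bounds;
  set bounds;
  iqr   = q3 - q1;
  wtmin = max(med - 3*iqr, 1);   * multiplier of 3 is an assumption -- tune it;
  wtmax = med + 3*iqr;
run;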

A simple example
The input file contains:
RECNOS – Unique record identifier
SAMPWT – Base sampling weight for each person
STRATA – Stratum identifier
PSU – Primary sampling unit identifier
GENDER – Gender of the sampled person
RACE – Race of the sampled person
AGE – Age of the sampled person
ELIG – Yes/No variable indicating whether or not the record is eligible
RESP – Yes/No variable indicating whether the record on file corresponds to a respondent

A simple example
To compute a nonresponse adjustment that will correct the sample weights of respondents for those people who did not respond to the survey, we use the following code:
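The code itself appeared as an image, so the sketch below reconstructs it from the statements described on the next slides. The ADJUST= option spelling, the CLASS statement, and the OUTPUT syntax are assumptions to check against the SUDAAN manual; the bound and cutoff values come from the discussion that follows:

PROC WTADJUST DATA=mydata DESIGN=WR ADJUST=NONRESPONSE;
  NEST strata psu;               * design information;
  WEIGHT sampwt;                 * input weight w_k;
  SUBPOPN elig=1;                * only eligible records are considered;
  IDVAR recnos;                  * for merging adjustments back on;
  WTMIN 10;                      * trim weights below 10 ...;
  WTMAX 15000;                   * ... and above 15,000 before adjustment;
  LOWRBD 1.0;                    * bounds on the adjustment itself;
  UPPERBD 3.0;
  CENTER 2.0;                    * centering constant c_k;
  CLASS gender race;             * declare categorical predictors;
  MODEL resp = gender race age;  * age enters as continuous;
  OUTPUT idvar trimfactor adjfactor wtfinal / FILENAME=adjust REPLACE;
RUN;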

The DESIGN=WR option, coupled with the NEST and WEIGHT statements, provides the design information for WTADJUST so that the procedure can compute appropriate design-based variances of the model parameters B. The variable SAMPWT is $w_k$. A SUBPOPN elig=1 statement is used to tell SUDAAN to consider only eligible records; in this example, we seek weight adjustments that will correct the sample weights of respondents for eligible nonrespondents. The IDVAR statement is included so that the OUTPUT file, ADJUST, contains a variable that can be used to merge the adjustments back to the original file. In this example, the merge-by variable is RECNOS.

The WTMAX and WTMIN statements are included. These are optional statements. A fixed value can be used in these statements, in which case the fixed value applies to all records $k$; optionally, a variable can be used, for cases where a different WTMAX and/or WTMIN is desired for different sets of respondents. In this particular example, the user would like to truncate any weight that is less than 10 or greater than 15,000 prior to computing the actual nonresponse weight adjustment. Similarly, the UPPERBD and LOWRBD statements are included. These are also optional statements, and likewise accept either a fixed value applying to all records $k$ or a variable. In this particular example, the user would like to truncate or bound the resulting weight adjustments, $\alpha_k$, so that no weight adjustment falls below 1.0 or above 3.0.

The CENTER statement is included. This is also an optional statement. A fixed value can be used in this statement, in which case the fixed value applies to all records $k$; optionally, a variable can be used. In this particular example, the value of $c_k$ is set equal to 2.0 for each record.

The MODEL statement tells WTADJUST that RESP is the 0/1 indicator for response status and that the user would like to use the main effects of the categorical variables GENDER and RACE in the model. If the user also wants the interaction of GENDER and RACE, then, as in all other SUDAAN procedures, they would add the term GENDER*RACE to the right-hand side of the MODEL statement. The user is also specifying that AGE be included in the model as a continuous variable.
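For instance, a sketch of the expanded MODEL statement (same caveats as the reconstructed call above):

MODEL resp = gender race gender*race age;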

The output file contains:
TRIMFACTOR – This is $t_k$. In our example, this variable is assigned a value that will force $t_k w_k$ to equal 10 for those records where $w_k < 10$, and to equal 15,000 for those records with $w_k > 15{,}000$. For records with $w_k$ between 10 and 15,000, the value of $t_k$ will be equal to 1.0.
ADJFACTOR – This will hold the values of the weight adjustment factors.

Suppose in this example that we have tabulated the weighted sums of the explanatory variables over all eligible records. WTADJUST is designed to yield model-based weight adjustments ($\alpha_k$) that will force the adjusted weighted sums of the model explanatory variables to equal those totals. In other words, if you were to compute the weighted sum of each explanatory variable using only those records that satisfy RESP=1 and using the adjusted sample weight WTFINAL, the totals you would obtain would be equivalent to the totals computed over all eligible records with the input weights.
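This calibration property can be verified directly; a hypothetical SAS check, assuming 0/1 indicator codings (MALE, RACEBLACK) for the categorical levels:

* Weighted sums among respondents using the adjusted weight ...;
proc means data=adjusted sum;
  where resp = 1;
  weight wtfinal;
  var male raceblack age;   * assumed indicator variables;
run;

* ... should match the weighted sums over all eligibles using the input weight;
proc means data=adjusted sum;
  where elig = 1;
  weight sampwt;
  var male raceblack age;
run;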

Now a post-stratification example
Suppose instead that we were interested in obtaining a post-stratification adjustment that would force the nonresponse-adjusted respondent weights to equal a set of known control totals. Let's say we merged our nonresponse-adjusted respondent weights back into the dataset and named them WTNONADJ. Then getting the post-stratification adjustment is easy:
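A hypothetical sketch of the call. The ADJUST= option spelling, the CLASS statement, the OUTPUT syntax, and all control totals are placeholders/assumptions; per the next slide, the first POSTWGT value is the overall total and the next eight correspond to the GENDER*AGEGRP*RACE cells:

PROC WTADJUST DATA=mydata2 DESIGN=WR ADJUST=POST;
  NEST strata psu;
  WEIGHT wtnonadj;               * nonresponse-adjusted input weight;
  SUBPOPN resp=1;                * respondents only;
  IDVAR recnos;
  CLASS gender agegrp race;
  MODEL _one_ = gender*agegrp*race;
  POSTWGT 20000                  /* overall control total (hypothetical) */
          2000 3000 2500 2500    /* eight hypothetical cell totals, */
          2500 2500 2500 2500;   /* which sum to the overall total  */
  OUTPUT idvar adjfactor wtfinal / FILENAME=postadj REPLACE;
RUN;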

Now a post-stratification example
We no longer need weight trimming or upper and lower bounds. The POSTWGT statement contains the control totals for the post-stratification adjustment. These numbers should correspond, in order, to the B model parameters. Unless the NOINT option is specified, SUDAAN always includes an intercept in the model first. Consequently, the first POSTWGT value corresponds to the overall control total; in this case, that is the sum of the eight cell totals. The next eight numbers in the POSTWGT statement are control totals corresponding to the GENDER*AGEGRP*RACE interaction. Note that control totals should be supplied for the reference levels associated with any explanatory variable or interaction term.

Lab 5: Calculating sample weights