Weighting and imputation PHC 6716 July 13, 2011 Chris McCarty.

2 Weighting
Weighting is the process of adjusting the contribution of each observation in a survey sample based on independent knowledge about the appropriate distributions.
Before weighting, the implied weight of each observation is 1.0.
After weighting, some observations will have weights above 1.0, some below 1.0, and some exactly 1.0.
No observation should have a weight of 0.
Two general types of weighting:
– Design weights: adjust for differences due to intentional disproportionate sampling (e.g. over-sampling African Americans or certain regions)
– Post-stratification weights: adjust for differences in population or households when the released sample is intended to be representative (e.g. adjustments for non-response among young people)

3 Common sources for calculating weights
U.S. Census
Current Population Survey
American Community Survey
For Florida county, age, race, and ethnicity: the BEBR Population Program

4 How frequency procedures use weights
All statistical packages have options on procedures to incorporate weights.
For frequency procedures, the weights are multiplied by the unweighted frequencies, then percentages are calculated on the result.

5 How to make a simple weight
A Region | B Frequency | C Percent of Sample | D Percent From Other Source | E Weight (D/C) | F Adjusted Frequency (B*E) | G Adjusted Percent
North | 1 | 10.0 | 25.0 | 2.5 | 2.5 | 25.0
South | 4 | 40.0 | 25.0 | 0.625 | 2.5 | 25.0
East | 2 | 20.0 | 25.0 | 1.25 | 2.5 | 25.0
West | 3 | 30.0 | 25.0 | 0.833 | 2.5 | 25.0
Total | 10 | 100.00 | | - | 10 | 100.00
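The table's arithmetic can be reproduced with a short Python sketch. The function and variable names here are illustrative only and do not come from the slides; the counts and target percentages are the ones in columns B and D.

```python
# Sample counts by region (column B) and target percentages
# from the outside source (column D), taken from the table above.
sample_counts = {"North": 1, "South": 4, "East": 2, "West": 3}
target_percent = {"North": 25.0, "South": 25.0, "East": 25.0, "West": 25.0}


def simple_weights(counts, targets):
    """Return weight = target percent / sample percent for each category."""
    total = sum(counts.values())
    weights = {}
    for category, count in counts.items():
        sample_percent = 100.0 * count / total  # column C
        weights[category] = targets[category] / sample_percent  # column E = D / C
    return weights


weights = simple_weights(sample_counts, target_percent)
# Column F: adjusted frequency = unweighted frequency * weight.
adjusted = {r: sample_counts[r] * weights[r] for r in sample_counts}

for region in sample_counts:
    print(region, round(weights[region], 3), round(adjusted[region], 3))
```

Each region ends up with an adjusted frequency of 2.5, which is 25 percent of the 10 observations, matching column G.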

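6 What that would look like in data set
Observation | Region | Employed | Weight
1 | N | Y | 2.5
2 | S | N | 0.625
3 | S | Y | 0.625
4 | S | Y | 0.625
5 | S | N | 0.625
6 | E | N | 1.25
7 | E | N | 1.25
8 | W | Y | 0.833
9 | W | Y | 0.833
10 | W | N | 0.833
Total | - | - | 9.99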

7 Original and adjusted frequency of Employment variable
Employed | Frequency | Percent | Frequency Adjusted to Weights* | Percent Adjusted to Weights
Y | 5 | 50.00 | 5.416 | 54.21
N | 5 | 50.00 | 4.583 | 45.87
Total | 10 | 100.00 | 9.99 | 100.00
*This is the sum of the weights for the category
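A weighted frequency is the sum of the weights in each category. The sketch below recomputes the Employed table from the ten observations; the names are illustrative, and it uses the exact West weight (25/30) rather than the rounded 0.833, so the total comes to 10 and the Y share to about 54.17 percent instead of the 9.99 and 54.21 shown on the slide.

```python
from collections import defaultdict

# Region weights from the simple-weight table (West kept exact as 25/30).
region_weight = {"N": 2.5, "S": 0.625, "E": 1.25, "W": 25.0 / 30.0}

# (region, employed) for observations 1 through 10.
observations = [
    ("N", "Y"), ("S", "N"), ("S", "Y"), ("S", "Y"), ("S", "N"),
    ("E", "N"), ("E", "N"), ("W", "Y"), ("W", "Y"), ("W", "N"),
]


def weighted_frequencies(rows, weights_by_region):
    """Return {category: (weighted frequency, weighted percent)}."""
    totals = defaultdict(float)
    for region, employed in rows:
        # Each observation contributes its weight instead of a count of 1.
        totals[employed] += weights_by_region[region]
    grand_total = sum(totals.values())
    return {k: (v, 100.0 * v / grand_total) for k, v in totals.items()}


result = weighted_frequencies(observations, region_weight)
for category, (freq, pct) in result.items():
    print(category, round(freq, 3), round(pct, 2))
```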

8 Notes on weighting
Typically you don't want weights to make enormous differences.
Keep in mind that with weighting you are saying you have information from outside the survey process that tells you the proper distribution.
You could conceivably up-weight results from a small sample stratum.
Weights are typically used for accurate estimates of prevalence.
Models where you test relationships do not need weights if you include the variables you would use to weight.

9 Weighting with more than one variable
Combine weights by multiplication
– Create an individual weight for each variable, then multiply the weights to get a single weight (W_age * W_gender)
– Not a good solution with a lot of variables
Combine weights iteratively
– Calculate a weight for the first variable using its frequency table
– Use that weight in the frequency of the second variable to create a weight
– Use that weight in the frequency of the third variable to create a weight
– And so on
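The iterative approach can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the course: the sample rows and target proportions are hypothetical, and repeating the cycle over the variables several times (often called raking) is an assumption added here, since adjusting the second variable can pull the first slightly off target.

```python
def weighted_share(rows, weights, variable):
    """Weighted proportion of each category of `variable`."""
    total = sum(weights)
    share = {}
    for row, w in zip(rows, weights):
        share[row[variable]] = share.get(row[variable], 0.0) + w / total
    return share


def rake(rows, targets, passes=25):
    """Adjust weights one variable at a time, cycling `passes` times."""
    weights = [1.0] * len(rows)  # implied weight before weighting is 1.0
    for _ in range(passes):
        for variable, target in targets.items():
            share = weighted_share(rows, weights, variable)
            # Scale each weight by target / current weighted share.
            weights = [
                w * target[row[variable]] / share[row[variable]]
                for row, w in zip(rows, weights)
            ]
    return weights


# Hypothetical sample and hypothetical population targets.
rows = [
    {"age": "young", "gender": "F"}, {"age": "young", "gender": "M"},
    {"age": "young", "gender": "M"}, {"age": "old", "gender": "F"},
    {"age": "old", "gender": "F"}, {"age": "old", "gender": "F"},
    {"age": "old", "gender": "M"}, {"age": "old", "gender": "M"},
]
targets = {
    "age": {"young": 0.5, "old": 0.5},
    "gender": {"F": 0.5, "M": 0.5},
}

weights = rake(rows, targets)
print(weighted_share(rows, weights, "age"))
print(weighted_share(rows, weights, "gender"))
```

Because each step rescales within one variable only, the sum of the weights stays equal to the sample size throughout.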

10 Consumer Confidence
Survey of approximately 500 Florida households each month
RDD (random-digit dialing) landline survey
Five questions (components) averaged into an Overall Index
Until now, the only weighting has been post-stratification by the proportion of households in each county

11 Potential weighting variables: County
Typically we get under-representation from large south Florida counties (Miami-Dade) and over-representation from northern counties (Alachua)
Household proportions are estimated between census years by BEBR
Weights June 2011.xls

12 Potential weighting variables: Age
RDD tends to lead to over-sampling of seniors with landlines
Cell phones emerged as a problem around 2005
No reliable age group data until the 2010 Census
Elderly respondents tend to be less confident than younger respondents due to fixed incomes
Weights June 2011.xls

13 Potential weighting variables: Hispanic Ethnicity
Cell phones tend to be used disproportionately by Hispanics
The 2010 Census provided reliable data about the proportion of Hispanic Floridians
Hispanics tend to have lower confidence than non-Hispanics
Weights June 2011.xls

23 Example 2 – FHIS
The state of Florida wanted to estimate rates of the uninsured.
They stratified the state into 17 regions and wanted to be able to make estimates for the state and for each region with a tolerable margin of error.
At the state level they wanted to be able to say something about Blacks, Hispanics, and those under 200 percent of the poverty level.
Data on these demographics for each Florida telephone exchange were obtained prior to sampling.
Strata were created from exchanges.
This made it possible to create weights based on known households in each exchange.
This required design weights to adjust for disproportionate sampling.
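One common way to build a design weight is to divide the known number of households in a stratum by the number of completed interviews in it, so each respondent stands in for that many households. The sketch below assumes this form; the slides do not say exactly how the FHIS weights were computed, and the strata and all counts are hypothetical.

```python
# Hypothetical strata: known households, completed interviews,
# and uninsured respondents among those interviews.
strata = {
    "stratum_a": {"households": 50000, "interviews": 500, "uninsured": 100},
    "stratum_b": {"households": 10000, "interviews": 500, "uninsured": 50},
}


def design_weight(households, interviews):
    """Households represented by each completed interview."""
    return households / interviews


def weighted_rate(strata):
    """Uninsured rate with each stratum weighted by its design weight."""
    numerator = 0.0
    denominator = 0.0
    for s in strata.values():
        w = design_weight(s["households"], s["interviews"])
        numerator += w * s["uninsured"]
        denominator += w * s["interviews"]
    return numerator / denominator


unweighted = (
    sum(s["uninsured"] for s in strata.values())
    / sum(s["interviews"] for s in strata.values())
)
rate = weighted_rate(strata)
print(round(unweighted, 4), round(rate, 4))
```

In this made-up case the smaller stratum was sampled at five times the rate of the larger one, so the unweighted rate gives it too much influence.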

24 Example 3 – Medicaid survey
The state wanted to evaluate Medicaid Reform being conducted in Duval and Broward counties.
They wanted to administer a modified CAHPS instrument to adults and children separately.
They wanted to stratify by plan as well, sampling a minimum number of observations per plan.
In the end they wanted to compare plans, counties, and adults and children.
These weights required knowledge about total enrollment for each of these characteristics (plan, age, county).

25 Imputation
Like weighting, imputation involves adjusting the analysis after data collection.
Unlike weighting, imputation is the deliberate creation of data that were not actually collected.
The main reason for imputation is to retain observations in a statistical analysis that would otherwise be left out.
Your ability to discover significant results may be compromised by too many missing values.
In some cases there may be systematic bias associated with missing data, so that not imputing presents an unrepresentative result.

26 Imputation and regressions
Imputation is particularly common when data are analyzed with regression analysis.
A regression model explains the variability in a dependent variable using one or more independent variables.
Observations can only be included in the regression if they have values for all variables in the model.
Models with a lot of variables increase the probability that an observation will have at least one missing value.

27 Example Model
Income = β1(Age) + β2(Education) + β3(Employed)
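For the example model, an observation drops out of the regression if any one of Income, Age, Education, or Employed is missing. The sketch below counts how many records survive; the records are hypothetical and the field names are chosen only to mirror the model.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"income": 42000, "age": 35, "education": 16, "employed": 1},
    {"income": 38000, "age": None, "education": 12, "employed": 1},
    {"income": None, "age": 51, "education": 14, "employed": 0},
    {"income": 55000, "age": 44, "education": 18, "employed": 1},
    {"income": 29000, "age": 23, "education": None, "employed": 0},
]

model_variables = ["income", "age", "education", "employed"]


def complete_cases(rows, variables):
    """Keep only rows with a value for every variable in the model."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


usable = complete_cases(records, model_variables)
print(len(usable), "of", len(records), "records can enter the regression")
```

Here each incomplete record is missing only one value, yet three of the five records are lost, which is the situation imputation is meant to address.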

28 Imputation algorithms
Two general categories:
– Random imputation assigns values randomly, often based on a desired statistical distribution
– Deterministic imputation typically assigns values based on existing knowledge
Existing knowledge could be in the data:
– Single imputation fills missing data with one value, such as the mean of all non-missing values for a continuous variable
– Multiple imputation fills in missing data with a set of plausible values
– Hot deck imputation fills in missing values with those of an observation that matches on key variables
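Two of these approaches are simple enough to sketch directly. The code below is a minimal illustration with hypothetical records and illustrative function names: mean imputation fills every gap with the mean of the observed values, and a basic hot deck copies the value from a randomly chosen donor that matches on the key variables, leaving the gap in place when no donor exists.

```python
import random
from statistics import mean


def mean_impute(values):
    """Single imputation: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def hot_deck_impute(rows, target, keys, rng):
    """Fill missing `target` from a donor row matching on `keys`."""
    donors = {}
    for r in rows:
        if r[target] is not None:
            donors.setdefault(tuple(r[k] for k in keys), []).append(r[target])
    completed = []
    for r in rows:
        r = dict(r)  # copy so the original rows are left untouched
        if r[target] is None:
            pool = donors.get(tuple(r[k] for k in keys))
            if pool:  # no matching donor: the value stays missing
                r[target] = rng.choice(pool)
        completed.append(r)
    return completed


# Hypothetical records; None marks a missing income.
records = [
    {"region": "N", "employed": "Y", "income": 40000},
    {"region": "N", "employed": "Y", "income": None},
    {"region": "S", "employed": "N", "income": 20000},
    {"region": "S", "employed": "N", "income": None},
    {"region": "W", "employed": "Y", "income": None},
]

print(mean_impute([r["income"] for r in records]))
completed = hot_deck_impute(records, "income", ["region", "employed"], random.Random(0))
print([r["income"] for r in completed])
```

Mean imputation gives every missing case the same value, which shrinks the variance of the variable; the hot deck keeps values that were actually observed for similar respondents.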

