1 Statistical Disclosure Control Methods for Census Outputs
Natalie Shlomo
SDC Centre, ONS
January 11, 2005
2 Topics for Discussion
Introduction
Developing SDC methods
» Output requirements, risk management
» Risk-utility decision problem
Methods for protecting census outputs
» Pre- and post-tabular methods
» Safe settings, contracts
Work plan for Census 2011
» Assessment of Census 2001 SDC methods
» User and other stakeholders' requirements
» Assessment of alternative methods
3 Introduction
A previous ONS talk covered SDC implementation for Census 2001 and the lessons learnt.
This talk focuses on developing SDC methodology for Census 2011 based on census output requirements and user needs more generally. The planned approach is a risk-utility decision framework.
Goal: to provide adequate protection against the risk of re-identification while taking into account user needs, output requirements and the usefulness of the data.
Improve on existing and develop new SDC methods, assess the impact of SDC on data quality, and clarify the advantages and disadvantages of each method.
4 Developing SDC Methodology
What are the Census output requirements?
» Variables to be disseminated
» Standard and flexible (web-generated?) tables
» Origin-destination tables (workplace tables)
» SARs microdata
What are the disclosure risk scenarios, i.e. realistic assumptions about information available to the public that increases the probability of disclosure?
Comment: note that the problem is cells of size 1 and 2 in tables, not the 0's (except in extreme cases).
5 Disclosure risk measures quantify the risk of re-identification:
» the probability that a sample unique is a population unique
» the probability of a correct match between a record in the microdata and a record in an external file
» the probability that a record is perturbed
Information loss measures quantify the loss of information content in the data as a result of the SDC method. Utility depends on user needs and the use of the data:
» distortion to distributions (bias)
» weakening of measures of association
» impact on the variance of estimates
» changes to likelihood functions
6 Developing SDC Methodology
Methods for protecting outputs:
» Data masking:
perturbative methods - methods that alter the data: swapping, random noise, over-imputation, rounding
non-perturbative methods - methods that preserve data integrity: recoding, sub-sampling, suppression
» Data access under contract in a safe setting, together with checks that released outputs are non-disclosive
Need to develop SDC methods for checking outputs, e.g. residuals of regression analyses, bandwidths for kernel estimation of distributions.
7 Developing SDC Methodology
In some cases, parameters can be given to users so that they can correct their analyses, or the parameters are embedded in the SDC method to minimize information loss.
SDC is an optimization problem: choose the SDC method m applied to the data X that maximizes the utility of the released data, U(m(X)), subject to the constraint that the disclosure risk stays below a threshold, R(m(X)) ≤ τ.
Risk-utility decision framework for choosing the optimum method.
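To make the risk-utility decision rule concrete, here is a minimal sketch in Python. The method names, risk and utility scores, and the threshold are hypothetical figures for illustration only, not results from this work.

```python
# Hypothetical candidate SDC methods, each scored with an illustrative
# disclosure risk R(m(X)) and utility U(m(X)).
candidates = {
    "random_swap_5pct":    {"risk": 0.82, "utility": 0.95},
    "random_swap_20pct":   {"risk": 0.55, "utility": 0.80},
    "targeted_swap_10pct": {"risk": 0.48, "utility": 0.88},
    "small_cell_adjust":   {"risk": 0.40, "utility": 0.75},
}

RISK_THRESHOLD = 0.50  # illustrative threshold, not an ONS figure


def choose_method(candidates, threshold):
    """Return the method with maximum utility among those meeting the risk constraint."""
    feasible = {m: s for m, s in candidates.items() if s["risk"] <= threshold}
    if not feasible:
        return None  # no method satisfies the constraint: revisit the methods or the threshold
    return max(feasible, key=lambda m: feasible[m]["utility"])


print(choose_method(candidates, RISK_THRESHOLD))  # -> targeted_swap_10pct
```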
8 SDC for Census Outputs: Pre-tabular Methods
1. Random Record Swapping (UK 2001, USA 1990)
A small percentage of records have geographical identifiers (or other variables) swapped with other records matching on control variables (larger geographical areas, household size and age-sex distribution).
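A minimal sketch of random record swapping, assuming a pandas DataFrame of household records with a geography column and control-variable columns. The column names and the simple pairing rule are assumptions for illustration, not the production algorithm.

```python
import numpy as np
import pandas as pd  # for the DataFrame of household records passed in


def random_swap_geography(df, strata_cols, geo_col, swap_rate=0.05, seed=1):
    """Within each control stratum, randomly pair a small fraction of households
    and swap their geographical identifiers."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, idx in df.groupby(strata_cols).groups.items():
        idx = list(idx)
        n_pairs = int(len(idx) * swap_rate / 2)
        if n_pairs == 0:
            continue
        chosen = rng.choice(idx, size=2 * n_pairs, replace=False)
        for a, b in zip(chosen[:n_pairs], chosen[n_pairs:]):
            out.loc[a, geo_col], out.loc[b, geo_col] = df.loc[b, geo_col], df.loc[a, geo_col]
    return out


# Hypothetical usage: swap ward codes between households matched on region,
# household size and an age-sex group.
# swapped = random_swap_geography(households, ["region", "hh_size", "agesex_group"], "ward")
```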
9 Pre-tabular Methods
2. Targeted Record Swapping (USA 2000)
A large percentage of records that are unique on a set of key variables (large households, ethnicity) have geographical identifiers (or other variables) swapped with other records matching on control variables.
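For targeted swapping, the risky records have to be identified first. Below is a minimal sketch of flagging households that are unique on a set of key variables; the key variables shown are assumptions for illustration.

```python
import pandas as pd


def flag_uniques(df, key_cols):
    """Flag records that are unique on the chosen key variables; targeted swapping
    would apply the geography swap only to these records."""
    group_size = df.groupby(key_cols)[key_cols[0]].transform("size")
    return group_size == 1


# Hypothetical usage with assumed column names:
# risky = flag_uniques(households, ["hh_size_band", "ethnicity", "ward"])
# households_to_swap = households[risky]
```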
10 Pre-tabular Methods
3. Over-Imputation
A percentage of randomly selected records have certain variables erased, and standard imputation methods are applied, selecting donors that match on control variables.
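A minimal sketch of over-imputation with hot-deck donors drawn from the same control stratum. The column names, erase rate and donor rule are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def over_impute(df, target_col, strata_cols, rate=0.10, seed=1):
    """Erase the target variable for a random subset of records and re-impute it
    by hot-deck, drawing a donor value from the same control stratum."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    erase = pd.Series(rng.random(len(df)) < rate, index=df.index)
    for _, idx in df.groupby(strata_cols).groups.items():
        idx = pd.Index(idx)
        recipients = idx[erase.loc[idx].to_numpy()]
        donors = idx.difference(recipients)
        if len(recipients) == 0 or len(donors) == 0:
            continue  # nothing to impute, or no donors left in this stratum
        out.loc[recipients, target_col] = rng.choice(
            df.loc[donors, target_col].to_numpy(), size=len(recipients))
    return out


# Hypothetical usage: re-impute the ward code within sex/age-group strata.
# imputed = over_impute(households, "ward", ["sex", "age_group"], rate=0.10)
```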
11 Pre-tabular Methods
4. Post-Randomisation Method (PRAM) (UK 2001)
A percentage of records have certain variables misclassified according to prescribed probabilities. Includes a variant that preserves marginal (compounded) distributions and edit constraints.
Advantages:
» Marginal (compounded) distributions and edits can be maintained
» Method can be targeted to risky records
» Consistent totals between tables
» Some protection against disclosure by differencing
Disadvantages:
» All records (risky or not) have the same probability of being perturbed
» Errors (bias and variance of estimates) due to perturbation
» Need for re-editing and further imputations
» Probability of perturbation very small
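A minimal sketch of PRAM-style misclassification using a transition matrix. The matrix values are illustrative; the invariant, edit-preserving variant mentioned above needs extra machinery not shown here.

```python
import numpy as np


def pram(values, categories, transition, seed=1):
    """Replace each record's category by a draw from the row of the transition
    matrix corresponding to its current category."""
    rng = np.random.default_rng(seed)
    index = {c: i for i, c in enumerate(categories)}
    return [categories[rng.choice(len(categories), p=transition[index[v]])]
            for v in values]


# Illustrative example: three categories, 90% chance of keeping the true value.
categories = ["A", "B", "C"]
transition = np.array([[0.90, 0.05, 0.05],
                       [0.05, 0.90, 0.05],
                       [0.05, 0.05, 0.90]])
print(pram(["A", "B", "C", "A"], categories, transition))
```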
12 Preliminary Evaluation of Record Swapping
16,120 households from the 1995 Israel CBS Census sample file.
Households randomly swapped within control strata: broad region, type of locality, age groups (10) and sex.
Strata collapsed for unique records with no swap pair.
Disclosure risk measure: defined over cells of size 1 and 2.
Information loss measure: distortion to distributions (see the following slides).
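The formulas for the two measures did not survive the extraction of this slide. As an illustration only (assumed forms, not necessarily those used in the original evaluation), a disclosure risk measure of this kind often counts small cells that remain unperturbed, and an information loss measure often takes a distance between the original and perturbed distributions:

```latex
% Assumed, illustrative forms only -- not reproduced from the original slide.
\[
  DR = \frac{\#\{c : F_c \in \{1,2\},\ \tilde{F}_c = F_c\}}
            {\#\{c : F_c \in \{1,2\}\}},
  \qquad
  IL = \frac{\sum_c \lvert \tilde{F}_c - F_c \rvert}{\sum_c F_c},
\]
% where $F_c$ and $\tilde{F}_c$ are the cell counts before and after the SDC method.
```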
13 Preliminary Evaluation of Record Swapping
Risk-Information Loss (R-IL) map:
» Swapping rates: 5%, 10%, 15%, 20%
» Information loss: distortion to the distribution of the number of persons per household in each district
14 Preliminary Evaluation of Record Swapping
Random record swapping vs. targeted record swapping (targeted at uniques within control strata, i.e. large households).
» Swapping rate: 10%
» Information loss: distortion to the distribution of the number of persons per household in each district
15 Preliminary Evaluation of Over-Imputation
10% of households had their geographic identifier erased:
» Random selection of households
» Targeted selection of households from unique control strata (i.e. large households)
Geographic identifier imputed using hot-deck imputation within strata defined by sex and age groups.
Risk and information loss measures as in the record swapping evaluation.
16 Preliminary Evaluation of Over-Imputation
Risk-Information Loss (R-IL) map:
» 10% selected records (random and targeted on uniques)
» Information loss: distortion to the distribution of the number of persons per household in each district
[Chart: Risk-Information Loss Assessment - 10% random and targeted record swapping (blue) vs. 10% random and targeted over-imputation (pink), plotting risk against information loss.]
17 Final Comments on Pre-tabular Methods
Geographical identifiers are swapped because they introduce fewer edit failures and are generally less correlated with other variables. If other variables were swapped (or over-imputed), such as age, the data would be badly damaged, a large amount of re-editing would be necessary and further imputations would have to be carried out.
Swapping does not affect distributions at the higher (geographical) level within which the records are swapped. This is an advantage, not a disadvantage.
Over-imputation is similar to record swapping but causes more damage to the data. The "missing at random" assumption is problematic for the analysis of full data sets.
18 SDC for Census Outputs: Post-tabular Methods
1. Barnardization (UK 1991)
Every internal cell in an output table is modified by (+1, 0, -1) according to prescribed probabilities (q, 1-2q, q). No adjustments are made to zero cells.
Advantages:
» Some protection against disclosure by differencing
Disadvantages:
» High proportion of risky (unique) records unperturbed
» Inconsistent totals between tables, since margins are calculated from perturbed internal cells
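A minimal sketch of Barnardization of a table of counts; the value of q is illustrative.

```python
import numpy as np


def barnardize(table, q=0.1, seed=1):
    """Add +1, 0 or -1 to every internal cell with probabilities (q, 1-2q, q),
    leaving zero cells unchanged."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=int)
    noise = rng.choice([1, 0, -1], size=table.shape, p=[q, 1 - 2 * q, q])
    noise[table == 0] = 0                  # no adjustment to zero cells
    return np.maximum(table + noise, 0)    # safeguard: counts never drop below zero


# Illustrative 2x3 table of counts.
print(barnardize([[3, 1, 0], [7, 2, 5]], q=0.1))
```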
19 Post-tabular Methods
2. Small Cell Adjustments (UK 2001, Australia)
Small cells are randomly adjusted upwards or downwards to a base using an unbiased stochastic scheme with prescribed probabilities.
Advantages:
» Protects the risky (unique) records
» Lower loss of information for standard tables
Disadvantages:
» Inconsistent totals between tables, since margins are calculated from perturbed internal cells
» Can produce high errors in totals
» Little protection against disclosure by differencing
» Implementation problems for sparse tables (e.g. origin-destination tables)
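A minimal sketch of an unbiased small cell adjustment, assuming a base of 3 for illustration; the actual base and probabilities used in practice are not stated on the slide.

```python
import numpy as np


def adjust_small_cells(table, base=3, seed=1):
    """Move each cell smaller than the base either down to zero or up to the base,
    with the 'up' probability equal to value/base so the adjustment is unbiased."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=float)
    out = table.copy()
    small = (table > 0) & (table < base)
    p_up = table[small] / base             # E[adjusted cell] = base * p_up = original value
    out[small] = np.where(rng.random(small.sum()) < p_up, base, 0)
    return out.astype(int)


print(adjust_small_cells([[1, 2, 0], [4, 1, 6]]))
```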
20 Post-tabular Methods
3. Unbiased Random Rounding (UK NeSS, New Zealand, Canada)
All cells in tables are rounded up or down according to an unbiased prescribed probability scheme.
Advantages:
» Provides good protection against disclosure by differencing (although not a 100% guarantee)
» Easy to apply
» Totals are consistent between tables within the rounding base
Disadvantages:
» Rounds all cells, including safe cells
» Requires complex auditing to ensure protection
» Totals rounded independently of internal cells, so tables are not additive
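A minimal sketch of unbiased random rounding; a base of 5 is used for illustration (base 3 is also common in practice).

```python
import numpy as np


def random_round(table, base=5, seed=1):
    """Round every cell down or up to a multiple of the base, rounding up with
    probability residual/base so that the expected value equals the original count."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=int)
    lower = (table // base) * base
    residual = table - lower
    round_up = rng.random(table.shape) < residual / base
    return lower + base * round_up.astype(int)


print(random_round([[7, 10, 3], [1, 12, 0]], base=5))
```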
21 Post-tabular Methods
4. Controlled Rounding (UK NeSS)
All cells in tables are rounded up or down by an optimal method that ensures the marginal totals are maintained (up to the base).
Advantages:
» Fully protects against disclosure by differencing
» Tables fully additive
» Minimal information loss
» Works with linked tables and external constraints
Disadvantages:
» Rounds all cells, including safe cells
» Requires the complex SDC tool Tau Argus (and a licence)
» Would require more development to work with Census-size tables
22 Post-tabular Methods
5. Table Design Methods
» Population thresholds
» Level of detail and number of dimensions in the table
» Minimum average cell size
6. Further Development of SDC Methods
» Controlled small cell adjustments, controlled rounding
» Better implementation and benchmarking techniques for maintaining totals at higher levels of aggregation
23 Evaluation Study: Origin-Destination (Workplace) Tables and Small Cell Adjustments
Totals in tables are obtained by aggregating the perturbed internal cells.
Different tables produced different results; the number of flows differs between tables.
ONS guidelines: (1) use the table with the minimum number of categories; (2) combine the minimum number of smaller geographical areas when obtaining estimates for larger areas.
Some problems in implementation for origin-destination tables.
24 Evaluation Study
Workplace (ward-to-ward) Table W206 for the West Midlands: the small cell adjustment method is unbiased (errors within the confidence intervals of the perturbation scheme); ward-to-ward totals are not badly damaged; skewness appears at lower geographical levels.
25 The optimum SDC method will be a mixture of different methods, depending on risk-utility management, output requirements and user needs more generally.
» What is the optimum balance between perturbative and non-perturbative methods of SDC?
» How transparent should the SDC method be? Pre-tabular methods have hidden effects and users are not able to make adjustments in their analysis.
» What are the data used for, and how should information loss and the impact of the SDC method on data quality be measured?
» Can we improve on post-tabular methods?
» Policies and strategies for access to data through contracts and safe settings?
Work has started on optimal methods as part of the overall planning for the 2011 Census.
26 Work Plan for Census 2011
I. Assessment of Census 2001 SDC methods:
» Risk-utility analysis
» Comprehensive report, forums and discussion groups on SDC methods with users and other agencies
II. Alternative methods for SDC, based on the results of Phase I, user requirements for census outputs and feedback
27 Final Remarks
We are evaluating our methods and planning future improvements.
Our SDC methodology is based on a scientific approach, an understanding of the needs and requirements of users, and international best practice.
Methods for SDC are greatly enhanced by cooperation and feedback from the user community!
28 Contact Details
Natalie Shlomo
SDC Centre, Methodology Directorate
Office for National Statistics
Segensworth Road, Titchfield, Fareham PO15 5RR
01329 812612
natalie.shlomo@ons.gsi.gov.uk