1 Statistical Disclosure Control Methods for Census Outputs Natalie Shlomo SDC Centre, ONS January 11, 2005.

Presentation transcript:

1 Statistical Disclosure Control Methods for Census Outputs Natalie Shlomo SDC Centre, ONS January 11, 2005

2 Introduction
Developing SDC methods
  » Output requirements, risk management
  » Risk-Utility decision problem
Methods for protecting census outputs
  » Pre- and post-tabular methods
  » Safe settings, contracts
Work plan for Census 2011
  » Assessment of Census 2001 SDC methods
  » User and other stakeholders’ requirements
  » Assessment of alternative methods
Topics for Discussion

3 Introduction
The previous ONS talk covered SDC implementation for Census 2001 and the lessons learnt. This talk focuses on developing SDC methodology for Census 2011, based on census output requirements and user needs more generally. The planned approach is a risk-utility decision framework.
Goal:
  » To provide adequate protection against the risk of re-identification, taking into account user needs, output requirements and the usefulness of the data
  » To improve on existing SDC methods and develop new ones, assess the impact of SDC on data quality, and clarify the advantages and disadvantages of each method

4 Developing SDC Methodology
What are the Census output requirements?
  » Variables to be disseminated
  » Standard and flexible (web generated?) tables
  » Origin-Destination tables (workplace tables)
  » SARs microdata
What are the disclosure risk scenarios, i.e. realistic assumptions about information available to the public that increases the probability of disclosure?
Comment: the problem is the 1’s and 2’s in tables, not the 0’s (except in extreme cases).

5 Disclosure risk measures quantify the risk of re-identification:
  » probability that a sample unique is a population unique
  » probability of a correct match between a record in the microdata and a record in an external file
  » probability that a record is perturbed
Information loss measures quantify the loss of information content in the data as a result of the SDC method. Utility depends on the user needs and use of the data:
  » distortion to distributions (bias)
  » weakening of the measures of association
  » impact on the variance of estimates
  » changes to the likelihood functions

6 Developing SDC Methodology
Methods for protecting outputs:
  » Data masking:
      perturbative methods, which alter the data: swapping, random noise, over-imputation, rounding
      non-perturbative methods, which preserve data integrity: recoding, sub-sampling, suppression
  » Data access under contract in a safe setting, additionally ensuring non-disclosive outputs
SDC methods also need to be developed for checking outputs, e.g. residuals of regression analysis, bandwidth for kernel estimation of distributions.

7 Developing SDC Methodology
In some cases, parameters can be given to users so that they can make corrections in their analysis, or the parameters are embedded in the SDC method to minimize information loss.
SDC is an optimization problem: choose the SDC method, as a function of the data, that maximizes the utility of the data subject to the constraint that the disclosure risk is below a threshold.
Risk-Utility decision framework for choosing the optimum method
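
The optimization can be written compactly as follows. The notation is an assumption made here for illustration and is not taken from the slides: D is the original data, f ranges over a set F of candidate SDC methods, U is a utility measure, R is a disclosure risk measure and τ is the risk threshold.

```latex
\max_{f \in \mathcal{F}} \; U\bigl(f(D)\bigr)
\quad \text{subject to} \quad
R\bigl(f(D)\bigr) \le \tau
```

In practice U and R would be replaced by whichever information loss and disclosure risk measures are chosen for the output in question.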

8 SDC for Census Outputs: Pre-tabular Methods
1. Random Record Swapping (UK 2001, USA 1990)
A small percentage of records have their geographical identifiers (or other variables) swapped with those of other records matching on control variables (larger geographical areas, household size and age-sex distribution).
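
A minimal sketch of random record swapping, not the procedure used in any census. It assumes records are dictionaries; the function name `random_swap` and the field names (`area`, `region`, `hh_size`) are hypothetical. Records are paired at random within each control stratum and the geographical identifier is exchanged within each pair.

```python
import random
from collections import defaultdict

def random_swap(records, rate, control_keys, swap_key="area", seed=0):
    """Swap `swap_key` between randomly paired records of the same control stratum."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]          # work on copies, leave the input intact
    strata = defaultdict(list)
    for i, r in enumerate(out):
        strata[tuple(r[k] for k in control_keys)].append(i)
    for idx in strata.values():
        rng.shuffle(idx)
        n_pairs = int(len(idx) * rate / 2)    # each pair perturbs two records
        for p in range(n_pairs):
            a, b = idx[2 * p], idx[2 * p + 1]
            out[a][swap_key], out[b][swap_key] = out[b][swap_key], out[a][swap_key]
    return out
```

Because swaps happen only inside a stratum, counts of the swapped variable within each stratum are unchanged. A pair drawn from the same area produces no actual change, so the effective swapping rate can be lower than `rate`.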

9 Pre-tabular Methods
2. Targeted Record Swapping (USA 2000)
A large percentage of records that are unique on a set of key variables (large households, ethnicity) have their geographical identifiers (or other variables) swapped with those of other records matching on control variables.
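
A sketch of the targeted variant under the same assumptions (dictionary records, hypothetical field names, hypothetical function name `targeted_swap`). Records that are unique on the key variables are flagged as risky, and each is swapped with a random partner from the same control stratum that lies in a different area. How uniqueness is defined in a real census is not stated in the slides.

```python
import random
from collections import Counter

def targeted_swap(records, key_vars, control_keys, swap_key="area", seed=0):
    """Swap `swap_key` for records that are unique on `key_vars`."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]

    def key(r):
        return tuple(r[k] for k in key_vars)

    def stratum(r):
        return tuple(r[k] for k in control_keys)

    counts = Counter(key(r) for r in out)
    risky = [i for i, r in enumerate(out) if counts[key(r)] == 1]
    used = set()
    for i in risky:
        if i in used:
            continue
        partners = [j for j, r in enumerate(out)
                    if j != i and j not in used
                    and stratum(r) == stratum(out[i])
                    and r[swap_key] != out[i][swap_key]]
        if not partners:
            continue                          # no swap pair in this stratum
        j = rng.choice(partners)
        out[i][swap_key], out[j][swap_key] = out[j][swap_key], out[i][swap_key]
        used.update((i, j))
    return out, risky
```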

10 Pre-tabular Methods
3. Over-Imputation
A percentage of randomly selected records have certain variables erased, and standard imputation methods are applied by selecting donors matching on control variables.
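
A minimal sketch of over-imputation with a random hot-deck donor, assuming dictionary records and hypothetical field names. The function name `over_impute` and the choice of a uniformly random donor are illustrative assumptions; the slides only say that standard imputation methods are applied.

```python
import random

def over_impute(records, rate, control_keys, target="area", seed=0):
    """Erase `target` for a random share of records and impute it from a donor."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    erased = set(rng.sample(range(len(out)), int(len(out) * rate)))
    for i in sorted(erased):
        stratum = tuple(out[i][k] for k in control_keys)
        # Donors are non-erased records that match on all control variables.
        donors = [j for j in range(len(out))
                  if j not in erased
                  and tuple(out[j][k] for k in control_keys) == stratum]
        if donors:
            out[i][target] = out[rng.choice(donors)][target]
        # With no donor available, the original value is kept in this sketch.
    return out, erased
```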

11 Pre-tabular Methods
4. Post-Randomisation Method (PRAM) (UK 2001)
A percentage of records have certain variables misclassified according to prescribed probabilities. Includes a method that preserves marginal (compounded) distributions and edit constraints.
Advantages:
  » Marginal (compounded) distributions and edits can be maintained
  » Method can be targeted to risky records
  » Consistent totals between tables
  » Some protection against disclosure by differencing
Disadvantages:
  » All records (risky or not) have the same probability of being perturbed
  » Errors (bias and variance of estimates) due to perturbation
  » Need for re-editing and further imputations
  » Probability of perturbation very small
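
A minimal sketch of the misclassification step, assuming the prescribed probabilities are supplied as a nested dictionary `transition[old][new]` whose rows sum to 1. The function name `pram` and the data layout are hypothetical.

```python
import random

def pram(values, transition, seed=0):
    """Replace each category by a draw from its row of the transition matrix."""
    rng = random.Random(seed)
    out = []
    for v in values:
        categories = list(transition[v])
        weights = [transition[v][c] for c in categories]
        out.append(rng.choices(categories, weights=weights, k=1)[0])
    return out
```

The variant that preserves marginal distributions needs a transition matrix P chosen so that the vector of original category frequencies t satisfies tP = t. Constructing such a matrix is not shown here.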

12 Preliminary Evaluation of Record Swapping
16,120 households from the 1995 Israel CBS Census sample file
Households randomly swapped within control strata: broad region, type of locality, age groups (10) and sex
Strata collapsed for unique records with no swap pair
Disclosure Risk measure: where cells of size 1 and 2
Information Loss measure:
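
The two measures can be illustrated with simple stand-ins. Both definitions below are assumptions made for this sketch and are not the formulas from the slide: risk is taken as the share of small cells (size 1 or 2) whose count is unchanged after perturbation, and information loss as the mean absolute difference per cell. Tables are assumed to be dictionaries mapping a cell label to a count.

```python
def disclosure_risk(orig, pert, small=(1, 2)):
    """Share of small cells in `orig` whose count is unchanged in `pert`."""
    risky = [c for c, n in orig.items() if n in small]
    if not risky:
        return 0.0
    return sum(pert.get(c, 0) == orig[c] for c in risky) / len(risky)

def information_loss(orig, pert):
    """Mean absolute difference between original and perturbed cell counts."""
    cells = set(orig) | set(pert)
    return sum(abs(orig.get(c, 0) - pert.get(c, 0)) for c in cells) / len(cells)
```

Evaluating both functions for each candidate method and parameter setting gives the points of a Risk-Information Loss map.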

13 Preliminary Evaluation of Record Swapping
Risk-Information Loss (R-IL) Map:
  » Swapping rates: 5%, 10%, 15%, 20%
  » Information Loss (distortion to the distribution): number of persons in household in each district

14 Preliminary Evaluation of Record Swapping
Random record swapping vs. targeted record swapping (on uniques in control strata, i.e. large households).
  » Swapping rate: 10%
  » Information Loss (distortion to the distribution): number of persons in household in each district

15 Preliminary Evaluation of Over-Imputation
10% of households had their geographic identifier erased:
  » Random selection of households
  » Targeted selection of households from unique control strata (i.e. large households)
Geographic identifier imputed using hot-deck imputation within strata: sex and age groups.
Risk measure:
Information Loss measure:

16 Preliminary Evaluation of Over-Imputing
Risk-Information Loss (R-IL) Map:
  » 10% selected records (random and targeted on uniques)
  » Information Loss (distortion to the distribution): number of persons in household in each district
[Figure: Risk-Information Loss Assessment. Series: 10% random and targeted record swapping (blue); 10% random and targeted over-imputation (pink). Axes: Information Loss, Risk.]

17 Final Comments for Pre-tabular Methods
Geographies are swapped because this introduces fewer edit failures and geography is generally less correlated with other variables.
If other variables, such as age, were swapped (or over-imputed), the data would be badly damaged, a large amount of re-editing would be necessary, and further imputations would have to be carried out.
Swapping does not affect distributions at the higher (geographical) level within which the records are swapped. This is an advantage, not a disadvantage.
Over-imputation is similar to record swapping but causes more damage to the data. The “missing at random” assumption is problematic for the analysis of full data sets.

18 SDC for Census Outputs: Post-tabular Methods
1. Barnardization (UK 1991)
Every internal cell in an output table is modified by (+1, 0, -1) according to prescribed probabilities (q, 1-2q, q). No adjustments are made to zero cells.
Advantages:
  » Some protection against disclosure by differencing
Disadvantages:
  » High proportion of risky (unique) records unperturbed
  » Inconsistent totals between tables, since margins are calculated from perturbed internal cells
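
A minimal sketch of the cell-level rule, assuming the table is a list of rows of internal cell counts and that q is at most 0.5. The function name `barnardize` is hypothetical.

```python
import random

def barnardize(table, q, seed=0):
    """Add +1, 0 or -1 to each non-zero cell with probabilities (q, 1-2q, q)."""
    rng = random.Random(seed)
    out = []
    for row in table:
        new_row = []
        for n in row:
            if n == 0:
                new_row.append(0)             # zero cells are never adjusted
                continue
            u = rng.random()
            delta = 1 if u < q else (-1 if u < 2 * q else 0)
            new_row.append(n + delta)
        out.append(new_row)
    return out
```

Margins recomputed from the returned cells will generally differ between tables that share the same population, which is the inconsistency listed among the disadvantages.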

19 Post-tabular Methods
2. Small Cell Adjustments (UK 2001, Australia)
Small cells are randomly adjusted upwards or downwards to a base, using an unbiased stochastic method with prescribed probabilities.
Advantages:
  » Protects the risky (unique) records
  » Lower loss of information for standard tables
Disadvantages:
  » Inconsistent totals between tables, since margins are calculated from perturbed internal cells
  » Can have high errors in totals
  » Little protection against disclosure by differencing
  » Implementation problems for sparse tables (e.g. origin-destination tables)
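
One way to make the adjustment unbiased is sketched below. The base of 3 and the probabilities are assumptions for illustration; the slides do not give the actual parameters. A count n with 0 < n < base moves up to the base with probability n/base and down to 0 otherwise, so its expected value stays n.

```python
import random

def small_cell_adjust(cells, base=3, seed=0):
    """Move counts strictly between 0 and `base` to 0 or `base`, unbiasedly."""
    rng = random.Random(seed)
    out = []
    for n in cells:
        if 0 < n < base:
            out.append(base if rng.random() < n / base else 0)
        else:
            out.append(n)                     # zeros and larger cells are untouched
    return out
```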

20 Post-tabular Methods
3. Unbiased Random Rounding (UK NeSS, New Zealand, Canada)
All cells in tables are rounded up or down according to an unbiased prescribed probability scheme.
Advantages:
  » Provides good protection against disclosure by differencing (although not a 100% guarantee)
  » Easy to apply
  » Totals are consistent between tables within the rounding base
Disadvantages:
  » Rounds all cells, including safe cells
  » Requires complex auditing to ensure protection
  » Totals rounded independently from internal cells, so tables are not additive
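
A minimal sketch of an unbiased scheme, with the rounding base of 5 as an assumed parameter. A count with remainder r is rounded up with probability r/base and down otherwise, which leaves its expected value unchanged; multiples of the base are not moved.

```python
import random

def random_round(cells, base=5, seed=0):
    """Round each count to an adjacent multiple of `base` without bias."""
    rng = random.Random(seed)
    out = []
    for n in cells:
        r = n % base
        if r == 0:
            out.append(n)
        else:
            up = rng.random() < r / base      # round up with probability r/base
            out.append(n - r + (base if up else 0))
    return out
```

Applying the function separately to internal cells and to totals reproduces the non-additivity noted above, since each value is rounded independently.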

21 Post-tabular Methods
4. Controlled Rounding (UK NeSS)
All cells in tables are rounded up or down using an optimal method that maintains the marginal totals (up to the base).
Advantages:
  » Fully protects against disclosure by differencing
  » Tables fully additive
  » Minimal information loss
  » Works with linked tables and external constraints
Disadvantages:
  » Rounds all cells, including safe cells
  » Requires the complex SDC tool Tau Argus (and licence)
  » Would require more development to work with Census-size tables

22 Post-tabular Methods
5. Table Design Methods
  » Population thresholds
  » Level of detail and number of dimensions in the table
  » Minimum average cell size
6. Further development of SDC methods
  » Controlled small cell adjustments, controlled rounding
  » Better implementation and benchmarking techniques for maintaining totals at higher aggregated levels

23 Evaluation Study
Origin-Destination (Workplace) Tables and Small Cell Adjustments
  » Totals in tables obtained by aggregating internal perturbed cells
  » Different tables produced different results; the number of flows differed between tables
  » ONS guidelines: (1) use the table with the minimum number of categories; (2) combine the minimum number of smaller geographical areas to obtain estimates for larger areas
  » Some problems in implementation for origin-destination tables

24 Evaluation Study
Workplace (ward to ward) Table W206 for West Midlands: the small cell adjustment method is unbiased (errors within the confidence intervals of the perturbation scheme), ward to ward totals are not badly damaged, and there is skewness in lower geographical areas.

25 The optimum SDC method is a mixture of different methods, depending on risk-utility management, output requirements and user needs more generally.
  » What is the optimum balance between perturbative and non-perturbative methods of SDC?
  » How transparent should the SDC method be? Pre-tabular methods have hidden effects and users are not able to make adjustments in their analysis.
  » What are the data used for, and how should information loss and the impact of the SDC method on data quality be measured?
  » Can we improve on post-tabular methods?
  » What policies and strategies are needed for access to data through contracts and safe settings?
Work has started on optimal methods as part of the overall planning for the 2011 Census.

26 Work Plan Census 2011
I. Assessment of Census 2001 SDC methods:
  » Risk-Utility analysis
  » Comprehensive report, forums and discussion groups on SDC methods with users and other agencies
II. Alternative methods for SDC, based on the results of phase I, user requirements for census outputs and feedback

27 Final Remarks
  » We are evaluating our methods and planning future improvements
  » Our SDC methodology is based on a scientific approach, an understanding of the needs and requirements of the users, and international best practice
  » Methods for SDC are greatly enhanced by cooperation and feedback from the user community!

28 Contact Details
Natalie Shlomo
SDC Centre, Methodology Directorate
Office for National Statistics
Segensworth Road
Titchfield
Fareham PO15 5RR