Measuring Disclosure Risk and Data Utility for Flexible Table Generators
Natalie Shlomo, Laszlo Antal, Mark Elliot, University of Manchester

Presentation transcript:

1 Measuring Disclosure Risk and Data Utility for Flexible Table Generators
Natalie Shlomo, Laszlo Antal, Mark Elliot, University of Manchester
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/ ) under grant agreement n° (DwB - Data without Boundaries).

2 Topics Covered
- Introduction
- Design of Flexible Table Generating Servers
- Information Based Risk-Utility Measures
- Example Application and Results
- Discussion

3 Introduction
- Large demand for specialized and tailored tables from policy makers and researchers
- NSIs are considering the internet to disseminate outputs through flexible table generators, e.g. US Census Bureau, Australia ABS, Israel CBS
- Key questions:
  (1) What data should be used to produce the tables? Original microdata with or without SDC methods, often aggregated to hypercubes
  (2) At what stage to apply the SDC?
      - Apply to the underlying data, with all tables then considered safe: compounds SDC and reduces utility
      - Apply to the final output tables: problem to ensure consistency and additivity

4 Introduction
- Types of disclosure risk:
  - Identity disclosure, where small cells may lead to an identification
  - Attribute disclosure, where rows/columns have structural zeros and only one or two cells populated (small cells on margins)
  - Differencing of tables, leading to higher risks of the above disclosures
- Output-based query systems, e.g. a flexible table generator, need perturbative methods of SDC (see the CS literature on differential privacy)
- Flexible table generation requires 'on the fly' disclosure risk assessment, application of SDC methods and data utility measures
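Disclosure by differencing can be illustrated with a short sketch. The two tables, their category labels and the small-cell threshold below are hypothetical and chosen only to show how subtracting one released table from another, nearly identical, table can expose a small cell that neither table shows on its own.

```python
def difference_tables(larger, smaller):
    """Subtract the counts of a nested sub-population table from the larger table."""
    return {cell: larger[cell] - smaller.get(cell, 0) for cell in larger}


def small_cells(table, threshold=3):
    """Return cells with a positive count below the (assumed) threshold."""
    return {cell: n for cell, n in table.items() if 0 < n < threshold}


# Hypothetical counts by occupation for two overlapping age bands.
ages_16_to_65 = {"manager": 40, "clerk": 55, "farmer": 12}
ages_16_to_64 = {"manager": 39, "clerk": 55, "farmer": 12}

# Counts implied for people aged exactly 65.
implied = difference_tables(ages_16_to_65, ages_16_to_64)
risky = small_cells(implied)
```

Here both released tables contain only large cells, yet the implied table for the single year of age contains a count of one. This is why consistent, nested categories or perturbation of every cell are needed.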

5 Designing a Flexible Table Generating Server
- SDC rules are easily programmed; some examples:
  - Limit the number of dimensions
  - Avoid disclosure by differencing by ensuring consistent and nested categories
  - Minimum population thresholds, average cell size, etc.
- Algorithm:
  (1) Determine by SDC rules if the table can be produced
  (2) Assess disclosure risk
  (3) Apply SDC method if needed
  (4) Recalculate disclosure risk
  (5) If the table is safe, output it with a utility measure; else go to (3)
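The five-step algorithm above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the function name `generate_table`, the pluggable callbacks, the risk threshold and the cap on iterations are all assumptions, and the toy rule, risk, protection and utility functions in the usage part are simple stand-ins.

```python
from typing import Callable, List, Optional, Tuple


def generate_table(
    counts: List[int],
    passes_rules: Callable[[List[int]], bool],
    risk: Callable[[List[int]], float],
    protect: Callable[[List[int]], List[int]],
    utility: Callable[[List[int], List[int]], float],
    risk_threshold: float,
    max_rounds: int = 10,
) -> Optional[Tuple[List[int], float]]:
    """Return (protected counts, utility), or None if the table is refused."""
    # Step 1: fixed SDC rules decide whether the table may be produced at all.
    if not passes_rules(counts):
        return None
    current = list(counts)
    for _ in range(max_rounds):
        # Steps 2 and 4: assess (or reassess) the disclosure risk.
        if risk(current) <= risk_threshold:
            # Step 5: the table is safe, so release it with its utility measure.
            return current, utility(counts, current)
        # Step 3: apply the SDC method and loop back to the risk check.
        current = protect(current)
    return None  # assumed fallback: refuse if the table never becomes safe


# Toy stand-ins (all hypothetical).
def min_population(c):
    return sum(c) >= 100


def share_of_ones_and_twos(c):
    return sum(1 for x in c if x in (1, 2)) / len(c)


def round_to_base_3(c):
    # Deterministic rounding, used only to keep the example reproducible.
    return [3 * round(x / 3) for x in c]


def relative_agreement(orig, pert):
    return 1 - sum(abs(a - b) for a, b in zip(orig, pert)) / sum(orig)


released = generate_table(
    [1, 2, 50, 47, 0, 10],
    min_population,
    share_of_ones_and_twos,
    round_to_base_3,
    relative_agreement,
    risk_threshold=0.0,
)
```

Passing the risk, protection and utility steps as callbacks keeps the loop independent of any particular SDC method, so the same server logic could be reused with rounding, swapping or stochastic perturbation.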

6 Designing a Flexible Table Generating Server
- Types of data:
  - Census data: whole population counts
    - European Census Hub with all member states providing common hypercubes
    - Different SDC methods across member states reduce the utility of the hub
  - Business data: different type of tables (magnitude), not considered further
  - Survey data from social surveys, which typically have non-perturbative SDC methods (coarsening)
    - Weighted counts are generally safe due to large and varying weights, with low sample counts deleted for low quality
    - Unweighted counts are not differentially private due to sample uniques that are population uniques (Shlomo and Skinner 2012) and must be avoided

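7 Information Based Disclosure Risk and Data Utility Measures
- To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy
  $$H\!\left(\frac{F}{N}\right) = -\sum_{i=1}^{K} \frac{F_i}{N}\,\log\frac{F_i}{N}$$
  where $F = (F_1, \dots, F_K)$ is the vector of frequency counts and $N = \sum_{i=1}^{K} F_i$ (with the convention $0 \log 0 = 0$)
- Entropy is bounded by 0 if all cells are zero except one cell, and by $\log K$ if all cell values are equal, i.e. cell proportions are $1/K$
- Risk measure: [formula not preserved in the transcript]
- Combine with other measures (proportion of zeros and size of the population) and define a weighted average: [formula not preserved in the transcript]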
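The entropy and its two bounds can be checked numerically with a short sketch. The function names are hypothetical, and the normalised score `1 - H / log(K)` is only one natural way to turn the entropy into a risk score between 0 and 1; it is an assumption here, since the slide's own risk formula is not preserved.

```python
import math


def entropy(counts):
    """Shannon entropy of the cell proportions F_i / N (natural log, 0 log 0 = 0)."""
    n = sum(counts)
    if n <= 0:
        raise ValueError("table must contain at least one individual")
    return -sum((f / n) * math.log(f / n) for f in counts if f > 0)


def normalised_entropy_risk(counts):
    """Assumed risk score: 1 when one cell holds everyone, 0 when cells are equal."""
    k = len(counts)
    if k < 2:
        raise ValueError("need at least two cells")
    return 1.0 - entropy(counts) / math.log(k)


concentrated = [7, 0, 0, 0]  # all individuals in one cell: entropy 0
uniform = [5, 5, 5, 5]       # equal cells: entropy log(K)
```

A row or column whose entropy is close to 0 is the attribute-disclosure case described above: knowing that someone belongs to that row almost determines their category.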

8 Information Based Disclosure Risk and Data Utility Measures
- Take into account perturbation that introduces random zeros:
  - Adjust the first term by comparing the number of zeros before and after perturbation
  - Smooth out perturbed cell counts based on their expectation under the transition matrix (lowers the second term)
  - Example: for random rounding, replace perturbed zeros with [formula not preserved in the transcript], defined in terms of the frequencies of cell values and the frequencies of perturbed cell values
- For sampling, smooth out sample counts by using the probabilistic log-linear Poisson model approach (Skinner and Shlomo 2008)
  - Replace population counts in the entropy term by [formula not preserved in the transcript]
  - Estimate the number of zeros by [formula not preserved in the transcript]

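9 Information Based Disclosure Risk and Data Utility Measures
- Utility measure: Hellinger's Distance [formula not preserved in the transcript], computed from the original counts and the perturbed counts
- Hellinger's Distance is bounded by 0 and [upper bound not preserved in the transcript], and can be used to compare SDC methods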
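As a rough illustration of the utility measure, the sketch below computes the standard Hellinger distance between the cell proportions of the original and perturbed tables. Normalising both tables to proportions, which gives bounds of 0 and 1, is an assumption; the slide's exact formula and upper bound may differ (for example, by working on raw counts).

```python
import math


def hellinger_distance(original, perturbed):
    """Hellinger distance between two tables, compared as cell proportions."""
    if len(original) != len(perturbed):
        raise ValueError("tables must have the same number of cells")
    n, m = sum(original), sum(perturbed)
    squared = sum(
        (math.sqrt(f / n) - math.sqrt(g / m)) ** 2
        for f, g in zip(original, perturbed)
    )
    return math.sqrt(0.5 * squared)


same = hellinger_distance([4, 6, 10], [4, 6, 10])  # identical tables
disjoint = hellinger_distance([5, 0], [0, 5])      # no overlap at all
rounded = hellinger_distance([1, 2, 50, 47], [0, 3, 51, 48])
```

Smaller values mean the perturbed table stays closer to the original, so the distance can be used to rank SDC methods applied to the same table.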

10 Example: Simulation Hypercube
- Population N=1,500,000
- NUTS2 Region: 2 regions
- Gender: 2 categories
- Banded age groups: 21 categories
- Current Activity Status: 5 categories
- Occupation: 13 categories
- Educational attainment: 9 categories
- Country of citizenship: 5 categories
- Calculate cell proportions from the 2001 UK Census via iterative proportional fitting
- All proportions multiplied by population size and rounded
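The construction of the hypercube (iterative proportional fitting, then scaling by the population size and rounding) can be sketched for a two-way table. This is a simplified stand-in for the seven-variable case: the function names, the uniform seed and the target margins are hypothetical.

```python
def ipf(seed, row_targets, col_targets, iters=100, tol=1e-10):
    """Iterative proportional fitting of a 2-way table to target margins."""
    table = [list(row) for row in seed]
    for _ in range(iters):
        # Scale each row so that it matches its target row margin.
        for i, row in enumerate(table):
            s = sum(row)
            if s > 0:
                table[i] = [v * row_targets[i] / s for v in row]
        # Scale each column so that it matches its target column margin.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Stop once the row margins are also (numerically) satisfied.
        if max(abs(sum(r) - t) for r, t in zip(table, row_targets)) < tol:
            break
    return table


def to_counts(proportions, population):
    """Multiply cell proportions by the population size and round."""
    return [[round(p * population) for p in row] for row in proportions]


fitted = ipf([[1.0, 1.0], [1.0, 1.0]], [0.3, 0.7], [0.4, 0.6])
counts = to_counts(fitted, 1000)
```

With a uniform seed the fitted table is simply the product of the margins; a seed taken from real data would preserve its association structure while matching the target margins.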

11 Flexible Table Generating Servers
- Define a 3-dimensional table with one variable to define the population: banded age group, education group and occupation group, defined for NUTS2=1
- Table has 2,457 cells, 854,539 individuals, average cell size of [value not preserved in the transcript]
- For comparison, we carry out a semi-controlled random rounding to base 3 on the output table calculated from original data
- Table: distribution of cell values, with columns Cell Value, Number of Cells and Percentage of Cells, and rows for the small cell values, "5 and over" and Total [values not preserved in the transcript]

12 SDC Methods for Hypercube
- Random record swapping by selecting 5% of the individuals in NUTS2 region and swapping LAU2, thus a total of 10% of individuals swapped
- Semi-controlled random rounding to base 3, controlled for two NUTS2 totals
- Invariant PRAM with control of totals for two NUTS regions
  - Perturbation on cell values 1 to 10; no perturbation for cell values of 11 and above
  - Low entropy, i.e. cells perturbed to neighbouring cells only
- Risk measure:
  - Weights: .1, .7 (small cells), .1, .1
  - Adjust measure for perturbations by transition matrix
  - Sample based measure: all 2-way interaction log-linear model (entropy term: population 0.318, sample 0.323, estimate 0.319)
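Random rounding to base 3 can be sketched as follows. This shows plain unbiased random rounding of individual cells only; the semi-controlled variant used in the slides additionally constrains selected totals, which is not implemented here. The function name and the `rng` parameter are hypothetical.

```python
import random


def random_round(count, base=3, rng=random):
    """Unbiased random rounding of a non-negative count to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count  # multiples of the base are left unchanged
    # Round up with probability residual / base, otherwise round down,
    # so that the expected value of the rounded count equals the original.
    if rng.random() < residual / base:
        return count + (base - residual)
    return count - residual


rng = random.Random(0)  # seeded only to make the example reproducible
table = [0, 1, 2, 3, 4, 5, 17, 250]
rounded = [random_round(x, 3, rng) for x in table]
```

Every cell ends up as a multiple of 3 and moves by at most 2, so counts of 1 and 2 never appear in the output, while a count of 1 becomes 0 with probability 2/3 and 3 with probability 1/3.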

13 Results
- Table: Disclosure Risk and Hellinger's Distance for each method [values not preserved in the transcript]; row labels in order: Perturbed Input, Original, :50 sample table, Swapping, Semi-controlled Random Rounding, Stochastic Perturbation, Perturbed Output, Semi-Controlled Random Rounding
- Record swapping applied to the hypercube did little to reduce disclosure risk since small cells remain, and utility is high
- Stochastic perturbation has lower disclosure risk but low utility
- Semi-controlled random rounding also reduces disclosure risk and has good utility, but consistency and additivity need to be ensured, which could lower utility
- Comparing the rounding before and after shows that SDC 'on the fly' has lower disclosure risk and the highest utility out of all the methods, since perturbation is not confounded
- Sample based risk measure resulted in a higher risk measure (future work) with very low utility

14 Discussion
- While agencies can claim there is uncertainty in the tables from record swapping, there is little actual reduction in disclosure risk, which is problematic when disseminating tables freely over the internet
- Record swapping and the proposed stochastic perturbation have little impact on disclosure by differencing since they leave original counts in the table
- Perturbative methods where all cells are perturbed can provide more protection and can be made differentially private
- To avoid confounding SDC methods, apply the perturbative method 'on the fly' within the table generating server on the final output table
- Using stochastic perturbative methods allows users to account for the perturbation in their analysis
- Future research:
  - Improve SDC methods for additivity and consistency
  - Consider conditional entropy to account for perturbation and sampling

15 Thank you for your attention