Methods of Geographical Perturbation for Disclosure Control. Caroline Young, Division of Social Statistics and Department of Geography.


Methods of Geographical Perturbation for Disclosure Control
Caroline Young
Division of Social Statistics and Department of Geography
Supervised jointly by: Prof. Chris Skinner (Statistics) and Prof. David Martin (Geography)
POPFEST, June 2006, Liverpool

Overview of Presentation
• Part I: Description of Disclosure Control
• Introduction to PhD topic: disclosure by differencing
• Part II: Methodology to protect against Differencing
• Conclusions and Future Work

What is Disclosure Control?
• Protecting the confidentiality of statistical data, particularly the Census
• UK Census: a promise is given to respondents to protect confidentiality (there are also legal obligations)
• Disclosure control procedures are necessary to ensure confidentiality

How can Disclosure Occur?
EXAMPLE: Benefit Claimants by Age-Group, complete census in Area Y
Age group          Benefit claimants   Non claimants
Aged under 20      …                   …
Aged …             …                   …
Aged …             …                   …
Aged …             …                   …
Aged …             4                   7
Aged 60 and over   6                   0
Total              32                  22

What is Statistical Disclosure Control?
Statistical Disclosure Control refers to statistical methods which modify the data to control the disclosure risk.
Randomly rounded data (base 3): Benefit Claimants by Age-Group, complete census in Area Y
Age group          Benefit claimants   Non claimants
Aged under 20      …                   …
Aged …             …                   …
Aged …             …                   …
Aged …             …                   …
Aged …             6                   6
Aged 60 and over   6                   0
Total              30                  21
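The rounding step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the common unbiased scheme in which a count is rounded up with probability equal to its remainder divided by the base; the slide only says "randomly rounded (base 3)", so the exact probabilities and the function name `random_round` are assumptions.

```python
import random

def random_round(count, base=3, rng=random):
    """Randomly round a non-negative count to a multiple of `base`.

    Assumed scheme: round up with probability remainder/base, otherwise
    round down, so the expected rounded value equals the true count.
    """
    remainder = count % base
    if remainder == 0:
        return count  # multiples of the base are left unchanged
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower

# Legible counts from the example table: 4, 7, 6, 0, 32, 22
rng = random.Random(2006)
rounded = [random_round(c, rng=rng) for c in (4, 7, 6, 0, 32, 22)]
```

Each rounded value is a multiple of 3 lying within 2 of the true count, which matches the pattern in the rounded table (for example 32 becoming 30 and 22 becoming 21).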

Disclosure by Differencing
Disclosure by [geographical] differencing occurs when multiple geographies can be linked to reveal new information.

Differencing from two geographies
Census User A wants Geography A…

Differencing from two geographies
Census User B wants Geography B…

Differencing from two geographies
[Figure: a nested geography and the resulting differenced area. Ref: Duke-Williams & Rees (1998)]

Disclosure by Geographical Differencing
Fictitious Table 1: Claimants in Small Area (to larger boundary)
Benefit claimed       …    …    …    …
Benefit not claimed   8    12   11   …
Fictitious Table 2: Claimants in Small Area A (to smaller boundary)
Benefit claimed       9    16   19   …
Benefit not claimed   8    11   …    …

Disclosure by Geographical Differencing
Calculated Table 2: Claimants in Differenced Area (differenced area shown in yellow)
Benefit claimed       1    0    0    …
Benefit not claimed   0    1    0    …
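The differencing attack amounts to a cell-by-cell subtraction of two published tables. The sketch below is illustrative only: the counts follow the legible values of the fictitious tables where possible, and the remaining cells (the larger-boundary "Benefit claimed" row and the last smaller-boundary "Benefit not claimed" cell) are assumed values chosen so that the result reproduces the calculated table. The function names are hypothetical.

```python
def difference_tables(larger, smaller):
    """Subtract the smaller-boundary table from the larger-boundary table, cell by cell."""
    return {
        row: [big - small for big, small in zip(larger[row], smaller[row])]
        for row in larger
    }

def cells_of_one(table):
    """List (row, column index) for every cell with a count of exactly one."""
    return [(row, i) for row, counts in table.items()
            for i, count in enumerate(counts) if count == 1]

# Assumed values are marked; the rest are taken from the fictitious tables.
larger = {"Benefit claimed": [10, 16, 19],        # assumed row
          "Benefit not claimed": [8, 12, 11]}
smaller = {"Benefit claimed": [9, 16, 19],
           "Benefit not claimed": [8, 11, 11]}    # last cell assumed

sliver = difference_tables(larger, smaller)
unique_cells = cells_of_one(sliver)
```

Neither published table contains a small count, yet the differenced area contains two cells with a count of one, each of which describes a single household or person.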

Demand for Multiple Geographies
Increased user demand for flexible or non-standard geographies
• Users: Academics; NHS & Business; Local government; Environmentalists
• Geographies: Postcodes; Static boundaries; National Grid co-ordinates; Administrative units

Part II – Methodology to protect against Differencing

Random Record Swapping (UK Census 2001)
• Introduce uncertainty into the true geographical location of a subset of households
• Basic idea: swap the location of household A with the location of a similar household B
• A unique household in an area (cell value of one) may not be the true household, because it may have been swapped
• Information therefore cannot be disclosed with any certainty
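The basic idea of swapping household A with a similar household B can be sketched as follows. This is a minimal illustration and not the procedure used on the 2001 Census: the record layout (`id`, `area`, `size`), the use of household size as the definition of "similar", and the function name are all assumptions.

```python
import random

def swap_similar_households(households, swap_rate, key, rng):
    """Return a copy of `households` in which sampled households exchange
    areas with a similar household (same `key`) from a different area."""
    swapped = [dict(h) for h in households]
    n_to_swap = round(len(swapped) * swap_rate)
    used = set()
    for i in rng.sample(range(len(swapped)), n_to_swap):
        if i in used:
            continue
        partners = [j for j in range(len(swapped))
                    if j != i and j not in used
                    and key(swapped[j]) == key(swapped[i])
                    and swapped[j]["area"] != swapped[i]["area"]]
        if not partners:
            continue  # no similar household elsewhere: leave unswapped
        j = rng.choice(partners)
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
        used.update((i, j))
    return swapped

# Hypothetical records: similarity is defined by household size only.
households = [
    {"id": 1, "area": "A", "size": 2},
    {"id": 2, "area": "B", "size": 2},
    {"id": 3, "area": "A", "size": 3},
    {"id": 4, "area": "B", "size": 3},
]
result = swap_similar_households(households, 1.0, key=lambda h: h["size"],
                                 rng=random.Random(0))
```

Because only area codes are exchanged between similar households, the number of households per area and the distribution of the matching variable within each area are unchanged.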

Assessing Performance of a Swapping Method
Risk-Utility concept: finding a balance
• MAXIMISE UTILITY. Measure of damage/utility: Average Absolute Deviation (AAD) per cell, that is, the absolute differences between perturbed and original cell counts summed over the cells of a table and divided by the number of cells in that table (averaged over all tables)
• MINIMISE RISK. Measure of risk: % of true uniques in table (averaged over all tables)
• Identification Rate = % of unique cell counts (count of one) after swapping which relate to the same household as in the original data
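The two measures can be sketched for a single table as below. The slide's exact notation did not survive, so this is one plausible reading rather than the author's definition: a table is represented as a mapping from cell to the list of household ids falling in it, AAD is the mean absolute change in cell counts, and the identification rate is the share of post-swap uniques that are the same household as before the swap. All names are hypothetical.

```python
def average_absolute_deviation(original, perturbed):
    """Mean absolute difference in cell counts over all cells of one table."""
    cells = original.keys() | perturbed.keys()
    total = sum(abs(len(perturbed.get(c, [])) - len(original.get(c, [])))
                for c in cells)
    return total / len(cells)

def identification_rate(original, perturbed):
    """Percentage of cells with a count of one after swapping that hold
    the same household as the corresponding original cell."""
    uniques = [c for c, hh in perturbed.items() if len(hh) == 1]
    if not uniques:
        return 0.0
    true_uniques = [c for c in uniques if original.get(c, []) == perturbed[c]]
    return 100.0 * len(true_uniques) / len(uniques)

# Hypothetical table: cell -> household ids in that cell
original = {"c1": ["h1"], "c2": ["h2", "h3"], "c3": ["h4"], "c4": []}
perturbed = {"c1": ["h1"], "c2": ["h2"], "c3": ["h3"], "c4": ["h4"]}

aad = average_absolute_deviation(original, perturbed)
rate = identification_rate(original, perturbed)
```

In this toy table only cell `c1` is still a true unique after swapping, so one of the four apparent uniques can be identified correctly.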

Experiments
• Performed simulations on a synthetic census dataset
• Random record swapping method (UK Census 2001) used as benchmark to assess new approaches
• Examine disclosure risk at small area level (postcodes), since the aim is to protect slivers produced by differencing
• Some simplified results here…

Simulating Census Swaps
• Full details of the methods are unknown as they are confidential, so MAKE A GUESS...
• (1) UK Random Record Swap: swap a random sample (10%) of households between Enumeration Districts (EDs) but not out of the Local Authority district. Pair similar households (plus other constraints)
• (2) US Targeted Record Swap: swap 10% of risky households only (households that are unique)
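The difference between the two simulated schemes lies in which households are selected for swapping. The sketch below is an assumption-laden illustration: "risky" is taken to mean unique within an area on a chosen key, the targeted rate is applied to the risky households only, and the record layout and function names are hypothetical.

```python
import random
from collections import Counter

def find_uniques(households, key):
    """Households that are the only one with their key value inside their area."""
    counts = Counter((h["area"], key(h)) for h in households)
    return [h for h in households if counts[(h["area"], key(h))] == 1]

def select_random(households, rate, rng):
    """Scheme (1): a simple random sample of all households."""
    return rng.sample(households, round(len(households) * rate))

def select_targeted(households, rate, key, rng):
    """Scheme (2): a sample drawn only from households that are unique."""
    risky = find_uniques(households, key)
    return rng.sample(risky, round(len(risky) * rate))

# Hypothetical records keyed on household size
records = [
    {"id": 1, "area": "P1", "size": 1},
    {"id": 2, "area": "P1", "size": 1},
    {"id": 3, "area": "P1", "size": 4},
    {"id": 4, "area": "P2", "size": 2},
]
by_size = lambda h: h["size"]
unique_ids = [h["id"] for h in find_uniques(records, by_size)]
targeted = select_targeted(records, 1.0, by_size, random.Random(1))
sampled = select_random(records, 0.5, random.Random(1))
```

The selected households would then be paired with similar households in another area, subject to whatever geographical constraints the scheme imposes.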

Disclosure Risk
Postcode Level        Random Swap (1)   Targeted Swap (2)
Identification Rate   94%               74%
• In practice, other post-tabulation methods (small cell adjustment) were also used to offer more protection at small area level
• But we need a pre-tabulation method: one method that protects the data before aggregation

100% swapping
• Reduce disclosure risk: swap ALL households
• Maximise utility: swap shorter distances (between adjacent postcodes instead of EDs)
100% random postcode swap   Identification Rate   AAD
Postcode Level              1%                    2.0 per cell
Ward Level                  50%                   17.5 per cell
• Disclosure risk is much reduced at small area level
• Too much damage at higher levels of aggregation

Distance Swap
• Current swapping distances depend on pre-set geographies, which have different shapes and population distributions. In addition, boundaries often change
• New distance swap: sample swapping distances from a distribution equivalent to the 100% random swap (truncated normal with the same mean and standard deviation)
Postcode Level        100% random postcode swap   100% distance swap
Identification Rate   1%                          2%
AAD                   2.0 per cell                1.7 per cell
Ward Level            100% random postcode swap   100% distance swap
Identification Rate   50%                         33%
AAD                   17.5 per cell               29.2 per cell
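Sampling a swapping distance from a truncated normal and choosing a partner at roughly that distance can be sketched as below. The rejection sampler, the nearest-to-target matching rule, the coordinate layout and all parameter values are assumptions made for illustration; the slide specifies only the shape of the distribution.

```python
import math
import random

def sample_truncated_normal(mean, std, lower, upper, rng):
    """Draw from a normal(mean, std) restricted to [lower, upper] by rejection."""
    while True:
        distance = rng.gauss(mean, std)
        if lower <= distance <= upper:
            return distance

def pick_partner(index, coords, target_distance, available):
    """Choose the available household whose Euclidean distance from
    household `index` is closest to the sampled target distance."""
    x0, y0 = coords[index]
    return min(
        (j for j in available if j != index),
        key=lambda j: abs(math.hypot(coords[j][0] - x0, coords[j][1] - y0)
                          - target_distance),
    )

# Hypothetical household coordinates (arbitrary units)
coords = [(0, 0), (1, 0), (5, 0), (10, 0)]
rng = random.Random(42)
target = sample_truncated_normal(mean=4.0, std=2.0, lower=0.0, upper=12.0, rng=rng)
partner = pick_partner(0, coords, target, available=range(len(coords)))
```

Because the distance is sampled directly, the swap no longer depends on how postcode or ED boundaries happen to be drawn.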

Density Swap
• How to improve the distance swap? We want more control over damage and risk
• Solution: low density areas are more vulnerable to disclosure attacks because fewer people live there. These households require greater perturbation
• Households in high density areas are less risky and need to be perturbed over smaller distances (this also reduces damage)

Density Swap
• Change sampling distribution: sample 'number of households'
• Takes into account local population density
• Distance is not Euclidean but in terms of number of households
[Figure: a rural area and an urban area]
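Measuring distance in households rather than metres can be sketched by working with positions in a sorted list of households. This is an illustrative reading only: the exponential offset distribution, the one-dimensional ordering, the fallback rule at the ends of the list and the names used are all assumptions.

```python
import random

def density_swap_partner(position, n_households, mean_offset, rng):
    """Pick a partner a sampled number of households away along a sorted list."""
    # Offset is counted in households, never less than one.
    offset = max(1, round(rng.expovariate(1 / mean_offset)))
    candidates = [position + offset, position - offset]
    valid = [t for t in candidates if 0 <= t < n_households]
    if valid:
        return rng.choice(valid)
    # Offset overshoots both ends: fall back to the farther end of the list.
    return 0 if position > (n_households - 1) / 2 else n_households - 1

def swap_distance(coords, position, partner):
    """Physical distance implied by a swap along a one-dimensional ordering."""
    return abs(coords[partner] - coords[position])

# Hypothetical household positions in metres along a sorted ordering
urban = [0, 10, 20, 30, 40]
rural = [0, 500, 1000, 1500, 2000]

# The same offset of two households implies very different physical moves.
urban_move = swap_distance(urban, 0, 2)
rural_move = swap_distance(rural, 0, 2)
partner = density_swap_partner(2, len(urban), mean_offset=2, rng=random.Random(3))
```

An offset of two households moves an urban household 20 m but a rural household 1000 m in this toy layout, which is the behaviour the density swap is designed to produce.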

Effectiveness of Density Swap
• Choice of sampling distribution is very important (normal, exponential, etc.)
• Sort households appropriately to control pairing of households
• Match households appropriately: definition of 'similar households'
[Figure: sampling distribution with mean = 2 households. The majority of households are moved approximately 2 households away; households moved too short a distance are still disclosive, and households moved too far lead to a lot of damage]

Results of all 100% swaps
Postcode Level        Random          Distance        Density
Identification Rate   1%              2%              0.7%
AAD                   2.0 per cell    1.7 per cell    2.0 per cell
Ward Level            Random          Distance        Density
Identification Rate   50%             33%             8.5%
AAD                   17.5 per cell   29.2 per cell   9.7 per cell

Conclusions and Further Work
• Density Swap appears to be a good solution, BUT other measures of damage and risk need to be examined
• Is the density swap better than the combination of methods used on the 2001 Census (swapping plus small cell adjustment)?
• Discriminate between local-area uniques and wide-area uniques

References
Brown, D. (2003) Different Approaches to Disclosure Control Problems Associated with Geography. Joint ECE/Eurostat work session on statistical data confidentiality, Luxembourg.
Duke-Williams, O. and Rees, P. (1998) Can Census Offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure. International Journal of Geographical Information Science, 12.
Elliot, M. J. (2005) 'An overview of Statistical Disclosure Control'. Paper presented to RSS Social Statistics Committee conference on Linking survey and administrative data and statistical disclosure control. London, May.
Willenborg, L. and de Waal, T. Statistical Disclosure Control in Practice. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2000) 'An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata', International Journal of Population Geography, 6.