1
Methods of Geographical Perturbation for Disclosure Control
Division of Social Statistics and Department of Geography
Caroline Young
Supervised jointly by: Prof. Chris Skinner (Statistics) and Prof. David Martin (Geography)
POPFEST June 2006, Liverpool
2
Overview of Presentation
Part I - Description of Disclosure Control
Introduction to PhD topic - disclosure by differencing
Part II - Methodology to protect against Differencing
Conclusions and Future Work
3
What is Disclosure Control?
Protecting confidentiality of statistical data, particularly the Census
UK Census: a promise given to respondents to protect confidentiality (also legal obligations)
Disclosure control procedures are necessary to ensure confidentiality
4
How can Disclosure Occur?
EXAMPLE
Benefit Claimants by Age-Group, complete census in Area Y

Age group          Benefit claimants   Non claimants
Aged under 20      5                   6
Aged 20 - 29       4                   1
Aged 30 - 39       5                   5
Aged 40 - 49       8                   3
Aged 50 - 59       4                   7
Aged 60 and over   6                   0
Total              32                  22
5
What is Statistical Disclosure Control?
Statistical Disclosure Control refers to statistical methods which modify the data to control the disclosure risk
Benefit Claimants by Age-Group, complete census in Area Y
Randomly rounded data (base 3)

Age group          Benefit claimants   Non claimants
Aged under 20      6                   6
Aged 20 - 29       3                   3
Aged 30 - 39       6                   3
Aged 40 - 49       9                   3
Aged 50 - 59       6                   6
Aged 60 and over   6                   0
Total              30                  21
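A minimal Python sketch of random rounding to base 3, the method named on this slide. The slide does not say which rounding scheme was used; this sketch assumes the common unbiased variant, in which a count is rounded up with probability proportional to its remainder. The function name `random_round` and the seed are hypothetical.

```python
import random

def random_round(count, base=3, rng=random):
    """Randomly round a non-negative count to a multiple of `base`.

    Assumption (not stated on the slide): unbiased rounding, i.e. the
    count is rounded up with probability remainder/base, so the expected
    value of the rounded count equals the original count.
    """
    remainder = count % base
    if remainder == 0:
        return count  # multiples of the base are left unchanged
    lower = count - remainder
    # Round up with probability remainder/base, otherwise round down.
    if rng.random() < remainder / base:
        return lower + base
    return lower

# True benefit-claimant counts by age group in Area Y (from the example).
true_counts = [5, 4, 5, 8, 4, 6]
rng = random.Random(2006)  # fixed seed only so the sketch is repeatable
rounded = [random_round(c, rng=rng) for c in true_counts]
print(rounded)
```

Each cell, including the total, would be rounded independently, which is why a rounded table need not add up to its rounded total.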
6
Disclosure by Differencing
Disclosure by [geographical] differencing occurs when multiple geographies can be linked to reveal new information.
7
Differencing from two geographies
Census User A wants Geography A….
8
Differencing from two geographies
Census User B wants Geography B….
9
Differencing from two geographies
[Figure: nested geography, with the differenced area marked]
Ref: Duke-Williams & Rees (1998)
10
Disclosure by Geographical Differencing

Fictitious Table 1: Claimants in Small Area (to larger boundary)

                      16-20   21-30   31-40   …
Benefit claimed       10      16      19      …
Benefit not claimed   8       12      11      …

Fictitious Table 2: Claimants in Small Area A (to smaller boundary)

                      16-20   21-30   31-40   …
Benefit claimed       9       16      19      …
Benefit not claimed   8       11      11      …
11
Disclosure by Geographical Differencing

Calculated Table 2: Claimants in Differenced Area

                      16-20   21-30   31-40   …
Benefit claimed       1       0       0       …
Benefit not claimed   0       1       0       …

[Figure: differenced area in yellow]
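The differencing attack on the two fictitious tables can be sketched in a few lines of Python. The counts are those shown on the slides for the first three age groups; the third "Benefit not claimed" count for the smaller area is taken as 11, the value implied by the differenced table. The names `difference_tables`, `larger_area` and `smaller_area` are hypothetical.

```python
# Columns are the age groups 16-20, 21-30, 31-40 (further columns are
# elided on the slide and therefore omitted here).
larger_area = {
    "Benefit claimed": [10, 16, 19],
    "Benefit not claimed": [8, 12, 11],
}
smaller_area = {
    "Benefit claimed": [9, 16, 19],
    "Benefit not claimed": [8, 11, 11],  # third value inferred, see lead-in
}

def difference_tables(outer, inner):
    """Subtract the inner-area table from the outer-area table cell by cell."""
    return {
        row: [o - i for o, i in zip(outer[row], inner[row])]
        for row in outer
    }

differenced = difference_tables(larger_area, smaller_area)

# Cells equal to one point to a single person in the differenced area.
uniques = [
    (row, column)
    for row, counts in differenced.items()
    for column, value in enumerate(counts)
    if value == 1
]
print(differenced)
print(uniques)
```

Neither published table contains a cell of one, yet the subtraction yields two, which is the new information the differencing reveals.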
12
Demand for Multiple Geographies
Increased user demand for flexible or non-standard geographies
[Diagram labels: Academics; NHS & Business; Postcodes; Static boundaries; Local government; Environmentalists; National Grid co-ordinates; Administrative units]
13
Part II – Methodology to protect against Differencing
14
Random Record Swapping (UK Census 2001)
Introduce uncertainty into the true geographical location of a subset of households
Basic idea: swap the location of household A with the location of similar household B
A unique household in an area (cell value of one) may not be the true household, as it may have been swapped
Information cannot be disclosed with any certainty
[Figure: households A and B]
15
Assessing Performance of a Swapping Method
Risk-Utility concept: finding a balance
MAXIMISE UTILITY
Measure of damage/utility: Average Absolute Deviation (AAD) per cell (averaged over all tables). For each table, the absolute differences between swapped and original cell counts are summed over the cells of the table and divided by the number of cells in that table.
MINIMISE RISK
Measure of risk: % of true uniques in table (averaged over all tables)
Identification Rate = % of cell counts of one in the swapped table which relate to the same household as the cell count of one in the original table
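The two measures can be sketched in Python as follows. The original notation did not survive extraction, so this is one reading of the definitions on the slide rather than the author's exact formulas: AAD is taken as the mean absolute difference between swapped and original cell counts, and the identification rate as the share of cells with a count of one after swapping that hold the same household as before. All function and variable names are hypothetical.

```python
def average_absolute_deviation(original, perturbed):
    """Mean absolute difference between perturbed and original cell counts."""
    if len(original) != len(perturbed):
        raise ValueError("tables must have the same number of cells")
    total = sum(abs(p - o) for o, p in zip(original, perturbed))
    return total / len(original)

def identification_rate(original_cells, perturbed_cells):
    """Percentage of perturbed uniques that are the same household as before.

    Each argument maps a cell key to the list of household ids counted in
    that cell. A cell is a 'unique' when it holds exactly one household.
    """
    perturbed_uniques = {
        key: ids[0] for key, ids in perturbed_cells.items() if len(ids) == 1
    }
    if not perturbed_uniques:
        return 0.0
    matches = sum(
        1
        for key, household in perturbed_uniques.items()
        if original_cells.get(key) == [household]
    )
    return 100.0 * matches / len(perturbed_uniques)

# Toy data, invented purely for illustration.
aad = average_absolute_deviation([1, 0, 3, 2], [0, 1, 3, 2])
rate = identification_rate(
    {"a": ["h1"], "b": ["h2", "h3"], "c": ["h4"]},
    {"a": ["h1"], "b": ["h2", "h3"], "c": ["h9"]},
)
print(aad, rate)
```

Averaging either measure over all tables, as the slide describes, would be a further mean over the per-table values.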
16
Experiments
Performed simulations on a synthetic census dataset
Random record swapping method (UK Census 2001) used as benchmark to assess new approaches
Examine disclosure risk at small area level (postcodes), since the aim is to protect slivers produced by differencing
Some simplified results here…
17
Simulating Census Swaps
Full details of methods are unknown as they are confidential
MAKE A GUESS...
(1) UK Random Record Swap: swap a random sample (10%) of households between Enumeration Districts (EDs) but not out of the Local Authority district. Pair similar households (plus other constraints)
(2) US Targeted Record Swap: swap 10% of risky households only (households that are unique)
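A rough Python sketch of simulation (1), the random record swap. Since the real method is confidential, everything beyond the slide's description is an assumption: households are dicts with hypothetical keys `la`, `ed` and `type`, "similar" is taken to mean the same `type`, the 10% refers to the households selected (their partners move as well), and the other constraints mentioned on the slide are not modelled.

```python
import random

def random_record_swap(households, rate=0.10, seed=0):
    """Swap the ED of a random sample of households with a similar household.

    Partners share the Local Authority ('la') and household 'type' but
    come from a different Enumeration District ('ed'). Returns a new
    list; the input records are left unchanged.
    """
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]
    n_target = round(rate * len(swapped))
    order = list(range(len(swapped)))
    rng.shuffle(order)
    used = set()      # households that have already been moved
    n_selected = 0
    for i in order:
        if n_selected >= n_target:
            break
        if i in used:
            continue
        a = swapped[i]
        partners = [
            j for j in range(len(swapped))
            if j != i and j not in used
            and swapped[j]["la"] == a["la"]
            and swapped[j]["ed"] != a["ed"]
            and swapped[j]["type"] == a["type"]
        ]
        if not partners:
            continue  # no eligible partner; leave this household in place
        j = rng.choice(partners)
        a["ed"], swapped[j]["ed"] = swapped[j]["ed"], a["ed"]
        used.update((i, j))
        n_selected += 1
    return swapped

# Synthetic households, invented for illustration only.
households = [
    {"id": k, "la": "LA1", "ed": f"ED{k % 4}", "type": k % 3}
    for k in range(200)
]
result = random_record_swap(households)
moved = sum(1 for before, after in zip(households, result)
            if before["ed"] != after["ed"])
print(moved)
```

Because only locations are exchanged between households of the same type, the number of households of each type in each ED is unchanged; what changes is which household sits where.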
18
Disclosure Risk

Postcode Level        Random Swap (1)   Targeted Swap (2)
Identification Rate   94%               74%

In practice, other post-tabulation methods (small cell adjustment) were also used to offer more protection at small area level
But we need a pre-tabulation method: one method that protects data before aggregation
19
100% swapping
Reduce disclosure risk: swap ALL households
Maximise utility: swap shorter distances (between adjacent postcodes instead of EDs)

Postcode Level        100% random postcode swap
Identification Rate   1%
AAD                   2.0 per cell

Ward Level            100% random postcode swap
Identification Rate   50%
AAD                   17.5 per cell

Disclosure risk is much reduced at small area level
Too much damage at higher levels of aggregation
20
Distance Swap
Current swapping distances are dependent on pre-set geographies, which have different shapes and population distributions. In addition, boundaries often change
New distance swap: sample swapping distances from a distribution equivalent to the 100% random swap (truncated normal with same mean and std)

Postcode Level        100% random postcode swap   100% distance swap
Identification Rate   1%                          2%
AAD                   2.0 per cell                1.7 per cell

Ward Level            100% random postcode swap   100% distance swap
Identification Rate   50%                         33%
AAD                   17.5 per cell               29.2 per cell
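A minimal Python sketch of how a swapping distance might be drawn from a truncated normal and used to pick a partner. The slides give no parameter values, so the mean, standard deviation and bounds below are hypothetical, as are the function names. Note that `mean` and `std` here are the parameters of the untruncated normal, and that choosing the household closest to the sampled distance is an assumption about how the distance is applied.

```python
import math
import random

def sample_swap_distance(mean, std, lower=0.0, upper=None, rng=random):
    """Draw one swapping distance from a truncated normal distribution.

    Simple rejection sampling: draw from N(mean, std) and retry until
    the value lies within [lower, upper].
    """
    while True:
        d = rng.gauss(mean, std)
        if d >= lower and (upper is None or d <= upper):
            return d

def choose_partner(origin, candidates, target_distance):
    """Pick the candidate whose Euclidean distance from `origin` is
    closest to the sampled target distance."""
    return min(
        candidates,
        key=lambda c: abs(math.dist(origin, c) - target_distance),
    )

rng = random.Random(42)
# Hypothetical parameters in metres.
distance = sample_swap_distance(mean=500.0, std=200.0, upper=1500.0, rng=rng)
locations = [(100.0, 0.0), (450.0, 0.0), (900.0, 0.0)]
partner = choose_partner((0.0, 0.0), locations, distance)
print(round(distance, 1), partner)
```

Sampling the distance directly, rather than swapping between fixed areas, removes the dependence on pre-set boundaries described on the slide.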
21
Density Swap
How to improve the distance swap? Want more control over damage and risk.
Solution:
Low density areas are more vulnerable to disclosure attacks because fewer people live there. These households require greater perturbation.
Households in high density areas are less risky and need to be perturbed over smaller distances (which also reduces damage).
22
Density Swap
[Figure: rural area and urban area]
Change sampling distribution: sample ‘number of households’
Takes into account local population density
Distance is not Euclidean but in terms of number of households
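The idea of measuring distance in households rather than metres can be sketched in Python as below. This assumes households have already been sorted into a one-dimensional order that reflects proximity (how to do that is not specified here), and it uses an exponential distribution with a hypothetical mean purely as an example. All names are hypothetical.

```python
import math
import random

def sample_household_offset(mean=2.0, rng=random):
    """Sample how many households away the swap partner lies (at least 1).

    Assumption: an exponential draw rounded up to a whole number of
    households; other sampling distributions could be used instead.
    """
    return max(1, math.ceil(rng.expovariate(1.0 / mean)))

def density_swap_partner(index, n_households, offset, rng=random):
    """Return the index `offset` places away in the sorted household list.

    The direction is random; if that would leave the list, the opposite
    direction is tried, and the result is finally clamped to the list.
    """
    direction = rng.choice((-1, 1))
    j = index + direction * offset
    if not 0 <= j < n_households:
        j = index - direction * offset
    return min(max(j, 0), n_households - 1)

# Hypothetical household positions (metres) along a line: sparse vs dense.
rural = [i * 500.0 for i in range(10)]
urban = [i * 10.0 for i in range(10)]
offset = 2
rural_move = abs(rural[3 + offset] - rural[3])
urban_move = abs(urban[3 + offset] - urban[3])
print(rural_move, urban_move)

rng = random.Random(1)
k = sample_household_offset(rng=rng)
partner_index = density_swap_partner(3, len(rural), k, rng=rng)
print(k, partner_index)
```

The same offset of two households corresponds to a far larger physical move in the sparse list than in the dense one, which is how the method adapts perturbation to local population density.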
23
Effectiveness of Density Swap
Choice of sampling distribution is very important (normal, exponential, etc.)
Sort households appropriately to control pairing of households
Match households appropriately: definition of ‘similar households’
Households moved too short a distance are still disclosive
Households moved too far lead to a lot of damage
Mean = 2 households
Majority of households moved approx. 2 households away
24
Results of all 100% swaps

Postcode Level        Random          Distance        Density
Identification Rate   1%              2%              0.7%
AAD                   2.0 per cell    1.7 per cell    2.0 per cell

Ward Level            Random          Distance        Density
Identification Rate   50%             33%             8.5%
AAD                   17.5 per cell   29.2 per cell   9.7 per cell
25
Conclusions and Further Work
Density Swap appears to be a good solution, but other measures of damage and risk need to be examined
Is the density swap better than the combination of methods used on the 2001 Census (swapping plus small cell adjustment)?
Discriminate between local-area uniques and wide-area uniques
26
References
Brown, D. (2003) Different Approaches to Disclosure Control Problems Associated with Geography. Joint ECE/Eurostat work session on statistical data confidentiality, Luxembourg.
Duke-Williams, O. and Rees, P. (1998) Can Census Offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure. International Journal of Geographical Information Science, 12, 579-605.
Elliot, M. J. (2005) An overview of Statistical Disclosure Control. Paper presented to RSS Social Statistics Committee conference on Linking survey and administrative data and statistical disclosure control, London, May 2005.
Willenborg, L. and de Waal, T. (1996) Statistical Disclosure Control in Practice. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2000) An evaluation of the combinatorial optimisation approach to the creation of synthetic microdata. International Journal of Population Geography, 6, 349-366.