The methodology used for the 2001 SARs Special Uniques Analysis
Mark Elliot, Anna Manning
Confidentiality and Privacy Group, University of Manchester
Overview
–Description of DIS
–Description of SUDA
–Description of DIS-SUDA
–Numerical Study
Data Intrusion Simulation (DIS)
–Uses the microdata set itself to estimate risk at the file level
–Provides estimates of matching probabilities, in particular the probability of a correct match given a unique match: pr(cm|um)
–Special method: sub-sampling and re-sampling
–General method: derivation from the partition structure of the microdata file
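The general method can be illustrated with a minimal sketch. It assumes the estimator takes the form pr(cm|um) ≈ πU / (πU + (1 − π)P), where U is the number of sample uniques, P is the number of records in sample pairs and π is the sampling fraction. That form is recalled from the DIS literature and is not stated on the slide; the function name `dis_general` and the toy keys are hypothetical.

```python
from collections import Counter

def dis_general(keys, sampling_fraction):
    """Estimate pr(cm|um) from the partition structure of a sample.

    keys: one hashable key-variable combination per sample record.
    Assumed form (not given in the slides): pi*U / (pi*U + (1 - pi)*P).
    """
    class_sizes = Counter(keys)                   # size of each equivalence class
    size_counts = Counter(class_sizes.values())   # number of classes of each size
    uniques = size_counts[1]                      # U: records in classes of size 1
    in_pairs = 2 * size_counts[2]                 # P: records in classes of size 2
    pi = sampling_fraction
    denom = pi * uniques + (1 - pi) * in_pairs
    # With no uniques and no pairs there is nothing to estimate from.
    return pi * uniques / denom if denom > 0 else None

# Toy sample: two uniques ("a", "d") and two pairs ("b", "c").
toy_keys = ["a", "b", "b", "c", "c", "d"]
estimate = dis_general(toy_keys, 0.5)
```

Only classes of size one and two enter the calculation, so the estimate needs nothing beyond a frequency count of the key combinations in the sample.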
The DIS Method
–Remove a small number of records from the microdata sample
The DIS Method II
–Copy back a random number of the removed records (each with a probability equal to the original sampling fraction)
The DIS Method III
–Match the removed fragment against the truncated microdata file
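The three steps of the special method can be sketched as a small simulation. This is an illustrative reading of the slides, not the authors' implementation: the function name `dis_special`, its parameters and the decision to pool counts over repeated rounds are assumptions. Records are represented as hashable tuples of key-variable values.

```python
import random
from collections import Counter

def dis_special(records, sampling_fraction, n_remove, n_rounds, seed=0):
    """Estimate pr(cm|um) by sub-sampling and re-sampling (sketch)."""
    rng = random.Random(seed)
    unique_matches = 0
    correct_matches = 0
    for _ in range(n_rounds):
        # Step I: remove a small number of records from the sample.
        removed = sorted(rng.sample(range(len(records)), n_remove))
        removed_set = set(removed)
        file_counts = Counter(
            rec for i, rec in enumerate(records) if i not in removed_set
        )
        # Step II: copy each removed record back with probability pi.
        copied_back = {i for i in removed if rng.random() < sampling_fraction}
        for i in copied_back:
            file_counts[records[i]] += 1
        # Step III: match the removed fragment against the file.
        for i in removed:
            if file_counts[records[i]] == 1:
                unique_matches += 1
                # The single match is correct only if it is record i itself,
                # i.e. record i was one of those copied back.
                if i in copied_back:
                    correct_matches += 1
    if unique_matches == 0:
        return None
    return correct_matches / unique_matches

distinct = [(i,) for i in range(50)]      # every record unique
twins = [(i // 2,) for i in range(50)]    # every record has one twin
all_correct = dis_special(distinct, 1.0, n_remove=5, n_rounds=10)
all_wrong = dis_special(twins, 0.0, n_remove=5, n_rounds=10)
```

The two toy calls are limiting cases: with every record unique and every removed record copied back, all unique matches are correct; with every record twinned and nothing copied back, every unique match is a false match against the twin.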
DIS Validation
–Numerical studies using population data: no bias and small error; Elliot (2000)
–Statistical validation; Skinner and Elliot (2002)
Levels of Risk Analysis
DIS
–Works at the file level
–Very good for comparative analyses, e.g. small area microdata (SAM); Tranmer et al. (2003)
BUT: record-level risk is important
–Variations in risk topography
–Risky records
Special Uniques
Original concept: Elliot, Skinner & Dale (1998)
–Counterintuitive geographical effect indicated two types of sample uniques
–Random and special
–Special: demographic peculiarity
–Random: effect of sampling and variable definition
Special Uniques Definitions
Changing definition:
1. Sample uniques which remain unique despite geographical aggregation.
2. Sample uniques which remain unique through any variable aggregation.
3. Sample uniques on a small number of key variables.
Theoretical and empirical properties of special and random uniques
Special Uniques: Issues
Problem: how to look at all the variables?
–File may contain hundreds
–Combinatorial explosion
–Data storage issues
(1) Storage requirements for locating minimal sample unique patterns (MSUs)
(2) Storage of results for post-processing
HIPERSTAD
Use of high performance computing
–Enables comprehensive analysis of patterns of uniqueness within each record
–Has allowed investigation of more complex grading systems
Risk Signatures
Example
–Unique pairs 3
–Unique triples 2
–Unique fourfolds 0
–Unique fivefolds 1
–Unique sixfolds 0
–Unique sevenfolds 0
–…
An example of MSUs at record level
Size 2    Size 3      Size 5
(1,2)     (1,6,9)     (2,5,6,8,11)
(1,5)     (5,8,12)
(1,8)
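To make the idea of minimal sample uniques and risk signatures concrete, the following is a brute-force sketch for a single record. It is not the SUDA algorithm, which is designed to avoid exhaustive search; the function names and the toy data are hypothetical. The last lines recompute the risk signature from the MSUs in the example above.

```python
from itertools import combinations
from collections import Counter

def minimal_sample_uniques(records, target):
    """Return the MSUs of records[target] as tuples of 0-based column indices."""
    n_vars = len(records[target])
    msus = []
    # Sizes are visited in ascending order, so any unique subset that
    # contains an already-found MSU is not minimal and can be skipped.
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            if any(set(m) <= set(cols) for m in msus):
                continue
            key = tuple(records[target][c] for c in cols)
            matches = sum(1 for r in records if tuple(r[c] for c in cols) == key)
            if matches == 1:
                msus.append(cols)
    return msus

def risk_signature(msus):
    """Count MSUs by size, e.g. {2: 3} means three unique pairs."""
    return dict(Counter(len(m) for m in msus))

toy = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "p"), ("b", "y", "q")]
msus_first = minimal_sample_uniques(toy, 0)
msus_last = minimal_sample_uniques(toy, 3)

# MSUs from the slide (variable numbers as given there).
slide_msus = [(1, 2), (1, 5), (1, 8), (1, 6, 9), (5, 8, 12), (2, 5, 6, 8, 11)]
slide_signature = risk_signature(slide_msus)
```

The signature computed from the slide's MSUs is three pairs, two triples and one fivefold, which matches the risk signature example on the preceding slide.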
Numerical Study
–Elliot et al. (2002) show a strong relationship between the SUDA output score (essentially a measure of the proportion of the lattice that is unique) and the population equivalence class
–However, SUDA's output score is ad hoc: two SUDA output scores from different analyses do not mean the same thing
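The phrase "proportion of the lattice that is unique" can be illustrated directly. A record is unique on a set of variables exactly when that set contains one of its MSUs, so the proportion can be computed from the MSUs alone. The sketch below is a literal reading of that phrase and is not the actual SUDA scoring formula, which the slides do not give; the function name is hypothetical.

```python
from itertools import combinations

def unique_lattice_proportion(msus, n_vars):
    """Share of non-empty variable subsets on which the record is unique."""
    msu_sets = [frozenset(m) for m in msus]
    unique_subsets = 0
    total_subsets = 0
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            total_subsets += 1
            # Unique on this subset if and only if it contains some MSU.
            if any(m <= set(cols) for m in msu_sets):
                unique_subsets += 1
    return unique_subsets / total_subsets

one_pair = unique_lattice_proportion([(0, 1)], n_vars=3)
single_and_pair = unique_lattice_proportion([(2,), (0, 1)], n_vars=3)
```

Smaller MSUs cover more of the lattice, which is why a record with a unique single variable scores higher here than one whose smallest MSU is a pair. The value also depends on how many variables are analysed, which is one way to see why scores from different analyses are not comparable.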
DIS-SUDA
–DIS and SUDA outputs both relate to the underlying partition structure in the population
–However, relating the two is tricky because the SUDA score is ad hoc
–The method we have developed involves first running DIS to calibrate SUDA
DIS-SUDA
It exploits the fact that DIS accurately estimates the mean reciprocal equivalence class.
–This can be used to derive the number of population units corresponding to the sample uniques
–These can then be distributed using the SUDA score
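One simple way to picture the calibration is as a proportional allocation. The slides do not give the allocation rule, so the proportional rule below is purely an assumption, as are the function name and the toy numbers. The sketch treats the DIS estimate as the mean of 1/F over the sample uniques and shares the implied total across records in proportion to their SUDA scores.

```python
def calibrate_scores(suda_scores, dis_estimate):
    """Rescale SUDA scores so that their mean equals the DIS estimate.

    Assumption: allocation is proportional to the SUDA score. The rule
    actually used in DIS-SUDA is not stated in the slides.
    """
    total_score = sum(suda_scores)
    if total_score == 0:
        return [0.0 for _ in suda_scores]
    # Mean reciprocal class size times the number of sample uniques
    # gives the total to be distributed.
    total_certainty = dis_estimate * len(suda_scores)
    return [total_certainty * s / total_score for s in suda_scores]

calibrated = calibrate_scores([1, 2, 3, 4], dis_estimate=0.25)
```

Under this rule a calibrated value can exceed 1 when the scores are very uneven, so a working implementation would need some way to handle that case.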
DIS-SUDA Evaluation
–1991 census data used
–Geographical area with a population of approximately 0.5 million
–50 parallel geographically stratified 2% samples drawn
–12 key variables, restricted to variables coded at 100% in 1991
–DIS-SUDA run across all 50 samples
–Results summed across the 50 samples
–Compare DIS-SUDA scores with population uniques and 1/Fj
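The comparison quantities can be computed directly when the population file is available. The sketch below finds, for each sample unique, the reciprocal 1/Fj of its population equivalence class size; a value of 1 marks a population unique. The function name and the toy data are hypothetical, and the sample is passed in as a plain list of indices rather than being drawn with the geographical stratification used in the study.

```python
from collections import Counter

def reciprocal_class_sizes(population_keys, sample_indices):
    """Map each sample-unique record index to 1/Fj in the population."""
    pop_counts = Counter(population_keys)
    sample_counts = Counter(population_keys[i] for i in sample_indices)
    result = {}
    for i in sample_indices:
        key = population_keys[i]
        if sample_counts[key] == 1:           # record is unique in the sample
            result[i] = 1 / pop_counts[key]   # Fj = population class size
    return result

# Toy population of six records and a sample of four of them.
population = ["a", "a", "b", "c", "c", "c"]
sample = [0, 2, 3, 4]
recip = reciprocal_class_sizes(population, sample)
population_uniques = [i for i, r in recip.items() if r == 1.0]
```

In the toy data record 0 is sample unique but shares its key with one other population unit, record 2 is population unique, and records 3 and 4 are not sample unique and so do not appear in the result.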
Percentage of records population unique by DIS-SUDA score (rounded up to one decimal place).
Mean reciprocal population equivalence class by DIS-SUDA score (grouped)
Conclusions
–The combination of DIS and SUDA gives the desired record-level matching certainty metric
–Records that DIS-SUDA predicts to be population unique are extremely likely to be so