Published by Buddy Manning. Modified over 9 years ago.
1
Using Targeted Perturbation of Microdata to Protect Against Intelligent Linkage
Mark Elliot, University of Manchester
Mark.Elliot@manchester.ac.uk
Cathie Marsh Centre for Census and Survey Research, University of Manchester
2
Overview
Linkage experiments using:
– Individual microdata from the UK census
– Microdata from the UK Labour Force Survey (LFS)
Objective:
– To assess the disclosure risk impact of the statistical disclosure control (SDC) methods used on the census microdata in 2001
3
Data
Three datasets were used in this study:
– The spring 2001 quarter of the standard Labour Force Survey (LFS).
– The standard release version of the 2001 individual-level SAR (post-SDC SAR).
– The pre-SDC version of the 2001 individual-level SAR (pre-SDC SAR).
4
Background
The 2001 SARs were subject to extensive statistical disclosure risk assessment and targeted control methods.
The risk assessment was carried out through a collaboration between the University of Manchester and ONS, using a variety of approaches.
The disclosure control was a mixture of global recoding and local suppression and reimputation based on a variant of the PRAM (post-randomisation) method.
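For readers unfamiliar with PRAM, the following is a minimal sketch of the generic textbook form of the method: each categorical value is replaced by a random draw from a row of a transition matrix. It is not the variant applied to the 2001 SARs, whose details are not given here. The function name `pram`, the two tenure categories, and the transition probabilities are all illustrative assumptions.

```python
import random


def pram(values, transition, rng):
    """Generic PRAM: redraw each value from its row of the transition matrix.

    `transition[v][w]` is the probability that an original value v is
    released as w. Each row is assumed to sum to 1.
    """
    perturbed = []
    for v in values:
        targets = list(transition[v].keys())
        weights = list(transition[v].values())
        perturbed.append(rng.choices(targets, weights=weights, k=1)[0])
    return perturbed


# Hypothetical transition matrix for a two-category variable (assumed values).
TRANSITION = {
    "owner": {"owner": 0.90, "renter": 0.10},
    "renter": {"owner": 0.05, "renter": 0.95},
}

# An identity matrix leaves the data unchanged, which is a useful sanity check.
IDENTITY = {
    "owner": {"owner": 1.0, "renter": 0.0},
    "renter": {"owner": 0.0, "renter": 1.0},
}

original = ["owner", "renter", "renter", "owner", "renter"]
released = pram(original, TRANSITION, random.Random(2001))
print(released)
```

Because the perturbation is random, a matched record in the released file may no longer carry the values it had before release, which is the property that makes a claimed match uncertain for an intruder.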
5
Procedure
1) The variables were selected, and the codings of these variables on the different datasets were then harmonised.
2) The matching was conducted.
3) The SUDA software was run over the SARs to obtain DIS-SUDA scores for the matches.
4) All the unique (one-to-one) matches were sent to ONS.
5) The matches were verified by the ONS Census and LFS divisions.
6) The matches were returned with an indicator on the match file showing whether each match was true or false.
7) The proportion of correct matches was generated under several different assumptions.
6
Variables
– Age (95 categories for the pre-SDC SARs-LFS match, 44 categories for the post-SDC SARs-LFS match)
– Sex (2 categories)
– Marital status (5 categories)
– Region of residence (11 categories)
– Number of residents (7 categories)
– Primary economic status (9 categories)
– Country of birth (14 categories)
– Ethnic group (15 categories)
– Tenure (5 categories)
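To give a sense of how fine-grained this match key is, the sketch below multiplies the category counts listed above. The result is only an upper bound on the number of distinct key combinations, since many combinations will be rare or impossible in practice. The dictionary and function names are illustrative.

```python
from math import prod

# Category counts as listed on the slide; names are illustrative.
SHARED_KEYS = {
    "sex": 2,
    "marital_status": 5,
    "region_of_residence": 11,
    "number_of_residents": 7,
    "primary_economic_status": 9,
    "country_of_birth": 14,
    "ethnic_group": 15,
    "tenure": 5,
}
AGE_CATEGORIES = {"pre-SDC": 95, "post-SDC": 44}


def key_space_size(age_categories: int) -> int:
    """Upper bound on distinct key combinations for a given age coding."""
    return age_categories * prod(SHARED_KEYS.values())


for label, n_age in AGE_CATEGORIES.items():
    print(f"{label}: {key_space_size(n_age):,} possible key combinations")
```

Under these counts the coarser age coding in the post-SDC file cuts the number of possible key combinations by a little more than half (95 versus 44 age categories).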
7
Matching
In principle, we could have used fuzzy matching methods to allow for data divergence.
However, the number of direct one-to-one matches was already very large on both files, so fuzzy matching was judged likely to create an unnecessary administrative burden at the match verification stage.
Therefore, a simple combine-and-sort algorithm was used for the matching.
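The slides do not spell out the combine-and-sort algorithm, so the following is only a sketch of one plausible reading: stack the two files, sort on the match key so that records sharing a key become adjacent, and keep the keys that occur exactly once in each file. The function name, the `id` field, and the toy records are assumptions for illustration.

```python
from itertools import groupby


def one_to_one_matches(file_a, file_b, key_vars):
    """Return (id_a, id_b) pairs whose match key is unique in both files."""
    # Combine: tag every record with its source file.
    combined = [(tuple(r[v] for v in key_vars), "A", r["id"]) for r in file_a]
    combined += [(tuple(r[v] for v in key_vars), "B", r["id"]) for r in file_b]
    # Sort: records sharing a key become adjacent.
    combined.sort(key=lambda item: item[0])

    matches = []
    for _, group in groupby(combined, key=lambda item: item[0]):
        group = list(group)
        ids_a = [rid for _, src, rid in group if src == "A"]
        ids_b = [rid for _, src, rid in group if src == "B"]
        # Keep only keys held by exactly one record in each file.
        if len(ids_a) == 1 and len(ids_b) == 1:
            matches.append((ids_a[0], ids_b[0]))
    return matches


# Toy records with three of the key variables (values are made up).
sar = [
    {"id": "s1", "age": 34, "sex": "F", "region": "NW"},
    {"id": "s2", "age": 34, "sex": "F", "region": "NW"},
    {"id": "s3", "age": 51, "sex": "M", "region": "SE"},
    {"id": "s4", "age": 27, "sex": "F", "region": "SE"},
]
lfs = [
    {"id": "l1", "age": 34, "sex": "F", "region": "NW"},
    {"id": "l2", "age": 51, "sex": "M", "region": "SE"},
    {"id": "l3", "age": 60, "sex": "M", "region": "NW"},
]

print(one_to_one_matches(sar, lfs, ["age", "sex", "region"]))
```

In the toy data only `s3` and `l2` pair up: the key shared by `s1`, `s2` and `l1` is not unique in the first file, and the remaining keys appear in one file only. Exact matching of this kind does not allow for data divergence between the files, which is the trade-off the slide describes.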
8
Matching
– 6085 one-to-one matches between the pre-SDC SAR and the LFS.
– 3130 one-to-one matches between the released SAR and the LFS.
– These matches were sent to ONS for verification.
9
Verification problem
For a significant number of matches, there was no address linkable to the LFS identifying variables.
– This affected 1602 matches (26.32%) against the pre-SDC SAR file and 895 matches (28.59%) against the post-SDC SAR file.
– However, there were no strong relationships with the match key variables.
10
Results
– 2.74% correct match rate with the pre-SDC SAR
– 1.63% correct match rate with the post-SDC SAR
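The headline rates depend on which matches go in the denominator. As a sketch, the code below recomputes the rate two ways from counts reported elsewhere in the slides: the one-to-one match counts, the numbers that could not be verified, and the correct-match totals from the DIS-SUDA band tables (123 and 51). Choosing these two denominators is an assumption about what "different assumptions" could mean, not a statement of the authors' method.

```python
# Counts taken from the slides; the dictionary layout is illustrative.
COUNTS = {
    "pre-SDC SAR": {"one_to_one": 6085, "unverifiable": 1602, "correct": 123},
    "post-SDC SAR": {"one_to_one": 3130, "unverifiable": 895, "correct": 51},
}


def match_rates(one_to_one, unverifiable, correct):
    """Correct-match rate as a % of verified matches and of all matches."""
    verified = one_to_one - unverifiable
    return {
        "verified": verified,
        # Unverifiable matches dropped from the denominator.
        "pct_of_verified": round(100 * correct / verified, 2),
        # Unverifiable matches kept in the denominator (counted as not correct).
        "pct_of_all": round(100 * correct / one_to_one, 2),
    }


for label, counts in COUNTS.items():
    print(label, match_rates(**counts))
```

With the slide counts this gives 2.74% of verified and 2.02% of all matches for the pre-SDC SAR, and 2.28% of verified and 1.63% of all matches for the post-SDC SAR. The verified counts (4483 and 2235) agree with the totals in the band tables.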
11
Pre-SDC SAR
DIS-SUDA band   False matches   Correct matches   % correct   Total
0->0.1          3900            81                2.03        3981
0.1->0.2        218             8                 3.54        226
0.2->0.3        111             11                9.02        122
0.3->0.4        77              9                 10.47       86
0.4->0.5        33              8                 19.51       41
0.5->0.6        10              3                 23.08       13
0.6->0.7        1               1                 50.00       2
>0.7            10              2                 16.67       12
Total           4360            123               2.74        4483
12
Pre-SDC SAR
DIS-SUDA threshold   False matches   Correct matches   % correct
>0                   4360            123               2.7
>0.1                 460             42                8.4
>0.2                 242             34                12.3
>0.3                 131             23                14.9
>0.4                 54              14                20.6
>0.5                 21              6                 22.2
>0.6                 11              3                 21.4
>0.7                 10              2                 16.7
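The threshold table can be rebuilt from the band table by pooling each band with all higher bands. The sketch below does this for the pre-SDC counts. The data layout and function name are illustrative, and each band is identified by its lower bound only.

```python
# (band lower bound, false matches, correct matches) for the pre-SDC SAR.
PRE_SDC_BANDS = [
    ("0", 3900, 81),
    ("0.1", 218, 8),
    ("0.2", 111, 11),
    ("0.3", 77, 9),
    ("0.4", 33, 8),
    ("0.5", 10, 3),
    ("0.6", 1, 1),
    ("0.7", 10, 2),
]


def threshold_table(bands):
    """Pool each band with all higher bands and recompute % correct."""
    rows = []
    for i, (lower, _, _) in enumerate(bands):
        false_n = sum(f for _, f, _ in bands[i:])
        correct_n = sum(c for _, _, c in bands[i:])
        pct_correct = round(100 * correct_n / (false_n + correct_n), 1)
        rows.append((f">{lower}", false_n, correct_n, pct_correct))
    return rows


for row in threshold_table(PRE_SDC_BANDS):
    print(row)
```

Reading the rows this way, restricting attention to matches with higher DIS-SUDA scores raises the share of correct matches from 2.7% to roughly 20% above 0.4, although the counts in the upper bands are small.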
13
Post-SDC SAR
DIS-SUDA band   False matches   Correct matches   % correct   Total
0->0.1          1577            28                1.74        1605
0.1->0.2        394             8                 1.99        402
0.2->0.3        85              2                 2.3         87
0.3->0.4        52              8                 13.33       60
0.4->0.5        45              2                 4.26        47
0.5->0.6        14              2                 12.5        16
0.6->0.7        6               0                 0           6
>0.7            11              1                 8.33        12
Total           2184            51                2.28        2235
14
Post-SDC SAR
DIS-SUDA threshold   False matches   Correct matches   % correct
>0                   2184            51                2.3
>0.1                 607             23                3.7
>0.2                 213             15                6.6
>0.3                 128             13                9.2
>0.4                 76              5                 6.2
>0.5                 31              3                 8.8
>0.6                 17              1                 5.6
>0.7                 11              1                 8.3
15
Concluding Remarks
The study provides evidence that the disclosure control method used in the 2001 SARs provided protection against targeted intrusion.
Caveats:
– Data divergence, coverage, secondary attacks
– Alternative method for identifying risky records