Published by Elliott Harmon. Modified over 10 years ago.
1
WP 9
Assessing Disclosure Risk in Microdata using Record Level Measures
Natalie Shlomo, University of Southampton and Office for National Statistics, n.shlomo@soton.ac.uk
Chris Skinner, University of Southampton, C.J.Skinner@soton.ac.uk
2
Disclosure Risk Assessment for Microdata
Assume:
- the microdata are a sample
- categorical key variables
- no measurement error
Seek:
- record level risk measures
- file level measures obtained by aggregating the record level measures
3
Record Level Measures
- Record $k$ has a given combination of key variable values
- Sample count with the same combination = $f_k$
- Population count with the same combination = $F_k$
- Only consider sample unique records, i.e. $f_k = 1$
- Pr(population unique) = $\Pr(F_k = 1 \mid f_k = 1)$
- Pr(correct match) = $E(1/F_k \mid f_k = 1)$
4
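Aggregated File-level Measures
- Expected number of population uniques in sample: $\tau_1 = \sum_{k:\, f_k = 1} \Pr(F_k = 1 \mid f_k = 1)$
- Expected number of correct matches among sample uniques to the population: $\tau_2 = \sum_{k:\, f_k = 1} E(1/F_k \mid f_k = 1)$
Note: both sums run over sample uniques only, where $f_k$ and $F_k$ are the sample and population counts for the key combination of record $k$.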
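When the population is available, as in a census-based evaluation, the true values of the two file-level measures can be counted directly. The sketch below assumes that the true record-level match probability for a sample unique is one over its population count; the function name and the toy records are illustrative and not taken from the slides.

```python
from collections import Counter


def true_risk_measures(population_keys, sample_keys):
    """Return (tau1, tau2, per_record) computed over sample-unique key combinations.

    population_keys / sample_keys: one hashable key combination per record.
    The sample is assumed to be drawn from the population.
    """
    F = Counter(population_keys)  # population count per key combination
    f = Counter(sample_keys)      # sample count per key combination

    sample_uniques = [k for k, n in f.items() if n == 1]

    # True probability of a correct match for each sample unique: 1 / F_k
    per_record = {k: 1.0 / F[k] for k in sample_uniques}

    tau1 = sum(1 for k in sample_uniques if F[k] == 1)  # population uniques in sample
    tau2 = sum(per_record.values())                     # sum of 1/F_k over sample uniques
    return tau1, tau2, per_record


# Toy data: key = (sex, age)
population = [("M", 30), ("M", 30), ("F", 30), ("F", 41),
              ("F", 41), ("F", 41), ("M", 55)]
sample = [("M", 30), ("F", 30), ("F", 41), ("M", 55)]

tau1, tau2, per_record = true_risk_measures(population, sample)
print(tau1, round(tau2, 4))
```

In the toy data every sampled record is sample unique, two of them are also population unique, and the remaining two contribute 1/2 and 1/3 to the second measure.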
5
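Estimation Problem
To make inference about:
- Record level measures $\Pr(F_k = 1 \mid f_k = 1)$ and $E(1/F_k \mid f_k = 1)$ for each sample unique record
- File level measures $\tau_1$ (expected number of population uniques in sample) and $\tau_2$ (expected number of correct matches)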
6
Log-linear Model 1., and independent given 3. where, sampling fraction Estimate by maximum likelihood,,,
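A minimal sketch of how fitted cell means translate into risk measures, under the Poisson formulation commonly used in this literature. The assumptions, which are not confirmed by the slide text, are: population counts $F_k$ are Poisson with mean $\lambda_k$, units are sampled independently with sampling fraction $\pi$, and the log-linear model is fitted to the sample counts to give fitted means $\hat\mu_k = \pi\hat\lambda_k$. Under these assumptions $F_k - f_k$ is Poisson with mean $(1-\pi)\lambda_k$, which yields the two closed forms used below. Function names are hypothetical.

```python
import math


def record_level_risk(mu_hat, pi):
    """Risk measures for one sample-unique record under the assumed Poisson model.

    mu_hat: fitted expected sample count for the record's cell (must be > 0)
    pi:     sampling fraction, 0 < pi < 1
    """
    lam = mu_hat / pi          # implied expected population count
    theta = (1.0 - pi) * lam   # expected number of unsampled units in the cell
    p_unique = math.exp(-theta)            # Pr(F_k = 1 | f_k = 1)
    p_match = -math.expm1(-theta) / theta  # E(1/F_k | f_k = 1) = (1 - e^-theta) / theta
    return p_unique, p_match


def file_level_risk(mu_hats_of_sample_uniques, pi):
    """Sum the record-level measures over sample uniques."""
    pairs = [record_level_risk(m, pi) for m in mu_hats_of_sample_uniques]
    tau1_hat = sum(p for p, _ in pairs)
    tau2_hat = sum(q for _, q in pairs)
    return tau1_hat, tau2_hat


p_unique, p_match = record_level_risk(mu_hat=0.5, pi=0.5)
tau1_hat, tau2_hat = file_level_risk([0.5, 0.01, 0.2], pi=0.5)
print(p_unique, p_match, tau1_hat, tau2_hat)
```

The match probability is never smaller than the uniqueness probability, since a population unique is always matched correctly.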
7
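Some Literature
- Skinner and Holmes (1998, JOS): good properties of the estimated risk of population uniqueness under an all two-way interactions log-linear model that includes an additional random error term
- Elamir and Skinner (2006, JOS): good properties of both estimated risk measures under the all two-way interactions model, but no need for the error term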
8
Model Sensitivity
- The all two-way interactions model performs well, but there is still evidence of some model-dependence of the two file-level risk estimates in the neighbourhood of this model.
- Tendency for estimated risk to decrease as model complexity increases.
9
Model Choice
- Goodness of fit tests? Pearson? Likelihood ratio?
- AIC, BIC?
- Problems with very large and sparse tables
10
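Bias Criterion
- Allow for small departures from the fitted log-linear model
- Estimate the bias of the file-level risk estimates
- Choose the model to minimise the estimated bias
- Similar to choosing the model to minimise over- or under-dispersion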
11
Minimising Over- (Under-) Dispersion
- A dispersion parameter in the model estimates the degree of over- or under-dispersion
- A test statistic tests the hypothesis of equal dispersion
Reference: Cameron and Trivedi (1998)
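One way to obtain such a dispersion estimate and test is the regression-based test associated with Cameron and Trivedi. The sketch below assumes the alternative variance function Var(f) = mu + alpha * mu^2; which variance function the authors used is not stated, so this choice and the function name are assumptions. The auxiliary response ((f - mu)^2 - f) / mu is regressed on mu without an intercept; the slope estimates alpha and its t-statistic tests equal dispersion (alpha = 0).

```python
import math


def dispersion_test(counts, fitted):
    """Regression-based over-/under-dispersion test (assumed variance mu + alpha*mu^2).

    counts: observed cell counts
    fitted: fitted Poisson means for the same cells (all > 0)
    Returns (alpha_hat, t_statistic).
    """
    n = len(counts)
    # Auxiliary response and regressor for OLS through the origin
    z = [((y - mu) ** 2 - y) / mu for y, mu in zip(counts, fitted)]
    x = list(fitted)

    sxx = sum(xi * xi for xi in x)
    alpha_hat = sum(xi * zi for xi, zi in zip(x, z)) / sxx

    # Residual variance with n - 1 degrees of freedom (one slope estimated)
    rss = sum((zi - alpha_hat * xi) ** 2 for xi, zi in zip(x, z))
    s2 = rss / (n - 1)
    t_stat = alpha_hat / math.sqrt(s2 / sxx)
    return alpha_hat, t_stat


alpha_hat, t_stat = dispersion_test(counts=[0, 1, 3], fitted=[1.0, 1.0, 1.0])
print(alpha_hat, t_stat)
```

A positive estimate points to over-dispersion relative to the Poisson model and a negative one to under-dispersion; a t-statistic near zero is consistent with equal dispersion.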
12
Samples from 2001 UK Census
- Two areas with population of 944,793
- 'Large' key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10); 412,080 cells
- 'Small' key: same except Age (18); 73,440 cells
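The cell counts follow from multiplying the number of categories of each key variable, and the sample sizes reported in the results (4,724, 9,448 and 18,896) can be related to the population size in the same way. A short check, with variable names chosen here for illustration:

```python
from math import prod

# Number of categories per key variable, as listed for the 'large' key
large_key = {
    "Area": 2,
    "Sex": 2,
    "Age": 101,
    "Marital Status": 6,
    "Ethnicity": 17,
    "Economic Activity": 10,
}
# The 'small' key differs only in the coarser age grouping
small_key = {**large_key, "Age": 18}

cells_large = prod(large_key.values())
cells_small = prod(small_key.values())

population = 944_793
sample_sizes = (4_724, 9_448, 18_896)
# Implied sampling fraction for each sample size
fractions = {n: n / population for n in sample_sizes}

print(cells_large, cells_small)
for n, frac in fractions.items():
    print(n, round(frac, 3))
```

The implied sampling fractions are close to 0.5%, 1% and 2%, and in every case the key has far more cells than the sample has records, which is the sparse-table setting the model choice discussion refers to.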
13
Small key, simple random sample of size 18,896
True values: number of population uniques in sample; sum of $1/F_k$ over sample uniques

| Model | Estimates | | Cameron-Trivedi | | |
| --- | --- | --- | --- | --- | --- |
| Independence | 174.3 | 355.7 | 2.062 | 9.56 | 1.287 |
| All 2-way | 85.5 | 222.0 | 0.006 | 2.00 | 0.007 |
14
Large key, simple random sample of size 4,724
True values:

| Model | Estimates | | Cameron-Trivedi | | |
| --- | --- | --- | --- | --- | --- |
| Independence | 197.4 | 385.1 | 0.088 | 10.58 | 0.0296 |
| All 2-way | 35.9 | 112.3 | -0.003 | -7.96 | 0.0004 |
15
Model Search Algorithm
Starting solution: all 2-way interactions log-linear model
Search by:
- Removing terms
- Adding terms
- Swapping terms
TABU method of Drezner, Marcoulides and Salhi (1999)
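The remove, add and swap moves can be sketched as a generic tabu search over sets of model terms. This is an illustrative outline and not the specific procedure of Drezner, Marcoulides and Salhi (1999). The `score` argument is a hypothetical stand-in for whatever criterion is minimised (for example the absolute value of a dispersion test statistic from a fitted model); here a toy score is used so that the example runs without fitting any model.

```python
def tabu_search(all_terms, start, score, n_iter=50, tabu_len=5):
    """Generic tabu search over subsets of model terms (lower score is better)."""
    all_terms = frozenset(all_terms)
    current = frozenset(start)
    best, best_score = current, score(current)
    tabu = [current]  # recently visited models, not to be revisited

    for _ in range(n_iter):
        neighbours = set()
        outside = all_terms - current
        for t in current:
            neighbours.add(current - {t})                  # remove a term
        for t in outside:
            neighbours.add(current | {t})                  # add a term
        for out in current:
            for inn in outside:
                neighbours.add((current - {out}) | {inn})  # swap two terms

        candidates = [m for m in neighbours if m not in tabu]
        if not candidates:
            break

        # Move to the best non-tabu neighbour, even if it is worse than current
        current = min(candidates, key=score)
        tabu = (tabu + [current])[-tabu_len:]

        current_score = score(current)
        if current_score < best_score:
            best, best_score = current, current_score

    return best, best_score


# Toy example: all two-way interactions among five variables
variables = ["s", "a", "m", "et", "ec"]
two_way = {f"{u}*{v}" for i, u in enumerate(variables) for v in variables[i + 1:]}

# Hypothetical target used only to build a toy score function
target = frozenset(two_way - {"a*m", "et*ec"})


def toy_score(model):
    """Number of terms by which the model differs from the toy target."""
    return len(frozenset(model) ^ target)


best, best_score = tabu_search(two_way, start=two_way, score=toy_score)
print(sorted(best), best_score)
```

Accepting the best non-tabu neighbour even when it is worse than the current model is what lets the search leave a local minimum; the best model seen so far is kept separately.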
16
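Large key, simple random sample of size 9,448
True values:

| Model | Estimates | | Cameron-Trivedi | | |
| --- | --- | --- | --- | --- | --- |
| Independent | 386.6 | 701.2 | 0.1846 | 14.42 | 0.3435 |
| All 2-way | 104.9 | 280.1 | -0.0036 | -10.32 | 0.0060 |
| Drop {ea*s} | 104.6 | 279.8 | -0.0035 | -10.15 | 0.0060 |
| Drop {ea*a} | 105.3 | 281.3 | -0.0032 | -9.69 | 0.0062 |
| Drop {ea*m} | 103.8 | 279.1 | -0.0034 | -10.92 | 0.0065 |
| Drop {ea*et} | 108.7 | 290.0 | -0.0024 | -6.09 | 0.0059 |
| Drop {ea*ec} | 105.2 | 280.0 | -0.0035 | -10.60 | 0.0061 |
| Drop {s*ec} | 103.2 | 281.5 | -0.0018 | -5.18 | 0.0065 |
| Drop {a*m} | 134.0 | 328.6 | 0.0071 | 9.42 | 0.0040 |
| Drop {a*et} | 147.0 | 346.2 | 0.0018 | 1.52 | 0.0039 |
| Drop {a*ec} | 184.7 | 419.2 | 0.0316 | 13.27 | 0.0549 |
| Drop {m*et} | 108.7 | 287.5 | -0.0032 | -8.56 | 0.0060 |
| Drop {m*ec} | 108.3 | 284.0 | -0.0028 | -6.74 | 0.0052 |
| Drop {et*ec} | 132.3 | 308.2 | -0.0015 | -2.24 | 0.0020 |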
17
True values:

| Model | Estimates | | Cameron-Trivedi | | |
| --- | --- | --- | --- | --- | --- |
| Drop {et*ec} | 132.3 | 308.2 | -0.0015 | -2.24 | 0.0020 |
| Drop {ea*s}{et*ec} | 132.3 | 308.2 | -0.0015 | -2.27 | 0.0021 |
| Drop {ea*a}{et*ec} | 133.4 | 310.4 | -0.0011 | -1.65 | 0.0015 |
| Drop {ea*et}{et*ec} | 139.8 | 320.8 | -0.0002 | -0.20 | 0.0003 |
| Drop {s*a}{et*ec} | 132.1 | 309.2 | -0.0013 | -2.17 | 0.0033 |
| Drop {s*m}{et*ec} | 133.4 | 310.3 | -0.0011 | -1.58 | 0.0015 |
| Drop {s*et}{et*ec} | 132.4 | 308.5 | -0.0015 | -2.24 | 0.0021 |
| Drop {s*ec}{et*ec} | 130.9 | 310.3 | 0.0002 | 0.35 | 0.0039 |
| Drop {a*et}{et*ec} | 173.4 | 370.2 | 0.0066 | 2.58 | 0.0163 |
| Drop {m*et}{et*ec} | 137.3 | 315.8 | -0.0011 | -1.68 | 0.0019 |
| Drop {m*ec}{et*ec} | 134.0 | 311.1 | -0.0008 | -1.10 | 0.0012 |
| In {ea} Out {et*ec}{ea*s}{ea*a}{ea*m}{ea*et}{ea*ec} | 141.3 | 321.7 | 0.0002 | 0.28 | 0.0000 |
| In {s} Out {et*ec}{ea*s}{s*a}{s*m}{s*et}{s*ec} | 132.6 | 313.0 | 0.0009 | 1.36 | 0.0038 |
18
Record Level Risk Measures
Preferred Model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}
True Global Risk:
Estimated Global Risk:
19
Record Level Risk Measures
Preferred Model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}
True Global Risk:
Estimated Global Risk:

True record level risk measures (rows) by estimated record level risk measures (columns):

| True \ Estimated | 0 – 0.3 | 0.3 – 0.7 | 0.7 – 1 | Total |
| --- | --- | --- | --- | --- |
| 0 – 0.3 | 1,838 | 97 | 26 | 1,961 |
| 0.3 – 0.7 | 75 | 57 | 52 | 184 |
| 0.7 – 1 | 45 | 49 | 65 | 159 |
| Total | 1,958 | 203 | 143 | 2,304 |
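A cross-classification of this kind can be produced by binning each sample unique's true and estimated risk into the three bands. The sketch below assumes the bands are [0, 0.3), [0.3, 0.7) and [0.7, 1]; how the boundaries were treated in the original analysis is not stated, and the function names and toy values are illustrative.

```python
def risk_band(p, edges=(0.3, 0.7)):
    """Index of the band containing p: 0 for [0, 0.3), 1 for [0.3, 0.7), 2 for [0.7, 1]."""
    return sum(p >= e for e in edges)


def cross_tabulate(true_risks, estimated_risks):
    """3x3 table of counts: rows are true risk bands, columns are estimated risk bands."""
    table = [[0, 0, 0] for _ in range(3)]
    for t, e in zip(true_risks, estimated_risks):
        table[risk_band(t)][risk_band(e)] += 1
    return table


# Toy values for four sample-unique records
true_risks = [1.0, 0.5, 0.2, 1.0]
estimated_risks = [0.9, 0.2, 0.1, 0.5]

table = cross_tabulate(true_risks, estimated_risks)
for row in table:
    print(row)
```

Counts on the diagonal are records whose estimated risk falls in the same band as their true risk.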
20
Conclusions
- Model selection by assessing over- or under-dispersion
- Similar risk estimates for models with nearly Poisson dispersion
Further work:
- stratification of files
- complex survey designs