WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures Natalie Shlomo University of Southampton Office for National Statistics

1 WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures. Natalie Shlomo, University of Southampton and Office for National Statistics, n.shlomo@soton.ac.uk. Chris Skinner, University of Southampton, C.J.Skinner@soton.ac.uk.

2 Disclosure Risk Assessment for Microdata. Assume: a sample; categorical key variables; no measurement error. Seek: record level risk measures, aggregated to file level measures.

3 Record Level Measures. For a record with a given combination of key variable values, consider the sample count and the population count of records with the same combination. Only sample unique records (sample count equal to 1) are considered. Two record level measures: Pr(population unique) and Pr(correct match).
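The symbols on this slide did not survive in the transcript. As a sketch only, the measures can be written in notation common in this literature; the symbols $f_k$, $F_k$, $r_{1k}$ and $r_{2k}$ are assumptions here, not taken from the slide, and the second line assumes that a match to one of $F_k$ indistinguishable population units is correct with probability $1/F_k$.

```latex
% Assumed notation: k indexes a combination of key variable values (a cell).
% f_k = sample count in cell k, F_k = population count in cell k.
\begin{align*}
  r_{1k} &= \Pr(F_k = 1 \mid f_k = 1)
         && \text{Pr(population unique)} \\
  r_{2k} &= E\!\left(\frac{1}{F_k} \,\middle|\, f_k = 1\right)
         && \text{Pr(correct match)}
\end{align*}
```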

4 Aggregated File-level Measures. The expected number of population uniques in the sample, and the expected number of correct matches among sample uniques to the population. Note: both measures are aggregated over sample uniques only.
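When the population is known, as in a census-based experiment, the true counterparts of the two file level measures can be counted directly. The sketch below does this for a tiny made-up population; the records, the sampled indices, and the names `tau1` and `tau2` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy population: each record is a tuple of key variable values.
population = [("A", "F"), ("A", "F"), ("A", "M"),
              ("B", "F"), ("B", "M"), ("B", "M"), ("B", "M")]

# Hypothetical sample, given as indices into the population list.
sample_idx = [0, 2, 3, 4]
sample = [population[i] for i in sample_idx]

F = Counter(population)  # population count per key combination
f = Counter(sample)      # sample count per key combination

# Only sample unique combinations (sample count equal to 1) contribute.
sample_uniques = [k for k, count in f.items() if count == 1]

# Number of population uniques among the sample uniques.
tau1 = sum(1 for k in sample_uniques if F[k] == 1)

# Sum of reciprocal population counts over sample uniques: a match to one of
# F[k] indistinguishable population units is assumed correct with chance 1/F[k].
tau2 = sum(1.0 / F[k] for k in sample_uniques)

print(tau1, round(tau2, 4))
```

In practice the population counts are unknown, which is why the later slides estimate these quantities from a model fitted to the sample.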

5 Estimation Problem. To make inference about: the two record level measures for each sample unique record; the two file level measures.

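6 Log-linear Model. Three numbered model assumptions, including a conditional independence assumption and a log-linear specification; the sampling fraction enters the model. Estimate the parameters by maximum likelihood.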
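The formulas on this slide were lost, so the following is a sketch under a commonly used formulation rather than a reconstruction of the slide. It assumes the population count in each cell is Poisson with mean `lam`, and that units are sampled independently with sampling fraction `pi`. Under those assumptions the unsampled count in a cell is Poisson with mean `(1 - pi) * lam`, independent of the sample count, which gives closed forms for both record level measures. The function names and the fitted means used below are hypothetical.

```python
import math

def record_risks(lam, pi):
    """Record level risks for a sample unique cell under the assumed model.

    lam: (fitted) Poisson mean of the population count in the cell.
    pi:  sampling fraction, strictly between 0 and 1.
    Returns (Pr(population unique), Pr(correct match)).
    """
    mu = (1.0 - pi) * lam            # mean of the unsampled count in the cell
    r1 = math.exp(-mu)               # no other population unit in the cell
    # E[1 / (1 + X)] for X ~ Poisson(mu); the limit as mu -> 0 is 1.
    r2 = (1.0 - math.exp(-mu)) / mu if mu > 0 else 1.0
    return r1, r2

def file_risks(lams_of_sample_uniques, pi):
    """Sum the record level risks over sample unique cells."""
    pairs = [record_risks(lam, pi) for lam in lams_of_sample_uniques]
    return sum(p[0] for p in pairs), sum(p[1] for p in pairs)

# Hypothetical fitted means for three sample unique cells, 1% sample.
tau1_hat, tau2_hat = file_risks([0.5, 3.0, 40.0], pi=0.01)
```

The fitted means would come from a log-linear model estimated on the sample table; that fitting step is outside this sketch.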

7 Some Literature. Skinner and Holmes (1998, JOS): good properties of the risk estimate under the all two-way interactions log-linear model with an additional model term. Elamir and Skinner (2006, JOS): good properties of both risk estimates under the all two-way interactions model, with no need for that additional term.

8 Model Sensitivity. The all two-way interactions model performs well, but there is still evidence of some model-dependence of the risk estimates in the neighbourhood of this model. Estimated risk tends to decrease as model complexity increases.

9 Model Choice. Goodness of fit tests (Pearson, likelihood ratio)? AIC, BIC? Problems with very large and sparse tables.

10 Bias Criterion. Allow for small departures from the model. Estimate the bias of the risk estimate, and choose the model that minimises this estimated bias. This is similar to choosing the model to minimise a related quantity.

11 Minimising Over- (Under-) Dispersion. Under a dispersion model, one statistic estimates the degree of over- or under-dispersion and another tests the hypothesis of equal dispersion (Cameron and Trivedi, 1998).
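The slide does not say which form of the Cameron and Trivedi test was used. As an illustration, the sketch below uses the regression-based version with variance function Var(y) = mu + alpha * mu, which is an assumption: the quantity ((y - mu)^2 - y) / mu is regressed on a constant, so the estimate of alpha is its mean and the test statistic is the usual t-ratio. The function name and the toy counts are hypothetical.

```python
import math

def cameron_trivedi(y, mu):
    """Regression-based dispersion check for Poisson fitted values.

    y:  observed cell counts.
    mu: fitted cell means (all strictly positive).
    Returns (alpha_hat, t_stat). alpha_hat > 0 suggests over-dispersion,
    alpha_hat < 0 under-dispersion, and values near 0 equal dispersion.
    """
    z = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu)]
    n = len(z)
    alpha_hat = sum(z) / n                        # OLS on a constant = mean
    s2 = sum((zi - alpha_hat) ** 2 for zi in z) / (n - 1)
    se = math.sqrt(s2 / n)                        # standard error of the mean
    t_stat = alpha_hat / se if se > 0 else float("nan")
    return alpha_hat, t_stat

# Hypothetical counts that are more spread out than a Poisson with mean 3.
alpha_hat, t_stat = cameron_trivedi([0, 6, 0, 6], [3.0, 3.0, 3.0, 3.0])
```

A model search guided by this check would prefer models whose estimate and test statistic are both close to zero.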

12 Two areas with a population of 944,793. ‘Large’ key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10), giving 412,080 cells. ‘Small’ key: the same except Age (18), giving 73,440 cells. Samples are drawn from the 2001 UK Census.
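The cell counts follow from multiplying the numbers of categories, and the sample sizes quoted on the results slides (18,896, 9,448 and 4,724) can be turned into sampling fractions. A short check, with variable names chosen here for illustration:

```python
from math import prod

# Number of categories per key variable, as listed on the slide.
large_key = {"Area": 2, "Sex": 2, "Age": 101, "Marital Status": 6,
             "Ethnicity": 17, "Economic Activity": 10}
small_key = dict(large_key, Age=18)   # same key with Age grouped into 18 bands

cells_large = prod(large_key.values())
cells_small = prod(small_key.values())

population = 944_793
sample_sizes = [18_896, 9_448, 4_724]  # sizes quoted on the results slides
fractions = {n: n / population for n in sample_sizes}

print(cells_large, cells_small)
for n, frac in fractions.items():
    print(n, round(frac, 4))
```

The fractions work out to about 2%, 1% and 0.5%. Even the small key has far more cells than the largest sample has records, which is the sparsity the Model Choice slide refers to.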

13 Small key, simple random sample of size 18,896. Estimates are compared with the true values of the number of population uniques in the sample and the sum of reciprocal population counts over sample uniques.

Model           Estimates           Cameron-Trivedi
Independence    174.3     355.7     2.062     9.56     1.287
All 2-way        85.5     222.0     0.006     2.00     0.007

14 Large key, simple random sample of size 4,724.

Model           Estimates           Cameron-Trivedi
Independence    197.4     385.1      0.088    10.58    0.0296
All 2-way        35.9     112.3     -0.003    -7.96    0.0004

15 Model Search Algorithm. Starting solution: the all 2-way interactions log-linear model. Search by removing terms, adding terms, or swapping terms, using the TABU method of Drezner, Marcoulides and Salhi (1999).
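The three move types can be sketched as a generic tabu search over sets of interaction terms. This is a simplified illustration, not the specific procedure of Drezner, Marcoulides and Salhi (1999): the tabu list simply stores recently visited models, and the `score` function is a hypothetical stand-in for whatever criterion is being minimised (for example an absolute dispersion statistic).

```python
from itertools import combinations

def neighbours(model, candidates):
    """Yield models one move away: remove, add, or swap a single term."""
    inside = sorted(model)
    outside = sorted(candidates - model)
    for term in inside:                      # removing terms
        yield model - {term}
    for term in outside:                     # adding terms
        yield model | {term}
    for term_out in inside:                  # swapping terms
        for term_in in outside:
            yield (model - {term_out}) | {term_in}

def tabu_search(start, candidates, score, n_iter=20, tabu_len=5):
    """Move to the best non-tabu neighbour each step, even if it is worse."""
    current = frozenset(start)
    best, best_score = current, score(current)
    tabu = [current]                         # recently visited models
    for _ in range(n_iter):
        options = [m for m in neighbours(current, candidates) if m not in tabu]
        if not options:
            break
        current = min(options, key=score)
        tabu = (tabu + [current])[-tabu_len:]
        if score(current) < best_score:
            best, best_score = current, score(current)
    return best, best_score

# Key variable abbreviations as used on the results slides.
variables = ["ea", "s", "a", "m", "et", "ec"]
candidates = frozenset(combinations(variables, 2))   # all 15 two-way terms

# Hypothetical criterion: distance to a made-up "ideal" model.
target = candidates - {("et", "ec"), ("ea", "et")}
def score(model):
    return len(model ^ target)

best, best_score = tabu_search(candidates, candidates, score)
```

Accepting the best available move even when it worsens the criterion, while forbidding recently visited models, is what lets the search leave a local minimum instead of stopping there.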

16 Large key, simple random sample of size 9,448.

Model
Independent     386.6     701.2      0.1846     14.42     0.3435
All 2-way       104.9     280.1     -0.0036    -10.32     0.0060
Drop {ea*s}     104.6     279.8     -0.0035    -10.15     0.0060
Drop {ea*a}     105.3     281.3     -0.0032     -9.69     0.0062
Drop {ea*m}     103.8     279.1     -0.0034    -10.92     0.0065
Drop {ea*et}    108.7     290.0     -0.0024     -6.09     0.0059
Drop {ea*ec}    105.2     280.0     -0.0035    -10.60     0.0061
Drop {s*ec}     103.2     281.5     -0.0018     -5.18     0.0065
Drop {a*m}      134.0     328.6      0.0071      9.42     0.0040
Drop {a*et}     147.0     346.2      0.0018      1.52     0.0039
Drop {a*ec}     184.7     419.2      0.0316     13.27     0.0549
Drop {m*et}     108.7     287.5     -0.0032     -8.56     0.0060
Drop {m*ec}     108.3     284.0     -0.0028     -6.74     0.0052
Drop {et*ec}    132.3     308.2     -0.0015     -2.24     0.0020

17

Model
Drop {et*ec}                                           132.3     308.2     -0.0015     -2.24     0.0020
Drop {ea*s}{et*ec}                                     132.3     308.2     -0.0015     -2.27     0.0021
Drop {ea*a}{et*ec}                                     133.4     310.4     -0.0011     -1.65     0.0015
Drop {ea*et}{et*ec}                                    139.8     320.8     -0.0002     -0.20     0.0003
Drop {s*a}{et*ec}                                      132.1     309.2     -0.0013     -2.17     0.0033
Drop {s*m}{et*ec}                                      133.4     310.3     -0.0011     -1.58     0.0015
Drop {s*et}{et*ec}                                     132.4     308.5     -0.0015     -2.24     0.0021
Drop {s*ec}{et*ec}                                     130.9     310.3      0.0002      0.35     0.0039
Drop {a*et}{et*ec}                                     173.4     370.2      0.0066      2.58     0.0163
Drop {m*et}{et*ec}                                     137.3     315.8     -0.0011     -1.68     0.0019
Drop {m*ec}{et*ec}                                     134.0     311.1     -0.0008     -1.10     0.0012
In {ea} Out {et*ec}{ea*s}{ea*a}{ea*m}{ea*et}{ea*ec}    141.3     321.7      0.0002      0.28     0.0000
In {s} Out {et*ec}{ea*s}{s*a}{s*m}{s*et}{s*ec}         132.6     313.0      0.0009      1.36     0.0038

18 Record Level Risk Measures. Preferred model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}. True global risk is compared with estimated global risk.

19 Record Level Risk Measures. Preferred model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}. True global risk is compared with estimated global risk.

Cross-classification of True Record Level Risk Measures and Estimated Record Level Risk Measures:

             0 – 0.3   0.3 – 0.7   0.7 – 1   Total
0 – 0.3        1,838          97        26   1,961
0.3 – 0.7         75          57        52     184
0.7 – 1           45          49        65     159
Total          1,958         203       143   2,304
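One simple way to summarise the cross-classification is the share of sample uniques whose true and estimated risks fall in the same band. The slide does not report this summary, so it is offered here only as a quick check on the table, with variable names chosen for illustration.

```python
# Cell counts from the slide, bands ordered 0-0.3, 0.3-0.7, 0.7-1.
table = [
    [1838, 97, 26],
    [75, 57, 52],
    [45, 49, 65],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Records placed in the same band by the true and the estimated measure.
same_band = sum(table[i][i] for i in range(3))
agreement = same_band / grand_total

print(row_totals, col_totals, grand_total)
print(same_band, round(agreement, 3))
```

The margins reproduce the totals on the slide, and 1,960 of the 2,304 sample uniques (about 85%) land in the same band under both measures, mostly in the lowest band.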

20 Conclusions. Model selection by assessing over- and under-dispersion. Similar risk estimates for models with nearly Poisson dispersion. Further work: stratification of files; complex survey designs.

