WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures

Natalie Shlomo, University of Southampton; Office for National Statistics, n.shlomo@soton.ac.uk
Chris Skinner, University of Southampton, C.J.Skinner@soton.ac.uk

Disclosure Risk Assessment for Microdata
Assume:
- sample
- categorical key variables
- no measurement error
Seek:
- record level risk measures
- aggregated to file level measures

Record Level Measures
- Record with combination $k$ of key variable values
- Sample count with same combination = $f_k$
- Population count with same combination = $F_k$
- Only consider sample unique records, i.e. $f_k = 1$
- Pr(population unique) = $\Pr(F_k = 1 \mid f_k = 1)$
- Pr(correct match) = $E(1/F_k \mid f_k = 1)$

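Aggregated File-level Measures
- Expected number of population uniques in sample: $\tau_1 = \sum_{k \in SU} \Pr(F_k = 1 \mid f_k = 1)$
- Expected number of correct matches among sample uniques to the population: $\tau_2 = \sum_{k \in SU} E(1/F_k \mid f_k = 1)$
Note: $SU = \{k : f_k = 1\}$ is the set of sample uniques, with $f_k$ and $F_k$ the sample and population counts for combination $k$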
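When the population is available, as in the Census experiments reported later, the realised versions of these measures can be counted directly. The following is a minimal sketch, not the authors' code: the function name, the toy keys and the dictionary layout are all hypothetical, and it assumes the sample is a subset of the population.

```python
from collections import Counter

def true_risk_measures(population_keys, sample_keys):
    """Realised record-level measures for sample uniques and their file-level sums.

    Assumes every sampled record also appears in the population list.
    """
    F = Counter(population_keys)  # population count per key combination
    f = Counter(sample_keys)      # sample count per key combination
    record_level = {}
    for k, count in f.items():
        if count == 1:            # only sample uniques are considered
            record_level[k] = {
                "population_unique": F[k] == 1,
                "match_probability": 1.0 / F[k],
            }
    n_pop_uniques = sum(r["population_unique"] for r in record_level.values())
    sum_inv_F = sum(r["match_probability"] for r in record_level.values())
    return record_level, n_pop_uniques, sum_inv_F

# Toy data (hypothetical): each key is a (sex, age band) combination.
population = ([("F", "20-29")] * 3 + [("M", "20-29")]
              + [("F", "30-39")] * 2 + [("M", "30-39")] * 4)
sample = [("F", "20-29"), ("M", "20-29"), ("M", "30-39"), ("M", "30-39")]

records, n_pop_uniques, sum_inv_F = true_risk_measures(population, sample)
print(n_pop_uniques, round(sum_inv_F, 3))
```

In the toy data two combinations are sample unique; one of them is also population unique, and the other shares its combination with two further population members.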

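Estimation Problem
To make inference about:
- Record level measures $\Pr(F_k = 1 \mid f_k = 1)$ and $E(1/F_k \mid f_k = 1)$ for sample unique records
- File level measures: the sums of these over the sample uniques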

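Log-linear Model
1. $F_k \sim \text{Poisson}(\lambda_k)$
2. $f_k \sim \text{Poisson}(\pi\lambda_k)$ and $F_k - f_k \sim \text{Poisson}((1-\pi)\lambda_k)$, independent given $\lambda_k$
3. $\log \lambda_k = x_k'\beta$, where $x_k$ is the design vector for combination $k$ and $\pi$ is the sampling fraction
Estimate $\beta$ by maximum likelihood, giving $\hat\lambda_k = \exp(x_k'\hat\beta)$, $\widehat{\Pr}(F_k = 1 \mid f_k = 1) = \exp\{-(1-\pi)\hat\lambda_k\}$, $\hat{E}(1/F_k \mid f_k = 1) = [1 - \exp\{-(1-\pi)\hat\lambda_k\}] / \{(1-\pi)\hat\lambda_k\}$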
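As an illustration of how fitted cell means turn into risk estimates, the sketch below assumes the usual Poisson set-up: population counts are Poisson with mean lambda_k, sampling is Bernoulli with fraction pi, so that given a sample unique the remaining population count is Poisson with mean (1 - pi) * lambda_k. To keep the example closed-form it fits only the independence model to a hypothetical two-variable key; the function names and toy data are assumptions, and a real application would fit a richer log-linear model with statistical software.

```python
import math
from collections import Counter

def independence_fit(sample_keys):
    """Fitted sample means (estimates of pi * lambda_k) under independence, two-variable key."""
    n = len(sample_keys)
    rows = Counter(a for a, _ in sample_keys)
    cols = Counter(b for _, b in sample_keys)
    return {(a, b): rows[a] * cols[b] / n for a in rows for b in cols}

def estimated_risk(sample_keys, pi):
    """Estimated record-level measures for sample uniques and their file-level sums."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    f = Counter(sample_keys)
    mu = independence_fit(sample_keys)
    per_record = {}
    for k, count in f.items():
        if count == 1:
            rest = (1.0 - pi) * mu[k] / pi       # (1 - pi) * lambda_k
            p_unique = math.exp(-rest)           # Pr(F_k = 1 | f_k = 1)
            p_match = -math.expm1(-rest) / rest  # E(1 / F_k | f_k = 1)
            per_record[k] = (p_unique, p_match)
    tau1_hat = sum(p for p, _ in per_record.values())
    tau2_hat = sum(m for _, m in per_record.values())
    return per_record, tau1_hat, tau2_hat

# Toy sample (hypothetical) with a 50% sampling fraction.
toy_sample = [("F", "a"), ("F", "a"), ("F", "b"), ("M", "a")]
per_record, tau1_hat, tau2_hat = estimated_risk(toy_sample, pi=0.5)
print(round(tau1_hat, 3), round(tau2_hat, 3))
```

The match probability is never smaller than the uniqueness probability for the same record, since a correct match is still possible when the record is not population unique.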

Some Literature
- Skinner and Holmes (1998, JOS): good properties of the estimated risk measure under an all two-way interactions log-linear model that includes an additional term
- Elamir and Skinner (2006, JOS): good properties of both estimated risk measures under the all two-way interactions model, with no need for the additional term

Model Sensitivity
All two-way interactions model performs well, but…
- still evidence of some model-dependence of the risk estimates in the neighborhood of this model
- tendency for estimated risk to decrease as model complexity increases

Model Choice
- Goodness of fit tests? Pearson? Likelihood ratio?
- AIC, BIC?
Problems with very large and sparse tables

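Bias Criterion
- Allow for small departures from the assumed log-linear model
- Estimate the bias of the risk estimate under such departures
- Choose the model to minimise the estimated bias
- Similar to choosing the model to minimise over- or under-dispersion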

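Minimising Over- (Under-) Dispersion
Cameron and Trivedi (1998):
- model for the variance of the counts as a function of the mean
- estimated dispersion parameter measures the degree of over- or under-dispersion
- test statistic tests the hypothesis of equal dispersion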
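One common regression-based form of the Cameron and Trivedi test can be sketched as follows. It assumes the alternative Var(y) = mu + alpha * mu^2 and regresses ((y - mu)^2 - y) / mu on mu without an intercept; whether the slides use this variance function or another is not stated, so treat the choice, the function name and the toy numbers as assumptions.

```python
import math

def cameron_trivedi(y, mu):
    """Auxiliary no-intercept OLS regression of z on mu.

    Returns (alpha_hat, t_statistic). alpha_hat > 0 suggests over-dispersion,
    alpha_hat < 0 under-dispersion, relative to the Poisson model.
    """
    n = len(y)
    z = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu)]
    sxx = sum(mi * mi for mi in mu)
    alpha_hat = sum(mi * zi for mi, zi in zip(mu, z)) / sxx
    resid_ss = sum((zi - alpha_hat * mi) ** 2 for mi, zi in zip(mu, z))
    std_err = math.sqrt(resid_ss / (n - 1) / sxx)
    return alpha_hat, alpha_hat / std_err

# Toy counts and fitted means (hypothetical), chosen to be over-dispersed.
counts = [0, 5, 0, 6, 1, 9]
fitted = [2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
alpha_hat, t_stat = cameron_trivedi(counts, fitted)
print(round(alpha_hat, 4), round(t_stat, 2))
```

Under the null hypothesis of equal dispersion the t statistic is compared with a standard normal reference; a value near zero is what the model search aims for.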

Samples from 2001 UK Census
Two areas with population of 944,793.
- 'Large' key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10): 412,080 cells
- 'Small' key: same except Age (18): 73,440 cells
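The cell counts are the products of the category counts, and the sample sizes quoted on the results slides (18,896, 9,448 and 4,724) correspond to sampling fractions of roughly 2%, 1% and 0.5% of the population. A quick check, with variable names chosen here for illustration:

```python
import math

large_key = {"Area": 2, "Sex": 2, "Age": 101, "Marital Status": 6,
             "Ethnicity": 17, "Economic Activity": 10}
small_key = dict(large_key, Age=18)  # same key with Age grouped into 18 bands

large_cells = math.prod(large_key.values())
small_cells = math.prod(small_key.values())
print(large_cells, small_cells)  # 412080 73440

population_size = 944_793
sampling_fractions = {n: n / population_size for n in (18_896, 9_448, 4_724)}
for n, frac in sampling_fractions.items():
    print(n, round(frac, 4))
```

With 412,080 cells and only 4,724 to 9,448 sampled records, most cells of the large key are empty, which is the sparsity problem raised under Model Choice.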

Small key, simple random sample of size 18,896
True values: number of population uniques in sample; sum of $1/F_k$ over sample uniques, where $F_k$ is the population count

Model          Estimates                             Cameron-Trivedi
Independence   174.3    355.7    2.062     9.56      1.287
All 2-way       85.5    222.0    0.006     2.00      0.007

Large key, simple random sample of size 4,724
True values

Model          Estimates                             Cameron-Trivedi
Independence   197.4    385.1    0.088    10.58      0.0296
All 2-way       35.9    112.3   -0.003    -7.96      0.0004

Model Search Algorithm
Starting solution: all 2-way interactions log-linear model
Search by:
- Removing terms
- Adding terms
- Swapping terms
TABU method of Drezner, Marcoulides and Salhi (1999)
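The three moves can be sketched as a generic tabu search over sets of model terms. This is not the specific procedure of Drezner, Marcoulides and Salhi (1999), whose details are not given here: the tabu rule (a fixed-length list of recently visited models), the function names and the toy `score` are all assumptions, and the sketch ignores any hierarchy constraints between terms. In practice `score` would refit the log-linear model and return a dispersion diagnostic to be minimised.

```python
from collections import deque

def neighbours(model, candidates):
    """Models reachable by removing, adding or swapping a single term."""
    model = frozenset(model)
    outside = frozenset(candidates) - model
    for term in model:
        yield model - {term}                           # remove a term
    for term in outside:
        yield model | {term}                           # add a term
    for term_out in model:
        for term_in in outside:
            yield (model - {term_out}) | {term_in}     # swap two terms

def tabu_search(start, candidates, score, n_iter=50, tabu_size=10):
    """Move to the best non-tabu neighbour each step, keeping the best model seen."""
    current = best = frozenset(start)
    best_score = score(best)
    tabu = deque([current], maxlen=tabu_size)  # recently visited models
    for _ in range(n_iter):
        options = [m for m in neighbours(current, candidates) if m not in tabu]
        if not options:
            break
        current = min(options, key=score)      # accepted even if worse than before
        tabu.append(current)
        current_score = score(current)
        if current_score < best_score:
            best, best_score = current, current_score
    return best, best_score

# Toy problem (hypothetical): the score counts terms that differ from a target model.
all_terms = frozenset({"s*a", "s*m", "a*m", "a*et", "m*ec", "et*ec"})
target = frozenset({"s*a", "s*m", "a*m", "a*et"})
best_model, best_score = tabu_search(all_terms, all_terms, lambda m: len(m ^ target))
print(sorted(best_model), best_score)
```

Accepting the best available neighbour even when it is worse, while forbidding recently visited models, is what lets the search leave a local optimum instead of cycling back to it.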

Large key, simple random sample of size 9,448
True values
Terms: ea = Area, s = Sex, a = Age, m = Marital Status, et = Ethnicity, ec = Economic Activity

Model
Independent      386.6    701.2    0.1846    14.42    0.3435
All 2-way        104.9    280.1   -0.0036   -10.32    0.0060
Drop {ea*s}      104.6    279.8   -0.0035   -10.15    0.0060
Drop {ea*a}      105.3    281.3   -0.0032    -9.69    0.0062
Drop {ea*m}      103.8    279.1   -0.0034   -10.92    0.0065
Drop {ea*et}     108.7    290.0   -0.0024    -6.09    0.0059
Drop {ea*ec}     105.2    280.0   -0.0035   -10.60    0.0061
Drop {s*ec}      103.2    281.5   -0.0018    -5.18    0.0065
Drop {a*m}       134.0    328.6    0.0071     9.42    0.0040
Drop {a*et}      147.0    346.2    0.0018     1.52    0.0039
Drop {a*ec}      184.7    419.2    0.0316    13.27    0.0549
Drop {m*et}      108.7    287.5   -0.0032    -8.56    0.0060
Drop {m*ec}      108.3    284.0   -0.0028    -6.74    0.0052
Drop {et*ec}     132.3    308.2   -0.0015    -2.24    0.0020

True values

Model
Drop {et*ec}                                           132.3    308.2   -0.0015    -2.24    0.0020
Drop {ea*s}{et*ec}                                     132.3    308.2   -0.0015    -2.27    0.0021
Drop {ea*a}{et*ec}                                     133.4    310.4   -0.0011    -1.65    0.0015
Drop {ea*et}{et*ec}                                    139.8    320.8   -0.0002    -0.20    0.0003
Drop {s*a}{et*ec}                                      132.1    309.2   -0.0013    -2.17    0.0033
Drop {s*m}{et*ec}                                      133.4    310.3   -0.0011    -1.58    0.0015
Drop {s*et}{et*ec}                                     132.4    308.5   -0.0015    -2.24    0.0021
Drop {s*ec}{et*ec}                                     130.9    310.3    0.0002     0.35    0.0039
Drop {a*et}{et*ec}                                     173.4    370.2    0.0066     2.58    0.0163
Drop {m*et}{et*ec}                                     137.3    315.8   -0.0011    -1.68    0.0019
Drop {m*ec}{et*ec}                                     134.0    311.1   -0.0008    -1.10    0.0012
In {ea} Out {et*ec}{ea*s}{ea*a}{ea*m}{ea*et}{ea*ec}    141.3    321.7    0.0002     0.28    0.0000
In {s} Out {et*ec}{ea*s}{s*a}{s*m}{s*et}{s*ec}         132.6    313.0    0.0009     1.36    0.0038

Record Level Risk Measures
Preferred Model: {ea}{s*a}{s*m}{s*et}{s*ec}{a*m}{a*et}{a*ec}{m*et}{m*ec}
True Global Risk:
Estimated Global Risk:

True record level risk measures (rows) by estimated record level risk measures (columns)

             0 – 0.3    0.3 – 0.7    0.7 – 1    Total
0 – 0.3        1,838           97         26    1,961
0.3 – 0.7         75           57         52      184
0.7 – 1           45           49         65      159
Total          1,958          203        143    2,304
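A cross-classification of this kind can be produced by banding each pair of true and estimated record-level measures. The sketch below is illustrative only: which band a boundary value such as 0.3 falls into is an assumption, as are the function names and the toy pairs. The second half re-enters the counts from the table and computes the share of sample uniques whose true and estimated risks fall in the same band.

```python
def band(risk):
    """Band index: 0 for [0, 0.3), 1 for [0.3, 0.7), 2 for [0.7, 1]."""
    if risk < 0.3:
        return 0
    if risk < 0.7:
        return 1
    return 2

def cross_tab(true_risks, estimated_risks):
    """3 x 3 table of counts: rows by true band, columns by estimated band."""
    table = [[0, 0, 0] for _ in range(3)]
    for t, e in zip(true_risks, estimated_risks):
        table[band(t)][band(e)] += 1
    return table

# Toy pairs (hypothetical) of true and estimated risks.
toy_table = cross_tab([1.0, 0.5, 0.1, 0.2], [0.8, 0.2, 0.1, 0.4])

# Counts as reported in the table above, excluding the totals.
reported = [[1838, 97, 26],
            [75, 57, 52],
            [45, 49, 65]]
total = sum(map(sum, reported))
same_band = sum(reported[i][i] for i in range(3))
print(total, same_band, round(same_band / total, 3))
```

The reported counts sum to the 2,304 sample uniques shown in the Total cell, and the diagonal gives the records classified into the same band by both measures.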

Conclusions
- Model selection by assessing over- or under-dispersion
- Similar risk estimates for models with nearly Poisson dispersion
Further work:
- stratification of files
- complex survey designs
