Assessing Disclosure Risk in Microdata
Natalie Shlomo (University of Manchester), Chris Skinner (London School of Economics and Political Science)
Topics Covered
- Disclosure risk assessment for identity disclosure
- Probabilistic modelling for quantifying identity disclosure in sample microdata
- Extensions under misclassification/perturbation
- Extensions to sub-population microdata
- Discussion
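Disclosure Risk in Sample Microdata
Probabilistic models: $f = \{f_k\}$ denotes a q-way frequency table which is a sample from a population table $F = \{F_k\}$, where $F_k$ indicates the population count and $f_k$ the sample count in cell $k$
Disclosure risk measures: $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$ and $\tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}$
For unknown population counts, estimate the risk measures from the conditional distribution of $F_k \mid f_k$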
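Disclosure Risk in Sample Microdata
Natural assumption: $F_k \sim \mathrm{Poisson}(\lambda_k)$
Bernoulli sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$
It follows that: $f_k \sim \mathrm{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + \mathrm{Poisson}(\lambda_k (1 - \pi_k))$, where the $F_k \mid f_k$ are conditionally independent and $\pi_k$ is the sampling fraction in cell $k$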
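The Poisson and Bernoulli sampling assumptions can be checked numerically. The sketch below is an illustration only: the values of `lam` and `pi` are arbitrary choices, not taken from the slides. It simulates one cell many times and compares the simulated quantities with the closed forms that follow from Poisson population counts and binomial thinning.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0        # assumed Poisson mean of the population count in one cell
pi = 0.1         # assumed sampling fraction
n_sim = 200_000  # number of simulated replicates of the cell

F = rng.poisson(lam, size=n_sim)  # population counts F_k ~ Poisson(lam)
f = rng.binomial(F, pi)           # Bernoulli sampling: f_k | F_k ~ Bin(F_k, pi)

# Marginally, f_k should behave like Poisson(pi * lam).
mean_f = f.mean()

# Given f_k = 1, the unsampled remainder F_k - 1 should behave like
# Poisson(lam * (1 - pi)).
extra = F[f == 1] - 1
mean_extra = extra.mean()

# Probability that a sample unique is also a population unique.
p_unique_sim = np.mean(F[f == 1] == 1)
p_unique_theory = np.exp(-lam * (1 - pi))

print(mean_f, pi * lam)
print(mean_extra, lam * (1 - pi))
print(p_unique_sim, p_unique_theory)
```

Each printed pair should agree up to simulation error.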
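Disclosure Risk in Sample Microdata
Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate the parameters $\lambda_k$
Sample frequencies $f_k$ are independent Poisson distributed with mean $\mu_k = \pi_k \lambda_k$
Log-linear model for estimating $\mu_k$, expressed as: $\log \mu_k = x_k' \beta$, where $x_k$ is the row of the design matrix of key variables and their interactions for cell $k$
MLEs calculated by solving the score equations: $\sum_k \left[ f_k - \exp(x_k' \beta) \right] x_k = 0$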
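Disclosure Risk in Sample Microdata
Fitted values calculated by: $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$, with $x_k$ the design vector for cell $k$ and $\hat{\beta}$ the MLE
Individual risk measures estimated by: $\hat{r}_{1k} = \hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k (1 - \pi_k))$ and $\hat{r}_{2k} = \hat{E}(1/F_k \mid f_k = 1) = \frac{1 - \exp(-\hat{\lambda}_k (1 - \pi_k))}{\hat{\lambda}_k (1 - \pi_k)}$
Skinner and Shlomo (2008) develop goodness-of-fit criteria that minimize the bias of the disclosure risk estimates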
Disclosure Risk in Sample Microdata
Criteria related to tests for over- and under-dispersion:
- Over-fitting: sample marginal counts produce too many random zeros, leading to large expected cell counts for non-zero cells and hence under-estimation of risk
- Under-fitting: sample marginal counts do not allow for structural zeros, leading to small expected cell counts for non-zero cells and hence over-estimation of risk
The criteria select the model using a forward search algorithm that minimizes the bias
Disclosure Risk Assessment Example
Example: population N = 944,793 from the UK 2001 Census; SRS sample size n = 9,448
Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10), giving 412,080 cells
Model selection:
- Starting solution: the main-effects log-linear model indicates under-fitting (minimum error statistics too large)
- Add higher-order interaction terms until the minimum error statistics indicate fit
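The size of the key and the sparsity of the sample table follow directly from the figures on this slide. A short check (the dictionary name `key_levels` is just an illustrative label):

```python
from math import prod

# Number of categories of each key variable, as listed on the slide.
key_levels = {
    "Area": 2,
    "Sex": 2,
    "Age": 101,
    "Marital Status": 6,
    "Ethnicity": 17,
    "Economic Activity": 10,
}

n_cells = prod(key_levels.values())   # cells in the cross-classified key
N, n = 944_793, 9_448                 # population and sample sizes

sampling_fraction = n / N             # roughly a 1% sample
avg_sample_per_cell = n / n_cells     # far below 1, so the table is very sparse

print(n_cells, sampling_fraction, avg_sample_per_cell)
```

With far fewer sampled records than cells, most cells are empty, which is why the risk for sample uniques has to be estimated from a model rather than read off the sample.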
Model Search Example (SRS n=9,448)
Key variables: Area (ar), Sex (s), Age (a), Marital Status (m), Ethnicity (et), Economic Activity (ec)
Columns: estimates of the two global risk measures, followed by the two minimum error statistics (-- = value not available)

Model                  Estimate 1   Estimate 2   Min. error 1   Min. error 2
Independence - I          386.6        701.2         48.54         114.19
All 2 way - II            104.9        280.1         -1.57          -2.65
1: I + {a*ec}             243.4        494.3         54.75          59.22
2: 1 + {a*et}             180.1        411.6          3.07           9.82
3: 2 + {a*m}              152.3        343.3          0.88           1.73
4: 3 + {s*ec}             149.2        337.5          0.26           0.92
5a: 4 + {ar*a}            148.5        337.1         -0.01           0.84
5b: 4 + {s*m}             147.7        335.3          0.02           0.66
6b: 5b + {ar*a}           147.0        335.0         -0.24           0.56
6c: 5b + {ar*m}           148.9          --          -0.04           0.72
6d: 5b + {m*ec}           146.3        331.4          0.03            --
7c: 6c + {m*ec}           147.5        333.2         -0.34           0.06
7d: 6d + {ar*a}           145.6        331.0         -0.44          -0.03
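Model Search Example
Preferred model: {a*ec}{a*et}{a*m}{s*ec}{ar*a} (model 5a in the search, with global risk estimates 148.5 and 337.1)
True global risk compared with estimated global risk
[Figure, log scale: true risk measure against estimated per-record risk measure]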
Statistical Disclosure Control Methods
Agencies limit the risk of identification through statistical disclosure control (SDC) methods:
- Non-perturbative: sub-sampling, recoding and collapsing categories of key variables, deleting variables
- Perturbative: data swapping, additive noise, misclassification (PRAM) and synthetic data
Disclosure Risk in Perturbed Microdata
The model so far assumes no misclassification errors, whether arising from data processes or purposely introduced for SDC
Shlomo and Skinner (2010) address misclassification (perturbation) errors
Let: $P(\tilde{X} = k \mid X = j) = M_{kj}$, where the cross-classified key variables are $X$ in the population (fixed) and $\tilde{X}$ in the microdata (subject to misclassification or perturbation)
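Disclosure Risk in Perturbed Microdata
The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification, with $M_{kj}$ the probability that a unit with true key value $j$ is recorded as $k$, $\pi$ the sampling fraction and $F_j$ the population counts:
$\frac{M_{kk} / (1 - \pi M_{kk})}{\sum_j F_j M_{kj} / (1 - \pi M_{kj})}$ (1)
For small misclassification and small sampling fractions: $\frac{M_{kk}}{\sum_j F_j M_{kj}}$ or $\frac{M_{kk}}{\tilde{F}_k}$ (2), where $\tilde{F}_k$ is the population count in cell $k$ after misclassification
Global measure: $\tau = \sum_{k \in SU} M_{kk} / \tilde{F}_k$, summed over sample uniques (SU), estimated by: $\hat{\tau} = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$ (3)
where the per-record risk is $M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$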
Perturbation Example
Population of individuals from the 2001 United Kingdom (UK) Census, N = 1,468,255; 1% SRS sample, n = 14,683
Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10); K = 538,560 cells
Perturbation Example
Record swapping: LAD swapped randomly, e.g. for a 20% swap: Diagonal: Off diagonal: where is the number of records in the sample from LAD k
PRAM: LAD misclassified, e.g. for a 20% misclassification: Off diagonal: Parameter:
Perturbation Example
Random 20% perturbation on LAD
Global risk measures: expected correct matches on sample uniques (SUs); -- = value not available

Global risk measure                                          PRAM    Swapping
True risk measure in original sample                        358.1      362.4
Estimated naïve risk measure ignoring misclassification     349.5      358.6
Risk measure on non-perturbed records                       292.2      292.8
Risk measure under misclassification (1)                    299.7      298.9
Sample uniques                                              2,779      2,831
Approximation based on diagonals (2)                        299.8        --
Estimated risk measure under misclassification (3)          283.1      286.8

Expected correct match per sample unique: PRAM 0.108; record swapping 0.106
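The per-unique figures quoted on this slide appear to be the risk measure under misclassification (1) divided by the number of sample uniques. A quick arithmetic check using the tabulated values (the variable names are illustrative):

```python
# Global risk under misclassification (1) and number of sample uniques,
# both taken from the table on this slide.
risk_under_misclassification = {"PRAM": 299.7, "Swapping": 298.9}
sample_uniques = {"PRAM": 2779, "Swapping": 2831}

# Expected correct match per sample unique, rounded to three decimals.
per_unique = {
    method: round(risk_under_misclassification[method] / sample_uniques[method], 3)
    for method in risk_under_misclassification
}
print(per_unique)
```

The two ratios reproduce the quoted 0.108 and 0.106, so roughly one sample unique in nine or ten would be matched correctly after a 20% perturbation of LAD.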
Perturbation Example
Estimating individual per-record risk measures for a 20% random swap based on log-linear modelling
[Figure, log scale: risk measure (1) against estimated risk measure (3)]
Disclosure Risk in Subpopulation Microdata
Up to now, the sample has been a random subset of the population and the ‘intruder’ knows that an individual can be linked
New problem: the microdata contain members of a subpopulation, and membership is not known
Example: the subpopulation consists of all persons with a medical condition (where membership is sensitive)
Disclosure Risk in Subpopulation Microdata
The subpopulation is not necessarily representative of the population
As with a sample, the ‘intruder’ matches a record in the subpopulation to an individual in the population of which the subpopulation is a subset (assume no measurement errors)
Use sample microdata to make inference about population uniqueness in the subpopulation or in a sample from the subpopulation
Disclosure Risk in Subpopulation Microdata
Assume the $F_k$ are unknown, and assume a random sample s from the population U
Let $f_k$ denote the random sample frequency in cell k
Let $F'_k$ ($f'_k$) denote the subpopulation (sample) frequency in cell k
Two scenarios:
- Assess disclosure risk in the subpopulation
- Assess disclosure risk in a sample from the subpopulation
Disclosure Risk in Subpopulation Microdata
Following Skinner and Shlomo (2008): the population counts $F_k$ are Poisson distributed, with a log-linear model holding for their means
Assume that within cell k, Y takes the value 1 with probability $p_k$, independently for each of the $F_k$ units, so that the subpopulation counts $F'_k$ (the number of units in cell k with Y = 1) are binomially distributed: $F'_k \mid F_k \sim \mathrm{Bin}(F_k, p_k)$
Assume that $p_k$ follows a logistic model (which may use different X variables)
Assume a Bernoulli sample design, which preserves the Poisson distribution for the sample counts
Expressions for Risk Measures since assuming independent of For the case of a sample from the subpopulation, assume that It follows that and assuming that is independent of
Estimation of Risk Measures – 2 step approach Estimate from as in Skinner and Shlomo, 2008 Fix and since and Since estimate by method of scoring treating each of and as functions of (and fixed)
Simulation Study
Population size N = 1,163,659 (UK Census)
Subpopulation: those with long-term illness, N' = 207,537
Key: Geography (6) * Age group (14) * Sex (2) * Marital Status (6) * Ethnicity (16) * Economic Activity (10), K = 161,280
Step 1: Draw a simple random sample of size n = 23,273
Step 2: Draw samples from the subpopulation with different sampling fractions (1:20, 1:10, 1:5, 1:2 and 1:1)
Step 3: Estimate the parameters from the score function, keeping the first-step estimates fixed
Step 4: Repeat 500 times
Simulation Study

Sample fraction   Population and subpopulation uniques   Estimate   Percent relative difference
1:20                               142                     213.0            -50.00%
1:10                               266                     392.2            -47.44%
1:5                                561                     714.7            -27.40%
1:2                               1335                    1530.7            -14.66%
1:1                               2721                    2766.0             -1.65%
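The last column of the table is consistent with the convention (truth - estimate) / truth, so negative values indicate over-estimation of risk. This convention is inferred from the tabulated numbers rather than stated on the slide. A short check:

```python
# (sample fraction, true number of population-and-subpopulation uniques, estimate)
results = [
    ("1:20", 142, 213.0),
    ("1:10", 266, 392.2),
    ("1:5", 561, 714.7),
    ("1:2", 1335, 1530.7),
    ("1:1", 2721, 2766.0),
]

def pct_rel_diff(truth, estimate):
    """Percent relative difference, negative when the estimate exceeds the truth."""
    return 100.0 * (truth - estimate) / truth

pct = {frac: round(pct_rel_diff(truth, est), 2) for frac, truth, est in results}
for frac, value in pct.items():
    print(f"{frac:>5}  {value:7.2f}%")
```

The over-estimation shrinks steadily as the sampling fraction from the subpopulation increases.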
Simulation Study
Step 5: Generate a population and subpopulation from known parameters according to the assumptions of the model
Step 6: Draw a sample from the ‘synthetic’ population
Step 7: Draw a sample from the ‘synthetic’ subpopulation
Step 8: Estimate the parameters from the score function, keeping the first-step estimates fixed in the 2-step approach
Step 9: Repeat 500 times
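Steps 5 to 7 can be sketched as follows. This is an illustration under assumptions, not the study's actual setup: the number of cells, the single cell-level covariate `x`, the model coefficients and the sampling fractions are all hypothetical, and the two samples are drawn independently of each other.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 1000                               # hypothetical number of cells
x = rng.normal(size=K)                 # hypothetical cell-level covariate

# Known parameters (illustrative values only).
lam = np.exp(-0.5 + 0.8 * x)                  # log-linear model for the Poisson means
p = 1.0 / (1.0 + np.exp(-(-1.5 + 0.5 * x)))   # logistic model for subpopulation membership

# Step 5: synthetic population and subpopulation counts per cell.
F = rng.poisson(lam)                   # population counts
F_sub = rng.binomial(F, p)             # subpopulation counts, binomial within each cell

# Step 6: Bernoulli sample from the synthetic population.
pi = 0.02
f = rng.binomial(F, pi)

# Step 7: Bernoulli sample from the synthetic subpopulation.
pi_sub = 0.5
f_sub = rng.binomial(F_sub, pi_sub)

# True quantity targeted by the risk estimates: cells that are unique
# in both the subpopulation and the population.
true_joint_uniques = int(np.sum((F_sub == 1) & (F == 1)))
print(true_joint_uniques, int(np.sum(f_sub == 1)))
```

Because the synthetic data satisfy the model assumptions by construction, any remaining gap between truth and estimate in this setting reflects the estimation procedure rather than model misfit.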
Simulation Study

                                        Truth    Estimate
Population and subpopulation uniques      142         141
Population zeros                       132567      132482
Population ones                         10988       11030
Subpopulation zeros                    148172      148091
Subpopulation ones                       6395        6398
Discussion
Future work:
- The 2-step approach results in subpopulation uniques with an estimated parameter of 0 from step 1, and hence large over-estimation of the risk measures (exp(0)=1)
- Develop a modelling framework for joint likelihood estimation under an EM algorithm, where the ‘missing’ data are those sample units in the subpopulation
- Develop goodness-of-fit criteria for selecting models that minimize the bias of the estimates
Thank you for your attention