Assessing Disclosure Risk in Microdata

Natalie Shlomo, University of Manchester
Chris Skinner, London School of Economics and Political Science

Topics Covered
- Disclosure risk assessment for identity disclosure
- Probabilistic modelling for quantifying identity disclosure in sample microdata
- Extensions under misclassification/perturbation
- Extensions to sub-population microdata
- Discussion

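Disclosure Risk in Sample Microdata
Probabilistic models:
- $f = \{f_k\}$ denotes a q-way frequency table, which is a sample from a population table $F = \{F_k\}$, where $F_k$ indicates the population count and $f_k$ the sample count in cell $k = 1, \dots, K$
- Disclosure risk measures: $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$ and $\tau_2 = \sum_k I(f_k = 1) / F_k$
- For unknown population counts, estimate the risk measures from the conditional distribution of $F_k \mid f_k$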

Disclosure Risk in Sample Microdata
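- Natural assumption: $F_k \sim \mathrm{Poisson}(\lambda_k)$, independently across cells
- Bernoulli sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$
- It follows that: $f_k \sim \mathrm{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + \mathrm{Poisson}(\lambda_k (1 - \pi_k))$, where the $F_k \mid f_k$ are conditionally independent and $\pi_k$ is the sampling fraction in cell k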
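The Poisson and Bernoulli sampling assumptions can be checked with a short simulation. The sketch below assumes population counts $F_k \sim \mathrm{Poisson}(\lambda)$ and sample counts $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi)$; the values of `lam`, `pi` and `n_cells` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, pi = 3.0, 0.1            # hypothetical cell mean and sampling fraction
n_cells = 200_000             # many cells sharing the same parameters

F = rng.poisson(lam, size=n_cells)   # population counts F_k
f = rng.binomial(F, pi)              # Bernoulli sampling within each cell

mean_f = f.mean()                    # expected to be close to pi * lam
mean_unsampled = (F - f).mean()      # expected to be close to lam * (1 - pi)

# Among sample uniques, the share that are also population uniques
# is expected to be close to exp(-lam * (1 - pi)).
sample_unique = f == 1
share_pop_unique = (F[sample_unique] == 1).mean()

print(mean_f, mean_unsampled, share_pop_unique, np.exp(-lam * (1 - pi)))
```

The last comparison is the per-record quantity used later: for a sample unique, the chance of also being a population unique depends only on the mean of the unsampled part of the cell.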

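- Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate the parameters $\lambda_k$
- Sample frequencies $f_k$ are independent Poisson distributed with a mean of $\mu_k = \pi_k \lambda_k$
- Log-linear model for estimating $\mu_k$ expressed as: $\log \mu_k = x_k' \beta$, where $x_k$ is the row of the design matrix of key variables and their interactions for cell k
- MLEs calculated by solving the score function: $\sum_k [f_k - \exp(x_k' \beta)]\, x_k = 0$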

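- Fitted values calculated by: $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$
- Individual risk measures estimated by: $\hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k (1 - \pi_k))$ and $\hat{E}(1/F_k \mid f_k = 1) = [1 - \exp(-\hat{\lambda}_k (1 - \pi_k))] / [\hat{\lambda}_k (1 - \pi_k)]$
- Skinner and Shlomo (2008) develop goodness-of-fit criteria that minimize the bias of the disclosure risk estimates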
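A minimal sketch of the estimation step, assuming the Poisson log-linear setup $f_k \sim \mathrm{Poisson}(\mu_k)$ with $\log \mu_k = x_k'\beta$, $\lambda_k = \mu_k/\pi_k$, and per-record risks $\exp(-\lambda_k(1-\pi_k))$ and $[1-\exp(-\lambda_k(1-\pi_k))]/[\lambda_k(1-\pi_k)]$ for sample uniques. The key variables, `beta_true`, the equal sampling fraction and all function names are hypothetical; the score equation is solved here by plain Newton-Raphson on a main-effects model, which is one possible choice rather than the authors' implementation.

```python
import itertools
import numpy as np

def design_matrix(levels):
    """Main-effects design: intercept plus dummy columns (first level is the reference)."""
    rows = []
    for cell in itertools.product(*[range(m) for m in levels]):
        row = [1.0]
        for value, m in zip(cell, levels):
            row.extend(1.0 if value == lvl else 0.0 for lvl in range(1, m))
        rows.append(row)
    return np.array(rows)

def fit_loglinear(X, f, n_iter=50, tol=1e-10):
    """Solve sum_k (f_k - exp(x_k' beta)) x_k = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(f.mean())               # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (f - mu)
        info = X.T @ (X * mu[:, None])       # Fisher information
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def risk_measures(mu_hat, pi, f):
    """Global risk estimates summed over sample uniques (f_k = 1)."""
    nu = (mu_hat / pi) * (1.0 - pi)          # mean of the unsampled part of each cell
    p_unique = np.exp(-nu)                   # P(F_k = 1 | f_k = 1)
    inv_F = -np.expm1(-nu) / nu              # E(1 / F_k | f_k = 1)
    su = f == 1
    return p_unique[su].sum(), inv_F[su].sum()

# Hypothetical toy data: three key variables, K = 60 cells.
rng = np.random.default_rng(1)
levels = (3, 4, 5)
pi = 0.1
X = design_matrix(levels)
beta_true = np.array([np.log(15.0), 0.4, -0.3, 0.3, -0.2, 0.5, 0.2, -0.4, 0.3, 0.1])
F = rng.poisson(np.exp(X @ beta_true))       # population counts
f = rng.binomial(F, pi)                      # sample counts

beta_hat = fit_loglinear(X, f)               # models mu_k = pi * lambda_k
mu_hat = np.exp(X @ beta_hat)
tau1_hat, tau2_hat = risk_measures(mu_hat, pi, f)
print(tau1_hat, tau2_hat, (f == 1).sum())
```

Interaction terms would be added as extra columns of `X`; the risk estimates then change only through the fitted values `mu_hat`.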

Criteria related to tests for over- and under-dispersion:
- Over-fitting: sample marginal counts produce too many random zeros, leading to large expected cell counts for non-zero cells and hence under-estimation of risk
- Under-fitting: sample marginal counts do not allow for structural zeros, leading to small expected cell counts for non-zero cells and hence over-estimation of risk
- The criteria select the model using a forward search algorithm which minimizes the bias

Disclosure Risk Assessment Example
Example:
- Population N = 944,793 from UK 2001 Census
- SRS sample size n = 9,448
- Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10), K = 412,080 cells
Model selection:
- Starting solution: main-effects log-linear model indicates under-fitting (minimum error statistics too large)
- Add in higher interaction terms until the minimum error statistics indicate fit

Model Search Example (SRS n=9,448)
True values: (missing)
Key variables: Area (ar), Sex (s), Age (a), Marital Status (m), Ethnicity (et), Economic Activity (ec)

Model | $\hat{\tau}_1$ | $\hat{\tau}_2$ | $\hat{B}_1/\sqrt{v}$ | $\hat{B}_2/\sqrt{v}$
Independence - I | 386.6 | 701.2 | 48.54 | 114.19
All 2 way - II | 104.9 | 280.1 | -1.57 | -2.65
1: I + {a*ec} | 243.4 | 494.3 | 54.75 | 59.22
2: {a*et} | 180.1 | 411.6 | 3.07 | 9.82
3: {a*m} | 152.3 | 343.3 | 0.88 | 1.73
4: {s*ec} | 149.2 | 337.5 | 0.26 | 0.92
5a: {ar*a} | 148.5 | 337.1 | -0.01 | 0.84
5b: {s*m} | 147.7 | 335.3 | 0.02 | 0.66
6b: 5b + {ar*a} | 147.0 | 335.0 | -0.24 | 0.56
6c: 5b + {ar*m} | 148.9 | (missing) | -0.04 | 0.72
6d: 5b + {m*ec} | 146.3 | 331.4 | (missing) | 0.03
7c: 6c + {m*ec} | 147.5 | 333.2 | -0.34 | 0.06
7d: 6d + {ar*a} | 145.6 | 331.0 | -0.44 | -0.03

Model Search Example
Preferred model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}
True global risk: (missing)
Estimated global risk: (missing)
Figure: estimated per-record risk measure against true risk measure (log scale)

Statistical Disclosure Control Methods
Agencies limit the risk of identification through statistical disclosure control (SDC) methods:
- Non-perturbative: sub-sampling, recoding and collapsing categories of key variables, deleting variables
- Perturbative: data swapping, additive noise, misclassification (PRAM) and synthetic data

Disclosure Risk in Perturbed Microdata
- The model assumes no misclassification errors, either arising from data processes or purposely introduced for SDC
- Shlomo and Skinner (2010) address misclassification (perturbation) errors
- Let $P(\tilde{X} = k \mid X = j) = M_{kj}$, where the cross-classified key variables are $X$ in the population (fixed) and $\tilde{X}$ in the microdata (subject to misclassification or perturbation)

Disclosure Risk in Perturbed Microdata
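The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification, where $M_{kj}$ is the probability that a unit with true key value j is recorded as k and $\pi$ is the sampling fraction:
$$\frac{M_{kk}/(1 - \pi M_{kk})}{\sum_j F_j M_{kj}/(1 - \pi M_{kj})} \quad (1)$$
For small misclassification and small sampling fractions:
$$\frac{M_{kk}}{\sum_j F_j M_{kj}} \quad \text{or} \quad \frac{M_{kk}}{\tilde{F}_k} \quad (2)$$
Global measure: $\tau = \sum_k I(\tilde{f}_k = 1)\, M_{kk} / \tilde{F}_k$, estimated by:
$$\hat{\tau} = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1) \quad (3)$$
where the per-record risk is $M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$, and $\tilde{F}_k$ and $\tilde{f}_k$ are the population and sample counts in cell k of the misclassified key.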
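A small numerical sketch of the per-record risk under misclassification. It assumes the risk takes the form $[M_{kk}/(1-\pi M_{kk})] / [\sum_j F_j M_{kj}/(1-\pi M_{kj})]$, with the small-$\pi$ approximation $M_{kk}/\sum_j F_j M_{kj}$, where $M_{kj}$ is the probability that true category j is recorded as k. The counts `F`, the matrix `M` and the function names are hypothetical.

```python
import numpy as np

def match_risk(F, M, pi):
    """Per-record risk for each category k.
    M[k, j] = P(recorded category = k | true category = j); columns sum to 1."""
    d = np.diag(M)
    numerator = d / (1.0 - pi * d)
    denominator = (M / (1.0 - pi * M)) @ F   # sum_j F_j M_kj / (1 - pi M_kj)
    return numerator / denominator

def match_risk_approx(F, M):
    """Approximation for small sampling fractions: M_kk / sum_j F_j M_kj."""
    return np.diag(M) / (M @ F)

F = np.array([50.0, 30.0, 20.0])             # hypothetical population counts
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])              # hypothetical misclassification matrix
pi = 0.01

exact = match_risk(F, M, pi)
approx = match_risk_approx(F, M)
no_perturbation = match_risk(F, np.eye(3), pi)   # reduces to 1 / F_k
print(exact, approx, no_perturbation)
```

With an identity matrix the measure reduces to $1/F_k$, the usual risk for an unperturbed record, which is a quick sanity check on the assumed form.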

Perturbation Example
- Population of individuals from the United Kingdom (UK) Census, N = 1,468,255
- 1% SRS sample, n = 14,683
- Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10), K = 538,560

Perturbation Example
Record swapping: LAD swapped randomly, e.g. for a 20% swap:
- Diagonal: (expression missing)
- Off diagonal: (expression missing), where $n_k$ is the number of records in the sample from LAD k
PRAM: LAD misclassified, e.g. for a 20% misclassification:
- Off diagonal: (expression missing)
- Parameter: (expression missing)
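The two perturbation schemes can be sketched on a single categorical variable. Because the slide's matrix entries are not available, this sketch makes its own assumptions: PRAM keeps a category with probability 0.8 and otherwise moves uniformly to one of the other categories, and random swapping exchanges values between randomly paired records. The function names and the toy variable `x` are hypothetical.

```python
import numpy as np

def pram_matrix(n_cat, p_change):
    """M[k, j] = P(new = k | old = j). Assumption: uniform off-diagonal entries."""
    M = np.full((n_cat, n_cat), p_change / (n_cat - 1))
    np.fill_diagonal(M, 1.0 - p_change)
    return M

def apply_pram(x, M, rng):
    """Draw a new category for each record from column x_i of M."""
    cum = np.cumsum(M, axis=0)                    # cumulative probabilities per column
    u = rng.random(x.shape[0])
    new = (u[:, None] >= cum[:, x].T).sum(axis=1)
    return np.minimum(new, M.shape[0] - 1)        # guard against rounding in cum

def random_swap(x, frac, rng):
    """Select a fraction of records and exchange values between random pairs."""
    x = x.copy()
    m = int(frac * len(x)) // 2 * 2               # even number of records
    idx = rng.choice(len(x), size=m, replace=False)
    a, b = idx[: m // 2], idx[m // 2:]
    x[a], x[b] = x[b].copy(), x[a].copy()
    return x

rng = np.random.default_rng(2)
n_cat = 11                                        # e.g. 11 local authorities
x = rng.integers(0, n_cat, size=20_000)           # hypothetical LAD codes

M = pram_matrix(n_cat, 0.2)
x_pram = apply_pram(x, M, rng)
x_swap = random_swap(x, 0.2, rng)

print((x_pram != x).mean(), (x_swap != x).mean())
```

Swapping preserves the marginal counts of the variable exactly, while PRAM preserves them only in expectation under a suitably chosen matrix. Under swapping, the share of records that actually change is below the nominal rate, because some pairs share the same category.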

Perturbation Example: random 20% perturbation on LAD
Global risk measures: expected correct matches on sample uniques (SUs)

Global risk measure | PRAM | Swapping
True risk measure in original sample | 358.1 | 362.4
Estimated naïve risk measure ignoring misclassification | 349.5 | 358.6
Risk measure on non-perturbed records | 292.2 | 292.8
Risk measure under misclassification (1) | 299.7 | 298.9
Sample uniques | 2,779 | 2,831
Approximation based on diagonals (2) | 299.8 (column not recoverable) |
Estimated risk measure under misclassification (3) | 283.1 | 286.8

Expected correct match per sample unique: PRAM 299.7/2,779 = 0.108; record swapping 298.9/2,831 = 0.106

Perturbation Example
Estimating individual per-record risk measures for a 20% random swap based on log-linear modelling (log scale)
Figure: Estimated Risk Measure (3) against Risk Measure (1)

Disclosure Risk in Subpopulation Microdata
- Until now, the sample has been a random subset of the population, and the 'intruder' knows that an individual can be linked
- New problem: the microdata contain members of a subpopulation, and membership is not known
- Example: the subpopulation is all persons with a medical condition (where membership is sensitive)

- The subpopulation is not necessarily representative of the population
- As with a sample, the 'intruder' matches a record in the subpopulation to an individual in the population of which the subpopulation is a subset (assume no measurement errors)
- Use sample microdata to make inference about population uniqueness in the subpopulation or in a sample from the subpopulation

- Assume the Fk are unknown, and assume a random sample s from the population U
- Let fk denote the random sample frequency in cell k
- Let F'k (f'k) denote the sub-population (sample) frequency in cell k
Two scenarios:
- Assess disclosure risk in the subpopulation
- Assess disclosure risk in a sample from the subpopulation

- Following Skinner and Shlomo (2008), the Fk are Poisson distributed with means for which the log-linear model holds
- Assume that within cell k, Y takes the value 1 with probability pk, independently for each of the Fk units, so that, given Fk, the F'k are binomially distributed with parameters Fk and pk
- Assume that pk follows a logistic model (which may use different X variables)
- Assume a Bernoulli sample design, which preserves the Poisson distribution for sample counts

Expressions for Risk Measures
since assuming independent of For the case of a sample from the subpopulation, assume that It follows that and assuming that is independent of

Estimation of Risk Measures – 2 step approach
Estimate from as in Skinner and Shlomo (2008) Fix and since and Since estimate by method of scoring treating each of and as functions of (and fixed)

Simulation Study
- Population size N = 1,163,659, UK Census
- Subpopulation: those with long-term illness, N' = 207,537
- Key: Geography (6) * Age group (14) * Sex (2) * Marital Status (6) * Ethnicity (16) * Economic Activity (10), K = 161,280
Step 1: Draw a simple random sample of size n = 23,273
Step 2: Draw samples from the subpopulation with different sampling fractions
Step 3: Estimate parameters and from score function with fixed
Step 4: Repeat 500 times

Simulation Study

Sample fraction | Population and subpopulation uniques | Estimate | Percent relative difference
1:20 | 142 | 213 | -50.00%
1:10 | 266 | 392.2 | -47.44%
1:5 | 561 | 714.7 | -27.40%
1:2 | 1335 | 1530.7 | -14.66%
1:1 | 2721 | 2766 | -1.65%
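The percent relative differences in this table are consistent with the formula 100 * (true - estimate) / true. That formula is inferred from the reported numbers, not stated on the slide. A short check, using the table's values as given:

```python
# Values copied from the table; the formula itself is an inference from those values.
rows = {
    "1:20": (142, 213),
    "1:10": (266, 392.2),
    "1:5": (561, 714.7),
    "1:2": (1335, 1530.7),
    "1:1": (2721, 2766),
}

rel_diff = {
    fraction: 100.0 * (truth - estimate) / truth
    for fraction, (truth, estimate) in rows.items()
}

for fraction, value in rel_diff.items():
    print(f"{fraction}: {value:.2f}%")
```

Negative values indicate that the estimate exceeds the true count, and the over-estimation shrinks as the sampling fraction from the subpopulation grows.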

Simulation Study
Step 5: Generate population and subpopulation from known parameters according to the assumptions of the model
Step 6: Draw a sample from the 'synthetic' population
Step 7: Draw a sample from the 'synthetic' subpopulation
Step 8: Estimate parameters and from score function keeping fixed in the 2-step approach
Step 9: Repeat 500 times

Simulation Study

Measure | Truth | Estimate
Population and subpopulation uniques | 142 | 141
Population zeros | 132567 | 132482
Population ones | 10988 | 11030
Subpopulation zeros | 148172 | 148091
Subpopulation ones | 6395 | 6398

Discussion
Future work:
- The 2-step approach results in subpopulation uniques with an estimated parameter of 0 from step 1 and hence large over-estimation of risk measures (exp(0)=1)
- Develop a modelling framework for estimating the joint likelihood of and under an EM algorithm where the 'missing' are those sample units in the sub-population
- Develop goodness-of-fit criteria for selecting models which minimize the bias of the estimates

Thank you for your attention
