Assessing Disclosure Risk in Microdata


Assessing Disclosure Risk in Microdata
Natalie Shlomo, University of Manchester
Chris Skinner, London School of Economics and Political Science

Topics Covered
- Disclosure risk assessment for identity disclosure
- Probabilistic modelling for quantifying identity disclosure in sample microdata
- Extensions under misclassification/perturbation
- Extensions to sub-population microdata
- Discussion

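Disclosure Risk in Sample Microdata
Probabilistic models:
- $f = \{f_k\}$ denotes a q-way frequency table which is a sample from a population table $F = \{F_k\}$, where $F_k$ is the population count and $f_k$ the sample count in cell $k = 1, \dots, K$
- Disclosure risk measure: $\tau = \sum_k I(f_k = 1)\, 1/F_k$
- For unknown population counts, estimate from the conditional distribution of $F_k \mid f_k$: $\hat{\tau} = \sum_k I(f_k = 1)\, \hat{E}(1/F_k \mid f_k = 1)$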

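Disclosure Risk in Sample Microdata
- Natural assumption for the population count in cell $k$: $F_k \sim \text{Poisson}(\lambda_k)$
- Bernoulli sampling for the sample count: $f_k \mid F_k \sim \text{Bin}(F_k, \pi_k)$
- It follows that $f_k \sim \text{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim f_k + \text{Poisson}(\lambda_k (1 - \pi_k))$, where the $F_k \mid f_k$ are conditionally independent and $\pi_k$ is the sampling fraction in cell $k$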
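The two consequences of the Poisson and Bernoulli assumptions can be checked by simulation. The sketch below assumes a single cell with $F \sim \text{Poisson}(\lambda)$ and Bernoulli sampling with fraction $\pi$. The values $\lambda = 5$ and $\pi = 0.2$ and all function names are illustrative and do not come from the slides.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate_cell(rng, lam, pi):
    """Return (F, f): the population count and the Bernoulli-sampled count for one cell."""
    F = poisson_draw(rng, lam)
    f = sum(1 for _ in range(F) if rng.random() < pi)
    return F, f

rng = random.Random(2024)
lam, pi = 5.0, 0.2          # illustrative values, not from the slides
draws = [simulate_cell(rng, lam, pi) for _ in range(100_000)]

# If f ~ Poisson(pi * lam), its mean and variance should both be close to pi * lam.
sample_counts = [f for _, f in draws]
mean_f = sum(sample_counts) / len(sample_counts)
var_f = sum((f - mean_f) ** 2 for f in sample_counts) / len(sample_counts)

# Among cells with f = 1, the unsampled remainder F - f should average lam * (1 - pi).
remainders = [F - f for F, f in draws if f == 1]
mean_remainder = sum(remainders) / len(remainders)

print(f"mean f = {mean_f:.3f}, var f = {var_f:.3f}, target = {pi * lam:.3f}")
print(f"mean (F - f | f = 1) = {mean_remainder:.3f}, target = {lam * (1 - pi):.3f}")
```

Equal mean and variance is only a necessary condition for the Poisson form, so this is a sanity check rather than a proof.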

Disclosure Risk in Sample Microdata
- Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate the parameters $\lambda_k$
- Sample frequencies $f_k$ are independent Poisson distributed with mean $\mu_k = \pi_k \lambda_k$
- Log-linear model for estimating $\mu_k$: $\log \mu_k = x_k' \beta$, where $x_k$ is the row of the design matrix of key variables and their interactions for cell $k$
- MLEs calculated by solving the score function: $\sum_k [f_k - \exp(x_k' \beta)]\, x_k = 0$
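For the main-effects (independence) model on a two-way table, the Poisson MLE has the closed form $\hat{\mu}_{ij} = f_{i+} f_{+j} / n$, so the score equations can be verified directly without an iterative solver. The sketch below assumes the usual Poisson score $\sum_k (f_k - \mu_k) x_k$. The counts and the dummy coding are made up for illustration.

```python
# Hypothetical 2 x 3 table of sample counts f_k (illustrative numbers only).
counts = [
    [4, 1, 0],
    [2, 6, 3],
]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Closed-form MLE of the Poisson main-effects (independence) log-linear model.
fitted = [[r * c / n for c in col_totals] for r in row_totals]

def design_vector(i, j):
    """x_k for cell (i, j): intercept, row-2 dummy, column-2 dummy, column-3 dummy."""
    return [1, int(i == 1), int(j == 1), int(j == 2)]

# Score function: sum over cells of (f_k - mu_k) * x_k, which should vanish at the MLE.
score = [0.0] * 4
for i, row in enumerate(counts):
    for j, f in enumerate(row):
        for p, x in enumerate(design_vector(i, j)):
            score[p] += (f - fitted[i][j]) * x

print([abs(round(s, 10)) for s in score])
```

Each score component is an observed margin minus the corresponding fitted margin, which is why the independence fit zeroes all of them. Models with interaction terms generally need an iterative method such as Newton-Raphson or iterative proportional fitting.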

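Disclosure Risk in Sample Microdata
- Fitted values calculated by $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$
- Individual risk measures estimated by:
  - $\hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k (1 - \pi_k))$
  - $\hat{E}(1/F_k \mid f_k = 1) = [1 - \exp(-\hat{\lambda}_k (1 - \pi_k))] / [\hat{\lambda}_k (1 - \pi_k)]$
- Skinner and Shlomo (2008) develop goodness-of-fit criteria which minimize the bias of disclosure risk estimates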
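The two per-record measures follow from $F_k - 1 \mid f_k = 1 \sim \text{Poisson}(\theta_k)$ with $\theta_k = \lambda_k (1 - \pi_k)$. A minimal sketch under that assumption is given below, with a series evaluation as a cross-check of the closed form. The fitted means and the function names are hypothetical.

```python
import math

def prob_population_unique(lam, pi):
    """P(F_k = 1 | f_k = 1) when F_k - f_k ~ Poisson(lam * (1 - pi))."""
    return math.exp(-lam * (1.0 - pi))

def expected_match_prob(lam, pi):
    """E(1 / F_k | f_k = 1) under the same Poisson assumption."""
    theta = lam * (1.0 - pi)
    if theta == 0.0:
        return 1.0  # limit of (1 - exp(-theta)) / theta as theta -> 0
    return -math.expm1(-theta) / theta

def expected_match_prob_series(lam, pi, terms=200):
    """Direct evaluation of sum_y P(Y = y) / (1 + y), used only as a cross-check."""
    theta = lam * (1.0 - pi)
    total, pmf = 0.0, math.exp(-theta)
    for y in range(terms):
        total += pmf / (1 + y)
        pmf *= theta / (y + 1)
    return total

pi = 0.01                              # illustrative sampling fraction
fitted_means = [0.002, 0.01, 0.05]     # hypothetical fitted mu_k for three sample-unique cells
lams = [mu / pi for mu in fitted_means]

# Global estimates are sums of the per-record measures over sample uniques.
tau1_hat = sum(prob_population_unique(lam, pi) for lam in lams)
tau2_hat = sum(expected_match_prob(lam, pi) for lam in lams)
print(round(tau1_hat, 4), round(tau2_hat, 4))
```

A fitted $\hat{\lambda}_k$ of 0 gives a risk of exactly 1 for that record under both measures, which is the boundary case to watch for in sparse tables.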

Disclosure Risk in Sample Microdata
Criteria related to tests for over- and under-dispersion:
- Over-fitting: sample marginal counts produce too many random zeros, leading to large expected cell counts for non-zero cells and hence under-estimation of risk
- Under-fitting: sample marginal counts do not allow for structural zeros, leading to small expected cell counts for non-zero cells and hence over-estimation of risk
The criteria select the model using a forward search algorithm which minimizes the bias

Disclosure Risk Assessment Example
- Population N = 944,793 from the UK 2001 Census
- SRS sample size n = 9,448
- Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10), giving 412,080 cells
Model selection:
- Starting solution: the main-effects log-linear model indicates under-fitting (minimum error statistics too large)
- Add in higher interaction terms until the minimum error statistics indicate fit
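The cell count is the product of the category counts of the six key variables. The short check below reproduces it from the figures on the slide and adds the sampling fraction and the average number of sampled records per cell, which shows how sparse the table is. The variable names are illustrative.

```python
from math import prod

# Category counts of the key variables, as listed in the example.
key_categories = {
    "Area": 2,
    "Sex": 2,
    "Age": 101,
    "Marital Status": 6,
    "Ethnicity": 17,
    "Economic Activity": 10,
}

K = prod(key_categories.values())   # number of cells in the cross-classification
N, n = 944_793, 9_448               # population and sample sizes from the example

print(K)                  # 412080
print(round(n / N, 4))    # 0.01
print(round(n / K, 4))    # average sampled records per cell
```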

Model Search Example (SRS n=9,448)
Key variables: Area (ar), Sex (s), Age (a), Marital Status (m), Ethnicity (et), Economic Activity (ec)
True values: [values missing]
Columns: two estimated global risk measures, then the two minimum error statistics
Model                    Risk est. 1   Risk est. 2   Error stat. 1   Error stat. 2
Independence - I            386.6         701.2          48.54          114.19
All 2 way - II              104.9         280.1          -1.57           -2.65
1: I + {a*ec}               243.4         494.3          54.75           59.22
2: 1 + {a*et}               180.1         411.6           3.07            9.82
3: 2 + {a*m}                152.3         343.3           0.88            1.73
4: 3 + {s*ec}               149.2         337.5           0.26            0.92
5a: 4 + {ar*a}              148.5         337.1          -0.01            0.84
5b: 4 + {s*m}               147.7         335.3           0.02            0.66
6b: 5b + {ar*a}             147.0         335.0          -0.24            0.56
6c: 5b + {ar*m}             148.9                        -0.04            0.72
6d: 5b + {m*ec}             146.3         331.4                           0.03
7c: 6c + {m*ec}             147.5         333.2          -0.34            0.06
7d: 6d + {ar*a}             145.6         331.0          -0.44           -0.03

Model Search Example
- Preferred model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}
- True global risk: [values missing]
- Estimated global risk: [values missing]
[Figure: true risk measure and estimated per-record risk measure, log scale]

Statistical Disclosure Control Methods
Agencies limit the risk of identification through statistical disclosure control (SDC) methods:
- Non-perturbative: sub-sampling, recoding and collapsing categories of key variables, deleting variables
- Perturbative: data swapping, additive noise, misclassification (PRAM) and synthetic data

Disclosure Risk in Perturbed Microdata
- The model so far assumes no misclassification errors, whether arising from data processes or purposely introduced for SDC
- Shlomo and Skinner (2010) address misclassification (perturbation) errors
- Let $M_{kj} = \Pr(\tilde{X} = k \mid X = j)$, where the cross-classified key variables are:
  - $X$ in the population: fixed
  - $\tilde{X}$ in the microdata: subject to misclassification (perturbation)

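Disclosure Risk in Perturbed Microdata
- The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification, for a record in cell $j$ of the perturbed key, with $M_{jk} = \Pr(\tilde{X} = j \mid X = k)$, population counts $F_k$ and sampling fraction $\pi$:
  $\frac{M_{jj}}{1 - \pi M_{jj}} \Big/ \sum_k \frac{F_k M_{jk}}{1 - \pi M_{jk}}$  (1)
- For small misclassification and small sampling fractions: $M_{jj} / \sum_k F_k M_{jk}$ or $M_{jj} / \tilde{F}_j$  (2), where $\tilde{F}_j$ is the population count in cell $j$ of the perturbed key
- Global measure: $\tau = \sum_j I(\tilde{f}_j = 1)\, M_{jj} / \tilde{F}_j$, estimated by $\hat{\tau} = \sum_j I(\tilde{f}_j = 1)\, M_{jj}\, \hat{E}(1/\tilde{F}_j \mid \tilde{f}_j = 1)$  (3), where the per-record risk is $M_{jj}\, \hat{E}(1/\tilde{F}_j \mid \tilde{f}_j = 1)$ and $\tilde{f}_j$ is the sample count in cell $j$ of the perturbed key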
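A sketch of how measure (1) and the diagonal approximation (2) could be computed. It assumes that (1) takes the form $[M_{jj}/(1 - \pi M_{jj})] / \sum_k F_k M_{jk}/(1 - \pi M_{jk})$ with $M_{jk} = \Pr(\tilde{X} = j \mid X = k)$, and that (2) drops the sampling-fraction terms. The matrix, population counts, sampling fraction and function names are all illustrative assumptions.

```python
def risk_exact(M, F, pi, j):
    """Measure (1), assumed form: match probability for a record unique in perturbed cell j."""
    numerator = M[j][j] / (1.0 - pi * M[j][j])
    denominator = sum(F[k] * M[j][k] / (1.0 - pi * M[j][k]) for k in range(len(F)))
    return numerator / denominator

def risk_diagonal(M, F, j):
    """Approximation (2), assumed form: the same ratio without the sampling-fraction terms."""
    return M[j][j] / sum(F[k] * M[j][k] for k in range(len(F)))

# Hypothetical 3-category key variable: M[j][k] = Pr(perturbed = j | original = k).
M = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
F = [5, 20, 50]      # illustrative population counts
pi = 0.01            # illustrative sampling fraction

exact = [risk_exact(M, F, pi, j) for j in range(3)]
approx = [risk_diagonal(M, F, j) for j in range(3)]
for j in range(3):
    print(j, round(exact[j], 5), round(approx[j], 5))
```

With no misclassification (identity matrix) both expressions reduce to $1/F_j$, the usual match probability for a sample unique.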

Perturbation Example
- Population of individuals from the 2001 United Kingdom (UK) Census, N = 1,468,255
- 1% SRS sample, n = 14,683
- Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10), giving K = 538,560

Perturbation Example
- Record swapping: LAD swapped randomly, e.g. for a 20% swap: diagonal: [formula]; off-diagonal: [formula], where $n_k$ is the number of records in the sample from LAD k
- PRAM: LAD misclassified, e.g. for a 20% misclassification: off-diagonal: [formula]; parameter: [formula]
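As an illustration of PRAM on a single key variable, the sketch below assumes that a 20% misclassification means each record keeps its LAD with probability 0.8 and otherwise moves to one of the other LADs with equal probability. The uniform off-diagonal is an assumption made here, since the matrix entries used in the example are not available, and the LAD codes are synthetic.

```python
import random

def pram_matrix(num_categories, change_prob):
    """Transition matrix with 1 - change_prob on the diagonal and the rest spread evenly."""
    off = change_prob / (num_categories - 1)
    return [[1.0 - change_prob if a == b else off for b in range(num_categories)]
            for a in range(num_categories)]

def apply_pram(values, matrix, rng):
    """Replace each category code by a draw from its row of the transition matrix."""
    categories = range(len(matrix))
    return [rng.choices(categories, weights=matrix[v])[0] for v in values]

rng = random.Random(7)
num_lads = 11                                                 # 11 Local Authorities, as in the example
original = [rng.randrange(num_lads) for _ in range(20_000)]   # synthetic LAD codes
matrix = pram_matrix(num_lads, 0.2)
perturbed = apply_pram(original, matrix, rng)

changed = sum(o != p for o, p in zip(original, perturbed)) / len(original)
print(f"share of records with a changed LAD: {changed:.3f}")
```

Record swapping differs in that values are exchanged between pairs of records, so the marginal LAD counts are preserved exactly, whereas PRAM preserves them only in expectation for suitable matrices.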

Perturbation Example
Random 20% perturbation on LAD. Global risk measures: expected correct matches on sample uniques (SUs)
Global risk measure                                         PRAM     Swapping
True risk measure in original sample                        358.1    362.4
Estimated naïve risk measure ignoring misclassification     349.5    358.6
Risk measure on non-perturbed records                       292.2    292.8
Risk measure under misclassification (1)                    299.7    298.9
Sample uniques                                              2,779    2,831
Approximation based on diagonals (2)                        299.8
Estimated risk measure under misclassification (3)          283.1    286.8
Expected correct match per sample unique: PRAM 0.108; record swapping 0.106
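The expected correct match per sample unique appears to be risk measure (1) divided by the number of sample uniques. The check below reproduces the two reported values from the figures in the table. The dictionary layout and names are illustrative.

```python
# Figures from the table: risk measure under misclassification (1) and sample uniques.
results = {
    "PRAM": {"risk_measure_1": 299.7, "sample_uniques": 2779},
    "Record swapping": {"risk_measure_1": 298.9, "sample_uniques": 2831},
}

per_unique = {
    method: r["risk_measure_1"] / r["sample_uniques"]
    for method, r in results.items()
}

for method, value in per_unique.items():
    print(f"{method}: {value:.3f}")
# PRAM: 0.108
# Record swapping: 0.106
```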

Perturbation Example
Estimating individual per-record risk measures for a 20% random swap based on log-linear modelling
[Figure: risk measure (1) and estimated risk measure (3), log scale]

Disclosure Risk in Subpopulation Microdata
- Until now, the sample has been a random subset of the population and the ‘intruder’ knows that an individual can be linked
- New problem: the microdata contains members of a subpopulation and membership is not known
- Example: the subpopulation refers to all persons with a medical condition (where membership is sensitive)

Disclosure Risk in Subpopulation Microdata
- The subpopulation is not necessarily representative of the population
- As with a sample, the ‘intruder’ matches a record in the subpopulation to an individual in the population of which the subpopulation is a subset (assume no measurement errors)
- Use sample microdata to make inference about population uniqueness in the subpopulation or in a sample from the subpopulation

Disclosure Risk in Subpopulation Microdata
- Assume the $F_k$ are unknown, and assume a random sample s from the population U
- Let $f_k$ denote the random sample frequency in cell k
- Let $F'_k$ ($f'_k$) denote the sub-population (sample) frequency in cell k
Two scenarios:
- Assess disclosure risk in the subpopulation: [formula]
- Assess disclosure risk in a sample from the subpopulation: [formula]

Disclosure Risk in Subpopulation Microdata
- Following Skinner and Shlomo (2008): $F_k \sim \text{Poisson}(\lambda_k)$, where the log-linear model holds: $\log \lambda_k = x_k' \beta$
- Assume that within cell k, Y takes the value 1 with probability $p_k$, independently for each of the $F_k$ units, so that $F'_k = \sum_{i \in k} Y_i$ and the $F'_k$ are binomially distributed: $F'_k \mid F_k \sim \text{Bin}(F_k, p_k)$
- Assume that $p_k$ follows a logistic model: $\text{logit}(p_k) = x_k' \gamma$ (may assume different X variables)
- Assume a Bernoulli sample design, which preserves the Poisson distribution for sample counts
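The membership model can be illustrated by simulation for a single cell. The sketch below assumes $F \sim \text{Poisson}(\lambda)$ and $F' \mid F \sim \text{Bin}(F, p)$ with $p$ given by a logistic function. The coefficient values, $\lambda = 4$ and the function names are hypothetical.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def logistic(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

rng = random.Random(11)
lam = 4.0                        # illustrative Poisson mean for the cell
p = logistic(-1.5 + 0.8 * 1.0)   # hypothetical logistic model: intercept -1.5, slope 0.8, x = 1

pairs = []
for _ in range(50_000):
    F = poisson_draw(rng, lam)
    F_sub = sum(1 for _ in range(F) if rng.random() < p)   # F' | F ~ Bin(F, p)
    pairs.append((F, F_sub))

# Binomial thinning of a Poisson count should give mean p * lam for F'.
mean_sub = sum(s for _, s in pairs) / len(pairs)

# Share of subpopulation uniques (F' = 1) that are also population uniques (F = 1).
sub_uniques = [F for F, s in pairs if s == 1]
share_pop_unique = sum(1 for F in sub_uniques if F == 1) / len(sub_uniques)

print(f"mean F' = {mean_sub:.3f}, target = {p * lam:.3f}")
print(f"P(F = 1 | F' = 1) = {share_pop_unique:.3f}, target = {math.exp(-lam * (1 - p)):.3f}")
```

The second target uses the same thinning argument as for Bernoulli sampling: given $F' = 1$, the number of non-members in the cell is Poisson with mean $\lambda (1 - p)$.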

Expressions for Risk Measures
- [formula] since [formula], assuming [formula] independent of [formula]
- For the case of a sample from the subpopulation, assume that [formula]
- It follows that [formula] and [formula], assuming that [formula] is independent of [formula]

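Estimation of Risk Measures: 2-step approach
- Estimate [formula] from [formula] as in Skinner and Shlomo (2008)
- Fix [formula] and [formula], since [formula] and [formula]
- Since [formula], estimate [formula] by the method of scoring, treating each of [formula] and [formula] as functions of [formula] (and [formula] fixed)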

Simulation Study
- Population size N = 1,163,659 (UK Census)
- Subpopulation: those with long-term illness, N' = 207,537
- Key: Geography (6) * Age group (14) * Sex (2) * Marital Status (6) * Ethnicity (16) * Economic Activity (10), giving K = 161,280
- Step 1: Draw a simple random sample of size n = 23,273
- Step 2: Draw samples from the subpopulation with different sampling fractions: [values missing]
- Step 3: Estimate parameters [formula] and [formula] from the score function with fixed [formula]
- Step 4: Repeat 500 times

Simulation Study
Sample fraction   Population and subpopulation uniques   Estimate   Percent relative difference
1:20              142                                    213        -50.00%
1:10              266                                    392.2      -47.44%
1:5               561                                    714.7      -27.40%
1:2               1335                                   1530.7     -14.66%
1:1               2721                                   2766       -1.65%
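The reported percentages are consistent with the convention 100 × (truth − estimate) / truth, so a negative value means the estimate exceeds the true count. The check below recomputes each row from the table. The function and variable names are illustrative.

```python
def percent_relative_difference(truth, estimate):
    """100 * (truth - estimate) / truth; negative when the estimate is too high."""
    return 100.0 * (truth - estimate) / truth

# (sample fraction, true uniques, estimate, reported percent relative difference)
rows = [
    ("1:20", 142, 213.0, -50.00),
    ("1:10", 266, 392.2, -47.44),
    ("1:5", 561, 714.7, -27.40),
    ("1:2", 1335, 1530.7, -14.66),
    ("1:1", 2721, 2766.0, -1.65),
]

computed = [percent_relative_difference(truth, est) for _, truth, est, _ in rows]
for (fraction, _, _, reported), value in zip(rows, computed):
    print(f"{fraction}: computed {value:.2f}%, reported {reported:.2f}%")
```

The over-estimation shrinks as the sampling fraction from the subpopulation grows, from 50% at 1:20 to under 2% when the whole subpopulation is used.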

Simulation Study
- Step 5: Generate the population and subpopulation from known parameters according to the assumptions of the model
- Step 6: Draw a sample from the ‘synthetic’ population
- Step 7: Draw a sample from the ‘synthetic’ subpopulation: [values missing]
- Step 8: Estimate parameters [formula] and [formula] from the score function, keeping [formula] fixed in the 2-step approach
- Step 9: Repeat 500 times

Simulation Study
                                        Truth     Estimate
Population and subpopulation uniques    142       141
Population zeros                        132567    132482
Population ones                         10988     11030
Subpopulation zeros                     148172    148091
Subpopulation ones                      6395      6398

Discussion
Future work:
- The 2-step approach results in subpopulation uniques with an estimated parameter of 0 from step 1, and hence large over-estimation of risk measures (exp(0)=1)
- Develop a modelling framework for estimating the joint likelihood of [formula] and [formula] under an EM algorithm, where the ‘missing’ data are those sample units in the sub-population
- Develop goodness-of-fit criteria for selecting models which minimize the bias of estimates

Thank you for your attention