Assessing Disclosure Risk in Sample Microdata Under Misclassification

Slides:



Advertisements
Similar presentations
Pattern Recognition and Machine Learning
Advertisements

Estimating Identification Risks for Microdata Jerome P. Reiter Institute of Statistics and Decision Sciences Duke University, Durham NC, USA.
Statistical Disclosure Control (SDC) for 2011 Census Progress Update Keith Spicer – ONS SDC Methodology 23 April 2009.
Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University
1 Parametric Empirical Bayes Methods for Microarrays 3/7/2011 Copyright © 2011 Dan Nettleton.
WP 9 Assessing Disclosure Risk in Microdata using Record Level Measures Natalie Shlomo University of Southampton Office for National Statistics
SDC for continuous variables under edit restrictions Natalie Shlomo & Ton de Waal UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006.
Estimation  Samples are collected to estimate characteristics of the population of particular interest. Parameter – numerical characteristic of the population.
Fast Bayesian Matching Pursuit Presenter: Changchun Zhang ECE / CMR Tennessee Technological University November 12, 2010 Reading Group (Authors: Philip.
1 A Common Measure of Identity and Value Disclosure Risk Krish Muralidhar University of Kentucky Rathin Sarathy Oklahoma State University.
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Model Assessment, Selection and Averaging
What is Statistical Modeling
Analysis of frequency counts with Chi square
Maximum likelihood (ML) and likelihood ratio (LR) test
Mutual Information Mathematical Biology Seminar
Maximum likelihood Conditional distribution and likelihood Maximum likelihood estimations Information in the data and likelihood Observed and Fisher’s.
Statistical Methods Chichang Jou Tamkang University.
Sampling.
Maximum likelihood (ML) and likelihood ratio (LR) test
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
Methods of Geographical Perturbation for Disclosure Control Division of Social Statistics And Department of Geography Caroline Young Supervised jointly.
Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org.
Linear and generalised linear models
Maximum likelihood (ML)
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
11 Comparison of Perturbation Approaches for Spatial Outliers in Microdata Natalie Shlomo* and Jordi Marés** * Social Statistics, University of Manchester,
Geo479/579: Geostatistics Ch13. Block Kriging. Block Estimate  Requirements An estimate of the average value of a variable within a prescribed local.
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck Jennifer Huckett Iowa State University June 20, 2007.
The Application of the Concept of Uniqueness for Creating Public Use Microdata Files Jay J. Kim, U.S. National Center for Health Statistics Dong M. Jeong,
Model Inference and Averaging
PROBABILITY (6MTCOAE205) Chapter 6 Estimation. Confidence Intervals Contents of this chapter: Confidence Intervals for the Population Mean, μ when Population.
1 Tel Aviv April 29th, 2007 Disclosure Limitation from a Statistical Perspective Natalie Shlomo Dept. of Statistics, Hebrew University Central Bureau of.
WP. 46 Providing access to data and making microdata safe, experiences of the ONS Jane Longhurst Paul Jackson ONS.
1 Statistical Disclosure Control Methods for Census Outputs Natalie Shlomo SDC Centre, ONS January 11, 2005.
1 Introduction to Survey Data Analysis Linda K. Owens, PhD Assistant Director for Sampling & Analysis Survey Research Laboratory University of Illinois.
1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton.
The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.
Danila Filipponi Simonetta Cozzi ISTAT, Italy Outlier Identification Procedures for Contingency Tables in Longitudinal Data Roma,8-11 July 2008.
WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census Natalie Shlomo University of Southampton Office for National Statistics.
1 IPAM 2010 Privacy Protection from Sampling and Perturbation in Surveys Natalie Shlomo and Chris Skinner Southampton Statistical Sciences Research Institute.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
ICCS 2009 IDB Workshop, 18 th February 2010, Madrid 1 Training Workshop on the ICCS 2009 database Weighting and Variance Estimation picture.
Eurostat Weighting and Estimation. Presented by Loredana Di Consiglio Istituto Nazionale di Statistica, ISTAT.
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Using Targeted Perturbation of Microdata to Protect Against Intelligent Linkage Mark Elliot, University of Manchester Cathie.
1 WP 10 On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation Natalie Shlomo Hebrew University Southampton University.
Creating Open Data whilst maintaining confidentiality Philip Lowthian, Caroline Tudor Office for National Statistics 1.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
1 1 Confidentiality protection of large frequency data cubes UNECE Workshop on Statistical Confidentiality Ottawa October 2013 Johan Heldal and Svetlana.
Privacy-preserving data publishing
Introduction to Survey Sampling
Machine Learning 5. Parametric Methods.
ICCS 2009 IDB Seminar – Nov 24-26, 2010 – IEA DPC, Hamburg, Germany Training Workshop on the ICCS 2009 database Weights and Variance Estimation picture.
Census 2011 – A Question of Confidentiality Statistical Disclosure control for the 2011 Census Carole Abrahams ONS Methodology BSPS – York, September 2011.
Reconciling Confidentiality Risk Measures from Statistics and Computer Science Jerry Reiter Department of Statistical Science Duke University.
11 Measuring Disclosure Risk and Data Utility for Flexible Table Generators Natalie Shlomo, Laszlo Antal, Mark Elliot University of Manchester
Estimating standard error using bootstrap
Natalie Shlomo Social Statistics, School of Social Sciences
Assessing Disclosure Risk in Microdata
Ch3: Model Building through Regression
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
The European Statistical Training Programme (ESTP)
Chapter 8: Weighting adjustment
Pairwise Sequence Alignment (cont.)
Classification Trees for Privacy in Sample Surveys
New Techniques and Technologies for Statistics 2017  Estimation of Response Propensities and Indicators of Representative Response Using Population-Level.
Chapter 13: Item nonresponse
Presentation transcript:

Assessing Disclosure Risk in Sample Microdata Under Misclassification Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton N.Shlomo@soton.ac.uk This is joint work with Prof Chris Skinner 1

Topics Covered Introduction and motivation Disclosure risk assessment for sample microdata: Probabilistic modelling extended for misclassification Probabilistic record linkage – linking the frameworks Risk-utility framework for microdata subject to misclassification Discussion 2

Introduction Disclosure risk scenario: ‘intruder’ attack on microdata through linking to available public data sources Linkage via identifying key variables common to both sources, eg. sex, age, region, ethnicity Agencies limit risk of identification through statistical disclosure limitation (SDL) methods: Non-perturbative – sub-sampling, recoding and collapsing categories of key variables, deleting variables Perturbative – data swapping, additive noise, misclassification (PRAM) and synthetic data 3

Introduction Need to quantify the risk of identification to inform microdata release Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables Population counts in contingency tables spanned by key variables unknown Distribution assumptions to draw inference from the sample for estimating population parameters Assumes that microdata has not been altered either through misclassification arising from data processes or purposely introduced for SDL 4

Introduction Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation For perturbative methods of SDL, risk assessment typically based on probabilistic record linkage Conservative assessment of risk of identification Assumes that intruder has access to original dataset and does not take into account protection afforded by sampling Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables 5

Probabilistic Modelling – no misclassification Disclosure Risk Assessment Probabilistic Modelling – no misclassification Let denote a q-way frequency table which is a sample from a population table where indicates a cell population count and sample count in cell Disclosure risk measure: For unknown population counts, estimate from the conditional distribution of

Disclosure Risk Assessment Natural assumption: Bernoulli sampling: It follows that: and where are conditionally independent is the sampling fraction in cell k

Disclosure Risk Assessment Skinner and Holmes, 1998, Elamir and Skinner, 2006 use log linear models to estimate parameters Sample frequencies are independent Poisson distributed with a mean of Log-linear model for estimating expressed as: where design matrix of key variables and their interactions MLE’s calculated by solving score function:

Disclosure Risk Assessment Fitted values calculated by: and Individual risk measures estimated by: Skinner and Shlomo (2009) develop goodness of fit criteria which minimize the bias of disclosure risk estimates, for example, for

Disclosure Risk Assessment Criteria related to tests for over and under-dispersion: over-fitting - sample marginal counts produce too many random zeros, leading to expected cell counts too high for non-zero cells and under-estimation of risk under-fitting - sample marginal counts don’t take into account structural zeros, leading to expected cell counts too low for non-zero cells and over-estimation of risk Criteria selects the model using a forward search algorithm which minimizes for where is the variance of

Disclosure Risk Assessment Example: Population of 944,793 from UK 2001 Census SRS sample size 9,448 Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10) - 412,080 cells Model Selection: Starting solution: main-effects log-linear model which indicates under-fitting (minimum error statistics too large) Add in higher interaction terms until minimum error statistics indicate fit

Model Search Example (SRS n=9,448) True values , Area–ar, Sex-s, Age–a, Marital Status–m, Ethnicity–et, and Economic Activity-ec Independence - I 386.6 701.2 48.54 114.19 All 2 way - II 104.9 280.1 -1.57 -2.65 1: I + {a*ec} 243.4 494.3 54.75 59.22 2: 1 + {a*et} 180.1 411.6 3.07 9.82 3: 2 + {a*m} 152.3 343.3 0.88 1.73 4: 3 + {s*ec} 149.2 337.5 0.26 0.92 5a: 4 + {ar*a} 148.5 337.1 -0.01 0.84 5b: 4 + {s*m} 147.7 335.3 0.02 0.66 6b: 5b + {ar*a} 147.0 335.0 -0.24 0.56 6c: 5b + {ar*m} 148.9 -0.04 0.72 6d: 5b + {m*ec} 146.3 331.4 0.03 7c: 6c + {m*ec} 147.5 333.2 -0.34 0.06 7d: 6d + {ar*a} 145.6 331.0 -0.44 -0.03 ,

Model Search Example Preferred Model: {a*ec}{a*et}{a*m}(s*ec}{ar*a} True Global Risk: Estimated Global Risk Log-scale True risk measure Estimated per-record risk measure

Disclosure Risk Assessment Skinner and Shlomo, 2009 address complex survey designs: Sampling clusters introduce dependencies - key variables cut across clusters and assumption holds in practice Stratification – include strata id in key variables to account for differential inclusion probabilities Survey weights - Use pseudo maximum likelihood estimation where score function modified to: changed to: where Partition large tables and assess partitions separately (assumes that partitioning variable has an interaction with other key variables)

Disclosure Risk Assessment Under Misclassification Model assumes no misclassification errors either arising from data processes or purposely introduced for SDL Shlomo and Skinner, 2010 address misclassification errors Let: where cross-classified key variables: in population fixed in microdata subject to misclassification

Disclosure Risk Assessment Under misclassification The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification: (1) For small misclassification and small sampling fractions: or (2) Global measure: estimated by: (3) where per-record risk:

Perturbation Methods of SDL PRAM ( Post-randomisation method) Probability transition matrix containing conditional probabilities for a category c: Let T be a vector of frequencies On each record, category c changed or not changed according to and the result of a draw of a random variate u vector of perturbed frequencies Unbiased moment estimator of the original data: assuming has an inverse (dominant on the diagonals)

Perturbation Methods of SDL PRAM ( Post-randomisation method) - cont. Invariant PRAM - Define: (vector of the original frequencies eigenvector of ) To ensure correct non-perturbation probability on diagonal, define Expected values of marginal distribution preserved Exact marginal distribution preserved using a without replacement selection strategy to select records for perturbation

Random Record Swapping Perturbation Methods of SDL Random Record Swapping Exchange values of a key variable between pairs of records Pairs selected within control strata to minimize bias Typically geographical variable is swapped within a large area: Geography highly identifiable Conditional independence assumption usually met (sensitive variables relatively independent of geography) Does not produce inconsistent records Marginal distributions preserved at higher geographies Can be targeted to high risk records 19

Misclassification Example Population of individuals from 2001 United Kingdom (UK) Census N=1,468,255 1% srs sample n=14,683 Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10) K=538,560. 20

Misclassification Example Record Swapping: LAD swapped randomly, eg. for a 20% swap: Diagonal: Off diagonal: where is the number of records in the sample from LAD k Pram: LAD misclassified, eg. for a 20% misclassification Off diagonal: Parameter: 21

Misclassification Example Random 20% perturbation on LAD Global risk measures: Expected correct matches from SU’s Global Risk Measure PRAM Swapping True risk measure in original sample 358.1 362.4 Estimated naïve risk measure ignoring misclassification 349.5 358.6 Risk measure on non-perturbed records 292.2 292.8 Risk measure under misclassification (1) Sample uniques 299.7 2,779 298.9 2,831 Approximation based on diagonals (2) 299.8 Estimated risk measure under misclassification (3) 283.1 286.8 Expected correct match per sample unique: Pram: 10.8% Record swapping: 10.6%

Misclassification Example Estimating individual per-record risk measures for 20% random swap based on log linear modelling (log scale): From perspective of intruder, difficult to identify high risk (population unique) records Risk Measure (1) Estimated Risk Measure (3) 23

Information Loss Measures Utility measured by whether inference can be carried out on perturbed data similar to original data Use proxy information loss measures on distributions calculated from microdata: Distance Metrics: where number of cells in distribution Let measure of average absolute perturbation compared to average cell size Also, can consider Kolmogorov-Smirnov statistic, Hellinger’s Distance and relative differences in means or variances

Pram record swapping Risk-Utility Map Random perturbation versus Targeted perturbation on non-white ethnicities 25

Probabilistic Record Linkage value of vector of cross-classified identifying key variables for unit in the microdata ( ) corresponding value for unit in the external database ( ) ( ) Misclassification mechanism via probability matrix: Comparison vector for pairs of units For subset partition set of pairs in Matches (M) Non-matches (U) through likelihood ratio: where 26

Probabilistic Record Linkage probability that pair is in M Probability of a correct match: Estimate parameters using previous test data or EM algorithm and assuming conditional independence 27

Linking the Two Frameworks No misclassification Non-match Match Total Disagree Agree n Nn 28

Linking the Two Frameworks With misclassification, denote Non-match Match Total Disagree Agree 29

Linking the Two Frameworks Matching 2,853 sample uniques to the population and blocking on all key variables except LAD result in 1,534,293 possible pairs On average across blocks, probability of a correct match given an agreement on LAD Non-match Match Total Disagree LAD 1,388,069 619 1,388,688 Agree LAD 143,321 2,234 145,555 1,531,390 2,853 1,534,293 30

Linking the Two Frameworks Probability of a correct match given an agreement for each Compare to risk measure Summing over the global disclosure risk measure of 289.5. 31

Discussion Global disclosure risk measures accurately estimated for a risk-utility assessment assuming known non-misclassification probability Empirical evidence of connection between F&S record linkage and probabilistic modelling for estimating identification risk Estimation carried out through log linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage Individual disclosure risk measures more difficult to estimate without knowing true population parameters in both frameworks From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques 32

Discussion Statistical Agencies (MRP) need to: - Assess disclosure risk objectively - Set tolerable risk thresholds according to different access modes - Optimize and combine SDL techniques - Provide guidelines on how to analyze disclosure controlled datasets Future dissemination strategies presents new challenges: - Synthetic data for web access prior to accessing real data - Online SDL techniques for flexible table generating software and remote access - Auditing query systems Bridge the Statistical and Computer Science literature on privacy preserving algorithms

Thank you for your attention