Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University

General setting
- Original microdata, O.
- Candidate release, M (1:1 link with O).
- Provide some measure of the quality of M for specific analyses, labeled FM.
- Released dynamically for analyses submitted to a "verification server."

Possible measures (think regression coefficients)
- Confidence interval overlap: compute the 95% CI for Q with O, compute the 95% CI for Q with M, and measure the overlap of the two intervals.
- Distance between Q(M) and Q(O): absolute (relative) error; percentage change in variance.
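The interval-overlap idea can be sketched as follows. This is one common formulation (averaging the overlap as a fraction of each interval's length); the function name and the 0.5 averaging convention are illustrative choices, not taken from the slide.

```python
def ci_overlap(ci_orig, ci_release):
    """Interval-overlap fidelity: 1.0 for identical intervals, 0.0 for
    disjoint ones. Averages the overlap as a fraction of each interval."""
    lo_o, hi_o = ci_orig
    lo_m, hi_m = ci_release
    lo_int = max(lo_o, lo_m)   # intersection of the two intervals
    hi_int = min(hi_o, hi_m)
    if hi_int <= lo_int:       # intervals do not overlap
        return 0.0
    overlap = hi_int - lo_int
    return 0.5 * (overlap / (hi_o - lo_o) + overlap / (hi_m - lo_m))
```

With this convention, identical intervals score 1, disjoint intervals score 0, and partial overlap falls in between, exactly the 0-to-1 scale the later slides assume.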

Possible measures (think regression coefficients)
- Posterior predictive p-value, computed for (pre-specified?) statistics.
- Goodness-of-fit measures: PRESS or other predictive statistics for D; value of the likelihood or AIC/BIC in D.
- Others...

General confidentiality risks
- Backsolving risk: if FM is exact, the analyst might determine elements of O from FM and Q(M).
- Prediction risk: if the utility measure indicates a small difference, the analyst might closely estimate elements of O from FM and Q(M).

Confidentiality attacks for different SDL methods
- Data swapping: try all possible swaps until the utility measure reports an exact match for some Q that depends on all (swapped) variables.
- Adding noise: if the noise has constant variance and the dataset is large, recover the noise variance by division. Increase individual records' values in some Q until the utility measure reveals a close fit.
- Recoding / suppression: request utility measures for tabular-like Q to deduce individual values.
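A minimal sketch of the noise-recovery attack, with hypothetical numbers, assuming the fidelity feedback effectively reveals the variance of Q on both O and M. Since the agency adds independent constant-variance noise, Var(M) = Var(O) + sigma^2, and the intruder solves for sigma:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y_orig = rng.normal(50.0, 10.0, size=n)            # hypothetical original values
sigma = 3.0                                         # agency's secret noise scale
y_masked = y_orig + rng.normal(0.0, sigma, size=n)

# If fidelity feedback effectively reveals the variance on both O and M,
# the intruder exploits Var(M) = Var(O) + sigma^2 and solves for sigma.
sigma_hat = (y_masked.var() - y_orig.var()) ** 0.5
```

With a large dataset the sampling error in the two variances is small, so `sigma_hat` lands very close to the secret noise scale.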

Confidentiality attacks for different SDL methods
General attack strategies:
- Turn microdata into tabular data.
- Create indicator variables out of continuous variables for tail-area events, where the limits change slightly.

Example of attacks: setting
- Suppose the agency uses a 3% data swap of five categorical quasi-identifiers, X.
- When a record is selected for swapping, all of its X is swapped with the X of another record.
- Other values, Y, are not swapped.

Example continued
- User asks for the quality of:
  - coefficients in a linear regression of Y on X;
  - the sample mean of Y;
  - the mean of Y for a small subpopulation defined by X;
  - a contingency table analysis with X.
- Fidelity measure is confidence interval overlap (not coarsened): perfect overlap is 1, no overlap is 0.

Intruder attacks
- Find out which records were swapped: try different records in M for some query involving a subset of the data. If FM = 1, assume the records were not swapped.
- Find out the values of swapped X: find at least one unswapped record for each level of a univariate X; add the target record with its swapped X; let Q be the frequency of each cell for the selected records. The correct value is the one, not equal to the swapped value, for which FM < 1.
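The swap-detection step can be illustrated with a stylized toy. Everything here is hypothetical: the quasi-identifier values, the swapped pair, and the oracle `fm_exact` standing in for the verification server (0.5 stands in for any partial CI overlap below 1).

```python
n = 100
x_orig = [i % 5 for i in range(n)]     # hypothetical quasi-identifier X
x_masked = list(x_orig)
i, j = 10, 17                          # the pair the agency swapped (secret)
x_masked[i], x_masked[j] = x_masked[j], x_masked[i]

def fm_exact(record_ids):
    """Toy verification server: FM = 1 when a query over the chosen records
    answers identically on O and M (here: none of their X values changed)."""
    untouched = all(x_orig[r] == x_masked[r] for r in record_ids)
    return 1.0 if untouched else 0.5   # 0.5 stands in for partial overlap

# Probe records one at a time: FM < 1 flags exactly the swapped records.
flagged = [r for r in range(n) if fm_exact([r]) < 1.0]
```

Because the fidelity measure is uncoarsened, a single probe per record cleanly separates swapped from unswapped records.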

Another example
- To the previous scenario, add top-coding to protect the upper 1% tail of a continuous Y1.
- Also add noise to all values of some Y2.
- Users want regressions, means, and confidence intervals for quantiles.

Intruder attacks
- Order the records by values of Y1: form a data set with one top-coded record and many non-top-coded records; let Q be the sample mean and obtain FM; repeat for all top-coded records; order the records by FM.
- Estimate the values of Y1 in the upper 1%: repeat the above strategy, trying values of the true Y1 until FM is recreated.

Intruder attacks
- Estimate the values of Y1 in the upper 1%: form a group of non-top-coded records whose Y1 is transformed so that it equals zero (or any single number); add one top-coded record; get FM for the mean of Y1 over these values; try values of the true Y1 until FM is recreated.
- The same strategy applies to learning the values of Y2 with added noise.
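The arithmetic of this zeroed-group attack is easy to verify with hypothetical numbers, assuming a distance-type FM = |Q(M) - Q(O)| released exactly and Q = sample mean:

```python
topcode = 100.0    # published top-code for Y1 (larger values reported as 100)
true_y1 = 137.5    # the target record's secret Y1
k = 9              # untouched records the intruder transforms to Y1 = 0

# Query: mean of Y1 over the k zeroed records plus the one top-coded target.
q_masked = topcode / (k + 1)    # answer on M (target reported at the top-code)
q_orig = true_y1 / (k + 1)      # answer on O (target at its true value)
fm = abs(q_masked - q_orig)     # distance-type fidelity the server returns

# Top-coding only lowers values, so the intruder inverts FM directly.
y1_hat = topcode + fm * (k + 1)
```

Because the zeroed records contribute identically on O and M, the fidelity value depends only on the one target record and inverts exactly.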

Reducing the risks of these attacks
- Limit what is released: report something other than the exact FM; coarsen or add noise to the measures before release; do not release FM for some Q.
- Limit what is answerable: do not allow copying or transposing of data; do not allow arbitrary transformations of the data.

Limiting queries
- Minimum sample sizes for queries.
- Automatic feedback for a set of common transformations; route all others through disclosure review. (Counters attacks based on unusual transformations that are unlikely to be legitimate analyses.)

Coarsening FM measures
- Report FM rounded, for example to the nearest 0.05 for CI overlap. Hard to know which values are safe yet useful for generic Q.
- Add noise to the FM measure. Same difficulty!
- One approach: create "acceptable" bounds around each true value, and choose the rounding/noise so that those bounds remain feasible under an attack strategy.
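A minimal sketch of the rounding-plus-noise coarsening, kept on the [0, 1] scale of the CI-overlap measure; the function name, grid default, and Gaussian jitter are illustrative choices:

```python
import random

def coarsen_fm(fm, grid=0.05, noise_sd=0.0, rng=None):
    """Coarsen a fidelity measure before release: optionally jitter it with
    Gaussian noise, then round to the nearest grid point and clip to [0, 1]."""
    if noise_sd > 0.0:
        rng = rng or random.Random()
        fm += rng.gauss(0.0, noise_sd)
    fm = round(fm / grid) * grid
    return min(1.0, max(0.0, fm))
```

Rounding alone already defeats the exact-inversion attacks above, since many true FM values map to the same released value; the open question on the slide is how coarse is coarse enough without destroying utility.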

Randomness in FM measures
- Base FM on Q(M) and Q(O-), where O- is built by randomly deleting k records from the data used in Q(O) and resampling the deleted records with replacement (random seed defined by Q).
- FM is never exact except by random chance.
- Adds large noise to Q with small sample sizes and small noise to Q with large sample sizes.
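The delete-and-resample construction can be sketched as below. The function name, the default k, the CRC32 seeding of the RNG from the query, and the choice of Q = sample mean are all illustrative assumptions; the key property is that repeating the same query returns the same answer, so an intruder cannot average the randomness away.

```python
import zlib
import numpy as np

def randomized_fm(y_orig, q_masked, k=5, query_tag="mean_of_Y"):
    """Distance-type FM computed against a perturbed original O-: delete k
    randomly chosen records, refill by resampling the deleted ones with
    replacement, and seed the RNG from the query so repeats give one answer."""
    seed = zlib.crc32(query_tag.encode())      # deterministic seed from Q
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_orig))
    dropped, kept = idx[:k], idx[k:]
    refill = rng.choice(y_orig[dropped], size=k, replace=True)
    y_minus = np.concatenate([y_orig[kept], refill])
    return abs(q_masked - y_minus.mean())      # Q = sample mean in this sketch
```

Since only k of n records are perturbed, the induced noise shrinks as n grows, matching the small-sample/large-sample behavior stated on the slide.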

Where to go from here
- Evaluate how much noise infusion/aggregation/resampling of the FM is needed to defeat the attacks against top-coding or noise addition.
- Begin to develop the system.