1 A Common Measure of Identity and Value Disclosure Risk Krish Muralidhar University of Kentucky Rathin Sarathy Oklahoma State University

2 Context This study presents developments in the context of numerical data that have been masked and released. We assume that any categorical data have not been masked; this assumption can be relaxed.

3 Empirical Assessment of Disclosure Risk Is there a link between identity and value disclosure that will allow us to use a "common" measure of both?

4 Basis for Disclosure The “strength of the relationship”, in a multivariate sense, between the two datasets (original and masked) accounts for disclosure risk

5 Value Disclosure Value disclosure is based on the "strength of relationship": Palley & Simonoff (1987) (R² measure for individual variables); Tendick (1992) (R² for linear combinations); Muralidhar & Sarathy (2002) (canonical correlation). Implicit assumption: the snooper can use linear models to improve predictions of confidential values (Palley & Simonoff 1987; Fuller 1993; Tendick 1992; Muralidhar & Sarathy 1999, 2001).

6 Identity Disclosure Assessment of identity disclosure is often empirical in nature, e.g., Winkler's software (Census Bureau), based on a modified Fellegi-Sunter algorithm. The number (or proportion) of observations correctly re-identified represents an assessment of identity disclosure risk. Theoretical attempts for numerical data: Fuller (1993) (linear model); Tendick (1992) (linear model); Fienberg, Makov, Sanil (1997) (Bayesian).

7 Fuller's Measure Given the masked dataset Y and the original dataset X, and assuming normality, the probability that the j-th released record corresponds to any particular record that the intruder may possess is given by P_j = (Σ_t k_t)⁻¹ k_j. The intruder chooses the record j that maximizes k_j = exp{−0.5 (X − YH) A⁻¹ (X − YH)′}, where A = Σ_XX − Σ_XY (Σ_YY)⁻¹ Σ_YX and H = (Σ_YY)⁻¹ Σ_YX. P_j may be treated as the identification probability (identity risk) of any particular record; averaging over all records gives a mean identification probability, or mean identity disclosure risk, for the whole masked dataset.
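Fuller's measure can be computed directly from the covariance blocks. The sketch below is a hypothetical Python illustration, not part of the original slides: the function name is made up, and it assumes the covariance blocks Σ_XX, Σ_XY, Σ_YY are supplied (e.g., estimated from the data). It forms H and A as defined on the slide and normalizes the k_j values into identification probabilities.

```python
import numpy as np

def fuller_match_probs(X, Y, Sxx, Sxy, Syy):
    """Identification probabilities in the spirit of Fuller's measure.

    For each original record x_i, compute k_j over all masked records y_j,
    then normalize so the probabilities across j sum to one.
    """
    # H = (Syy)^-1 Syx, so YH is the linear predictor of X from Y
    H = np.linalg.solve(Syy, Sxy.T)
    # A = Sxx - Sxy (Syy)^-1 Syx, the conditional covariance of X given Y
    A = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)
    A_inv = np.linalg.inv(A)
    pred = Y @ H                              # predicted X for every masked record
    probs = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        d = x - pred                          # residual against each masked record
        k = np.exp(-0.5 * np.einsum('nj,jk,nk->n', d, A_inv, d))
        probs[i] = k / k.sum()
    return probs
```

Row i of the result gives the probability, over all masked records j, that record j is record i; the intruder's rule on the slide corresponds to taking the argmax of each row.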

8 Fuller's distance measure Based on best conditional densities. While restricted to normal datasets, it relates identity risk to the association between the two datasets (though somewhat indirectly), as indicated by k_j, which contains Σ_XY. It shows the connection between distance-based measures and probability-based measures.

9 Our Goal To show that both value disclosure and identity disclosure are determined by the degree of association between the masked and original datasets. This must be true, since both are based on best predictors. When the best predictors are linear (e.g., multivariate normal datasets), canonical correlation can capture the association, and both value disclosure and identity disclosure risk must be expressible in terms of canonical correlations. This has already been shown for value disclosure (Muralidhar et al. 1999; Sarathy et al. 2002). We show here the relationship between identity disclosure and canonical correlation.

10 Canonical Correlation Version of Fuller's Distance Measure (X − YH) A⁻¹ (X − YH)′ = (U − VΛ^0.5) C⁻¹ (U − VΛ^0.5)′, where U = X(Σ_XX)^−0.5 e (the canonical variates for the X variables), V = Y(Σ_YY)^−0.5 f (the canonical variates for the Y variables), C = (I − Λ), e contains the eigenvectors of (Σ_XX)^−0.5 Σ_XY (Σ_YY)⁻¹ Σ_YX (Σ_XX)^−0.5, f contains the eigenvectors of (Σ_YY)^−0.5 Σ_YX (Σ_XX)⁻¹ Σ_XY (Σ_YY)^−0.5, and Λ is the diagonal matrix of eigenvalues, which are the squared canonical correlations.
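The squared canonical correlations (the diagonal of Λ) are the eigenvalues of (Σ_XX)^−0.5 Σ_XY (Σ_YY)⁻¹ Σ_YX (Σ_XX)^−0.5. A minimal Python sketch of that computation (the helper name is illustrative, not from the slides):

```python
import numpy as np

def squared_canonical_corrs(Sxx, Sxy, Syy):
    """Squared canonical correlations (diagonal of Lambda) from covariance blocks."""
    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T
    R = inv_sqrt(Sxx)
    # M = (Sxx)^-0.5 Sxy (Syy)^-1 Syx (Sxx)^-0.5, whose eigenvectors form e
    M = R @ Sxy @ np.linalg.solve(Syy, Sxy.T) @ R
    lam = np.linalg.eigvalsh(M)[::-1]       # descending order
    return np.clip(lam, 0.0, 1.0)          # guard against tiny numerical excursions
```

For example, with Σ_XX = Σ_YY = I and Σ_XY = diag(0.6, 0.3), the squared canonical correlations are 0.36 and 0.09.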

11 Therefore… Identity disclosure risk is a function of the (linear) association between the two datasets (the lambdas, i.e., the squared canonical correlations). The expression (U − VΛ^0.5)(I − Λ)⁻¹(U − VΛ^0.5)′ relates this association to identity disclosure and provides an "operational" way to assess the risk: compute this distance measure and match each original record to the masked record that minimizes the expression. The number of re-identified records then gives an overall empirical assessment of identity disclosure risk for a masked data release. (Empirical results shown later.)
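The matching rule just described can be sketched as follows, assuming the canonical variates U, V and the squared canonical correlations have already been computed; the function name is illustrative, not from the slides:

```python
import numpy as np

def count_reidentified(U, V, lam):
    """Match each original record i to the masked record j minimizing
    (U_i - V_j Lam^0.5)(I - Lam)^-1 (U_i - V_j Lam^0.5)' and count
    how many records are matched back to themselves."""
    W = V * np.sqrt(lam)           # rows of V Lambda^0.5
    wts = 1.0 / (1.0 - lam)        # diagonal of (I - Lambda)^-1
    hits = 0
    for i in range(U.shape[0]):
        d2 = ((U[i] - W) ** 2 * wts).sum(axis=1)   # distance to every masked record
        if d2.argmin() == i:
            hits += 1
    return hits
```

With strongly associated variates (lambdas near 1) most records are matched back correctly; as the lambdas shrink toward 0, the count falls toward the chance level of 1.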

12 Mean Identification Probability (MIDP) Tendick computed bounds on identification probabilities for correlated additive noise methods. His expressions are specific to that method rather than the general case. We show a lower bound on MIDP for the general case (regardless of masking technique) that is based on canonical correlations.

13 Bound on MIDP For a data set of size n with k confidential variables X, masked using any procedure to produce Y, the mean identification probability is given by: [formula not shown in the transcript]

14 Identification Probability (IDP) For any given observation i in the original data set, the probability that it will be re-identified is given by: [formula not shown in the transcript], where U_ij is the canonical variate for X_ij

15 An Example Consider a data set with 10 variables and a specified covariance matrix. Assume the data are perturbed using simple noise addition with different levels of variance. Compute MIDP for different sample sizes and different noise variances.

16 Covariance Matrix of X

17 MIDP

18 Additive (Correlated) Noise Kim (1986) suggested that the covariance structure of the noise term should be the same as that of the original confidential variables (dΣ_XX, where d is a constant representing the "level" of noise). In this case, the canonical correlation for each (masked, original) variable pair is [1/(1+d)]^0.5.
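This claim can be checked numerically: with Y = X + e and cov(e) = dΣ_XX, we get Σ_XY = Σ_XX and Σ_YY = (1+d)Σ_XX, so every squared canonical correlation equals 1/(1+d). A small check (the covariance matrix and d below are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

d = 0.5                                    # illustrative noise level
Sxx = np.array([[1.0, 0.4],
                [0.4, 2.0]])               # arbitrary covariance of X
Sxy = Sxx.copy()                           # cov(X, X + e) = Sxx when e is independent of X
Syy = (1 + d) * Sxx                        # cov(X + e) with cov(e) = d * Sxx

# squared canonical correlations = eigenvalues of
# (Sxx)^-0.5 Sxy (Syy)^-1 Syx (Sxx)^-0.5
w, V = np.linalg.eigh(Sxx)
R = V @ np.diag(w ** -0.5) @ V.T
lam = np.linalg.eigvalsh(R @ Sxy @ np.linalg.solve(Syy, Sxy.T) @ R)
# every entry of lam equals 1/(1+d), so each canonical correlation is (1/(1+d))**0.5
```

Note the result does not depend on Σ_XX itself, only on d, which is why the slide can state a single value for every variable pair.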

19 MIDP

20 Comparison of Simple Additive and Correlated Noise For the same noise level, correlated noise results in higher identity disclosure risk (Tendick 1993 also observed this) but lower value disclosure risk (Tendick and Matloff 1994; Muralidhar et al. 1999).

21 Other Procedures For some other procedures (microaggregation, data swapping, etc.), it may be necessary to perform the masking and use the resulting data to compute the canonical correlations.
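A sketch of that approach in Python: estimate the covariance blocks from the (original, masked) data pair itself and extract the canonical correlations. The helper name is illustrative, not from the slides:

```python
import numpy as np

def sample_canonical_corrs(X, Y):
    """Canonical correlations estimated from an (original, masked) data pair."""
    k = X.shape[1]
    # joint sample covariance, then split into the four blocks
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    Sxx, Sxy, Syy = S[:k, :k], S[:k, k:], S[k:, k:]
    w, V = np.linalg.eigh(Sxx)
    R = V @ np.diag(w ** -0.5) @ V.T        # (Sxx)^-0.5
    lam = np.linalg.eigvalsh(R @ Sxy @ np.linalg.solve(Syy, Sxy.T) @ R)
    return np.sqrt(np.clip(lam, 0.0, 1.0))[::-1]   # descending canonical correlations
```

As a sanity check, an "unmasked" release (Y identical to X) yields canonical correlations of 1, the worst case for disclosure.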

22 Data Sets with Categorical Non-confidential Variables MIDP can be computed for subsets as well. Example: a data set with 2000 observations; six numerical variables; three categorical (non-confidential) variables: gender, marital status, and age group (1–6). The masking procedure is the rank-based proximity swap.

23 MIDP

24 Using IDP We can use the IDP bound to implement a record re-identification procedure by choosing the masked record with the highest IDP value.

25 An IDP Example A data set consisting of 25 observations from an MVN(0,1) distribution, perturbed using independent noise with variance = 0.45. MIDP = [value not shown in the transcript]; approximately 6 observations should be re-identified by this criterion. Re-identification by chance = 1/n = 0.04.
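Since the transcript does not reproduce the IDP formula or the dimensionality of the data, a rough empirical analogue can still be simulated: treat the data as univariate N(0,1), add independent noise with variance 0.45, and link each original value to its nearest masked value. This nearest-neighbor rule is a stand-in for the authors' IDP procedure, not their exact method, and the univariate assumption is ours:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 25, 200
rates = []
for _ in range(trials):
    x = rng.normal(size=n)                               # original values, N(0,1)
    y = x + rng.normal(scale=np.sqrt(0.45), size=n)      # independent noise, var 0.45
    # snooper links each original value to the closest masked value
    match = np.abs(x[:, None] - y[None, :]).argmin(axis=1)
    rates.append((match == np.arange(n)).mean())
mean_rate = float(np.mean(rates))
# mean_rate should clearly exceed the 1/n = 0.04 chance rate
```

The point of the exercise is qualitative: even crude linkage against noisy data re-identifies records at well above the chance rate, which is what the slide's "approximately 6 of 25" figure illustrates.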

26 An IDP Example

27 Advantages It is possible to compute MIDP with just aggregate information. IDP can be used as a "record-linkage" tool for assessing the disclosure risk characteristics of a masking technique. Computationally easier than existing alternative methods.

28 Disadvantages Assumes that the data have a multivariate normal distribution. For large n, the lower bound is weak: MIDP appears to be overly pessimistic. We are working on finding out why this is so, and possibly on modifying the bound.

29 Weak Bound? Sample result: n = 50, simple noise addition. [Table with columns Noise, MIDP, Actual; values not shown in the transcript.]

30 Conclusion Canonical correlation analysis can be used to assess both identity and value disclosure. For normal data, it provides the best measure of both.

31 Further Research Sensitivity to the normality assumption; comparison with Fellegi-Sunter-based record linkage procedures; refining the bounds.

32 Our Research You can find the details of our current and prior research at: