Numerical Data Masking Techniques for Maintaining Sub-Domain Characteristics Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University

Presentation transcript:

1 Numerical Data Masking Techniques for Maintaining Sub-Domain Characteristics Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University

2 Ideal Data Utility for Masking Numerical Data Ideally, the results of all analyses using the masked data should be identical to those obtained using the original data. This is impossible to achieve in practice.

3 Practical Data Utility The results of most analyses using the masked data should be very similar to those obtained using the original data. The performance of the masking technique should be predictable (theory-based methods are preferable to ad hoc methods).

4 Practical Assessment of Data Utility
Univariate (marginal) characteristics
Maintain some sufficient statistics
– When sufficient statistics are maintained in the masked data, the results of analyses based on these statistics are guaranteed to be exactly the same for the masked data as for the original data
Relationships
– Linear
– Monotonic
– Non-monotonic

5 Sub-domain Characteristics An important component of data utility for government agencies and users is that the masked data maintain the characteristics of the original data within sub-domains. With a few exceptions, this aspect of data utility has NOT been directly addressed when evaluating techniques for masking numerical data.

6 Preferred Techniques
In this study, we investigate the performance of two techniques in maintaining sub-domain characteristics when masking numerical data.
– Sufficiency-based perturbation approach (Burridge 2003; Muralidhar and Sarathy 2007)
– Data shuffling (Muralidhar and Sarathy 2006)
Why these two techniques?
– They can maintain certain characteristics for sub-domains exactly
– They dominate the performance of other techniques for masking numerical data

7 Sufficiency Based Linear Models X, S, and Y represent the confidential, non-confidential, and masked data, respectively; ε represents the noise term. Σ represents the covariance matrix between variables. The specification of β₂ dictates the extent of the relationship between the original and masked data.

8 Data Shuffling (US Patent # )
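
The slide gives no procedure, so the following is only a rough sketch of the rank-replacement step usually described for data shuffling (Muralidhar and Sarathy 2006): model-generated perturbed values are replaced, rank for rank, by the original values. The `proxy` array stands in for those perturbed values and is a hypothetical input; the function names are illustrative and do not come from the presentation or the patent. Applying the step inside each sub-group is an assumption made here to show why sub-group marginals are preserved.

```python
import numpy as np

def shuffle_by_rank(x, proxy):
    """Return a permutation of x whose rank order follows that of proxy."""
    x = np.asarray(x)
    proxy = np.asarray(proxy)
    # ranks[i] is the rank (0 = smallest) of proxy[i]
    ranks = np.argsort(np.argsort(proxy))
    # the i-th record receives the original value with the same rank
    return np.sort(x)[ranks]

def shuffle_within_groups(x, groups, proxy):
    """Apply the rank replacement separately inside every sub-group."""
    x = np.asarray(x)
    groups = np.asarray(groups)
    proxy = np.asarray(proxy)
    masked = np.empty_like(x)
    for g in np.unique(groups):
        idx = groups == g
        masked[idx] = shuffle_by_rank(x[idx], proxy[idx])
    return masked

# Toy data: six records in two sub-groups (all values are made up).
x = np.array([5.0, 1.0, 3.0, 9.0, 7.0, 2.0])
groups = np.array([0, 0, 0, 1, 1, 1])
proxy = np.array([0.2, 0.9, 0.5, 0.1, 0.3, 0.8])
print(shuffle_within_groups(x, groups, proxy))
```

Because each sub-group's masked values are a permutation of its original values, every univariate statistic of that sub-group is unchanged; only the assignment of values to records differs.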

9 Examples
– Simulated example
– Census data
In this presentation, we focus on the simulated data. The manuscript has a complete discussion of the results for the Census data.

10 Simulated Example
Number of observations =
Three categorical, non-confidential variables
– Gender (Male or Female)
– Marital Status (Married or Other)
– Age Group (1 to 6)
Total of 24 sub-groups
Three numerical, confidential variables
– Home value (positive, non-normal)
– Mortgage balance (positive, non-normal)
– Net value of assets (normal)

11 Methods
Data Shuffling
Three sufficiency-based perturbations, in which, given S:
– Y is conditionally independent of X (d = 0.00)
– Y is moderately related to X (d = 0.50)
– Y is closely related to X (d = 0.90)
where d is the value of the diagonal elements of the diagonal matrix β₂
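
To make the role of d concrete, here is a minimal sketch of a perturbation that keeps the mean vector and covariance matrix of one sub-group exactly. It is a simplification and not the authors' formulation: it assumes the perturbation is applied within a single sub-group (so S is constant and drops out), takes the form Y = mean + d (X - mean) + e, and builds the noise e so that it has sample mean zero, zero sample covariance with X, and sample covariance (1 - d²) times that of X. The function name and the restriction 0 <= d < 1 are choices made for this illustration.

```python
import numpy as np

def sufficiency_perturb(X, d, rng):
    """Perturb X (n records by k variables) while keeping its sample mean
    vector and covariance matrix exactly. Sketch only, for 0 <= d < 1."""
    if not 0.0 <= d < 1.0:
        raise ValueError("this sketch requires 0 <= d < 1")
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.atleast_2d(np.cov(X, rowvar=False))

    # Raw noise, centred, with the part explained by X regressed out,
    # so the residual R is exactly uncorrelated with X in this sample.
    Z = rng.standard_normal((n, k))
    Z -= Z.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Z, rcond=None)
    R = Z - Xc @ coef

    # Whiten R so that its sample covariance is exactly the identity.
    L_R = np.linalg.cholesky(np.atleast_2d(np.cov(R, rowvar=False)))
    W = R @ np.linalg.inv(L_R).T

    # Recolour the noise to sample covariance (1 - d^2) * cov.
    L_T = np.linalg.cholesky((1.0 - d ** 2) * cov)
    E = W @ L_T.T

    return mean + d * Xc + E

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 3))          # made-up non-normal data
for d in (0.0, 0.5, 0.9):
    Y = sufficiency_perturb(X, d, rng)
    same_mean = np.allclose(Y.mean(axis=0), X.mean(axis=0))
    same_cov = np.allclose(np.cov(Y, rowvar=False), np.cov(X, rowvar=False))
    print(d, same_mean, same_cov)
```

With d = 0 the masked values depend on X only through its mean and covariance; as d approaches 1 the masked values track the original values more closely, which is the trade-off the three settings are meant to explore. Note that nothing in this construction keeps the masked values positive or preserves the shape of the marginal distribution.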

12 Evaluation
Compare the performance of the techniques in sub-domains
– Disclosure risk
  Identity (assessed using the procedure by Fuller (1993))
  Value (assessed by comparing the proportion of variance explained in the confidential variables before and after masking)
– Data utility
  Marginal (or univariate) distribution
  Linear relationship between variables
  Non-linear relationship between variables

13 Risk of Identity Disclosure

14 Risk of Value Disclosure Perturbed data with d = 0.50 and d = 0.90 result in increased predictive accuracy. Does it matter?
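
One way to read "increased predictive accuracy" is as a comparison of the proportion of variance in a confidential variable explained by S alone against that explained by S together with the masked values. The sketch below computes that comparison with ordinary least squares; the helper name `r_squared`, the single numeric stand-in for S, and the simulated data are all assumptions made for illustration, not the procedure used in the study.

```python
import numpy as np

def r_squared(target, predictors):
    """Proportion of variance in target explained by an OLS fit
    on predictors (an intercept is added automatically)."""
    target = np.asarray(target, dtype=float)
    A = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    total = target - target.mean()
    return 1.0 - (resid @ resid) / (total @ total)

rng = np.random.default_rng(0)
n = 1000
s = rng.standard_normal(n)                  # stand-in for non-confidential data
x = s + rng.standard_normal(n)              # confidential variable
y_indep = s + rng.standard_normal(n)        # masked, independent of x given s
y_close = 0.9 * x + 0.1 * rng.standard_normal(n)  # masked, closely tied to x

print("R2(X|S)          ", r_squared(x, s))
print("R2(X|S, Y_indep) ", r_squared(x, np.column_stack([s, y_indep])))
print("R2(X|S, Y_close) ", r_squared(x, np.column_stack([s, y_close])))
```

When the masked values are conditionally independent of X given S, the two proportions should differ only by sampling noise; a clear gap indicates that releasing the masked data helps an intruder predict the confidential values.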

15 Marginal (or Univariate) Distribution (Mortgage Balance) (Entire Data Set)

16 Sub-group Marginal Distribution (Home Value) (Gender = 0, Marital = 0, Age = 1)

17 Product Moment Correlation

18 Non-Linear Relationships

19 Rank Order Correlation

20 Comparison of the Methods: Data Shuffling
1. Disclosure risk
   1. Identity disclosure risk is 1/n
   2. Providing access to masked data does not improve predictive ability [R²(X|S,Y) = R²(X|S)]
2. The mean, covariance and in fact the entire univariate distributions of masked data are exactly the same as the original data for every sub-group and the entire data set
3. Maintains (asymptotically), for every sub-domain and the overall data set,
   1. Covariance matrix
   2. Product moment correlation matrix
   3. Rank order correlation matrix

21 Comparison of the Methods: Sufficiency Based Method
1. Disclosure risk is minimized for the perturbed data set when d = 0, but not in the other cases.
2. The univariate distribution of the masked data is very different from the original data.
3. Maintains (exactly), for every sub-domain and the entire data set,
   1. Mean vector
   2. Covariance matrix
   3. Product moment correlation
4. Does not maintain rank order correlation

22 Conclusion If it is known that the data will be used exclusively for traditional, parametric analysis, sufficiency-based methods offer the best performance. In all other cases, data shuffling offers the best performance.

23 Future Research
We need to explore this topic further
– Our initial results suggest that both techniques may even be capable of maintaining all types of relationships between the non-confidential variables and the masked variables. Is this true for all cases?
– What if arbitrary sub-domains are created by using numerical variables?

24 For more details on our work, please visit: gatton.uky.edu/faculty/muralidhar/maskingpapers We have CDs with copies of our paper, presentation, and the data sets. We will be happy to share them with you.

25 We welcome your questions, comments, or suggestions. Thank you.