Data Shuffling for Protecting Confidential Data Data Shuffling for Protecting Confidential Data A Software Demonstration Rathindra Sarathy* and Krish Muralidhar**

Slides:



Advertisements
Similar presentations
Combining Like Terms. Only combine terms that are exactly the same!! Whats the same mean? –If numbers have a variable, then you can combine only ones.
Advertisements

Estimating Identification Risks for Microdata Jerome P. Reiter Institute of Statistics and Decision Sciences Duke University, Durham NC, USA.
Numerical Analysis. TOPIC Interpolation TOPIC Interpolation.
Data Mining in Computer Games By Adib Adam Hussain & Mohammed Sarfraz.
Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University
Controlled shuffling experiment: detailed 10% sample of 2011 census of Ireland - Risk, confidentiality and utility Presenter: Robert McCaa,
SDC for continuous variables under edit restrictions Natalie Shlomo & Ton de Waal UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006.
1 A Common Measure of Identity and Value Disclosure Risk Krish Muralidhar University of Kentucky Rathin Sarathy Oklahoma State University.
11 ACS Public Use Microdata Samples of 2005 and 2006 – How to Use the Replicate Weights B. Dale Garrett and Michael Starsinic U.S. Census Bureau AAPOR.
Welcome to Dave’s Data Demonstration This presentation is designed for users of SPSS with some familiarity with the program and a willingness to experiment.
Leting Wu Xiaowei Ying, Xintao Wu Dept. Software and Information Systems Univ. of N.C. – Charlotte Reconstruction from Randomized Graph via Low Rank Approximation.
Converting Between Numbers and Characters CSC 1401: Introduction to Programming with Java Week 4 – Lecture 3 Wanda M. Kunkle.
Statistics 350 Lecture 21. Today Last Day: Tests and partial R 2 Today: Multicollinearity.
Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte.
Studying Behavior. Variables Any event, situation, behavior or individual characteristic that varies In our context, the “things” that make up an experiment.
SAC’06 April 23-27, 2006, Dijon, France Towards Value Disclosure Analysis in Modeling General Databases Xintao Wu UNC Charlotte Songtao Guo UNC Charlotte.
Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
Chapter 11 Multiple Regression.
1 When Does Randomization Fail to Protect Privacy? Wenliang (Kevin) Du Department of EECS, Syracuse University.
Analyzing quantitative data – section III Week 10 Lecture 1.
Data Analysis Statistics. Levels of Measurement Nominal – Categorical; no implied rankings among the categories. Also includes written observations and.
Aspects of Bayesian Inference and Statistical Disclosure Control in Python Duncan Smith Confidentiality and Privacy Group CCSR University of Manchester.
Data Envelopment Analysis. Weights Optimization Primal – Dual Relations.
A PRIMER ON DATA MASKING TECHNIQUES FOR NUMERICAL DATA Krish Muralidhar Gatton College of Business & Economics.
1 Numerical Data Masking Techniques for Maintaining Sub-Domain Characteristics Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State.
Selecting the Correct Statistical Test
Topic 16: Multicollinearity and Polynomial Regression.
CHAPTER 2 Making and Using Graphs
Chapter 9 Comparing More than Two Means. Review of Simulation-Based Tests  One proportion:  We created a null distribution by flipping a coin, rolling.
Discriminant Function Analysis Basics Psy524 Andrew Ainsworth.
Bayesian Sets Zoubin Ghahramani and Kathertine A. Heller NIPS 2005 Presented by Qi An Mar. 17 th, 2006.
1 Trend Analysis Step vs. monotonic trends; approaches to trend testing; trend tests with and without exogeneous variables; dealing with seasonality; Introduction.
Quantification of the non- parametric continuous BBNs with expert judgment Iwona Jagielska Msc. Applied Mathematics.
2 Categorical Variables (frequencies) Testing mean differences of a continuous variable between groups (categorical variable) 2 Continuous Variables 2.
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
10/22/20151 PUAF 610 TA Session 8. 10/22/20152 Recover from midterm.
The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM Researcher Janika Konnu Manchester, United Kingdom December.
Rules for Means and Variances. Rules for Means: Rule 1: If X is a random variable and a and b are constants, then If we add a constant a to every value.
Descriptive Statistics vs. Factor Analysis Descriptive statistics will inform on the prevalence of a phenomenon, among a given population, captured by.
WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census Natalie Shlomo University of Southampton Office for National Statistics.
1 IPAM 2010 Privacy Protection from Sampling and Perturbation in Surveys Natalie Shlomo and Chris Skinner Southampton Statistical Sciences Research Institute.
 Some variables are inherently categorical, for example:  Sex  Race  Occupation  Other categorical variables are created by grouping values of a.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
Can you eat your cake and have it too? S haring healthcare data without compromising privacy or confidentiality 12 th National HIPAA Summit Concurrent.
Protection of frequency tables – current work at Statistics Sweden Karin Andersson Ingegerd Jansson Karin Kraft Joint UNECE/Eurostat.
European Conference on Quality in Official Statistics, Rome, July 2008 Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata.
Daniel O. Rice Loyola College in Maryland (with Robert Garfinkel and Ram Gopal University of Connecticut) The Protection of Numerical Information in Databases.
Lesson Topic: The Relationship of Addition and Subtraction, Multiplication and Division, Multiplication and Addition, and Division and Subtraction Lesson.
Privacy-preserving data publishing
Course Title: Using Epi Info™ 7 Using Classic Analysis (Continuation) April Epi Info™ 7 Training Software for Public Health Epi Info™ 7 Training.
Correlation Chapter 6. What is a Correlation? It is a way of measuring the extent to which two variables are related. It measures the pattern of responses.
Microdata masking as permutation Krish Muralidhar Price College of Business University of Oklahoma Josep Domingo-Ferrer UNESCO Chair in Data Privacy Dept.
Christopher Dougherty EC220 - Introduction to econometrics (chapter 2) Slideshow: exercise 2.11 Original citation: Dougherty, C. (2012) EC220 - Introduction.
Social Science Research Design and Statistics, 2/e Alfred P. Rovai, Jason D. Baker, and Michael K. Ponton Spearman Rank Order Correlation Test PowerPoint.
= the matrix for T relative to the standard basis is a basis for R 2. B is the matrix for T relative to To find B, complete:
Basics in R part 2. Variable types in R Common variable types: Numeric - numeric value: 3, 5.9, Logical - logical value: TRUE or FALSE (1 or 0)
NMF Demo: Lee, Seung Bryan Russell Computer Demonstration.
 Kolmogor-Smirnov test  Mann-Whitney U test  Wilcoxon test  Kruskal-Wallis  Friedman test  Cochran Q test.
Reconciling Confidentiality Risk Measures from Statistics and Computer Science Jerry Reiter Department of Statistical Science Duke University.
Oracle apps application. JavaWindow is now identified as window() While running now script not able to identify ‘Find Requesitions’ Original script which.
Analyze Of VAriance. Application fields ◦ Comparing means for more than two independent samples = examining relationship between categorical->metric variables.
Spearman’s Rho Correlation
Measures for Information Loss in Protected Data
A Primer on Data Masking Techniques for Numerical Data
Dynamic graphics, Principal Component Analysis
Bryan Russell Computer Demonstration
Statistical Inference for Managers
Amos Introduction In this tutorial, you will be briefly introduced to the student version of the SEM software known as Amos. You should download the current.
Correlation & Trend Lines
Imputation as a Practical Alternative to Data Swapping
Presentation transcript:

Data Shuffling for Protecting Confidential Data Data Shuffling for Protecting Confidential Data A Software Demonstration Rathindra Sarathy* and Krish Muralidhar** * Oklahoma State University, Stillwater, OK USA ** University of Kentucky, Lexington, KY 40506, USA

Data Shuffling Method that: ◦ Combines the strengths of data perturbation and data swapping ◦ Shuffled data uses only original confidential values – no added noise or “unreasonable” values ◦ Preserves marginals exactly and all monotonic relationships among variables (preserves pairwise rank order correlation closely) ◦ Non-parametric method ◦ Maximum protection against identity and value disclosure

Technical Basis X represents M confidential variables S represents L non-confidential variables. X is assumed numerical; S can be categorical or numerical variables. Y represents the masked values of X. Let R is rank order correlation matrix of {X, S}. Define variables as follows: X* and S* as:

Technical Basis - continued

Example Dataset – Shuffled data

Example Dataset – Rank Order Correlations

Example Dataset - Relationships

Software Demo Two versions are available – both in early Beta (Java and Windows) Example shown earlier was run on Java version of software We will demonstrate the software later