Transforming Data to Satisfy Privacy Constraints. Computer Education Major, 032CSE15, 최미희 (Choi Mi-hee).

Presentation transcript:

Transforming Data to Satisfy Privacy Constraints. Computer Education Major, 032CSE15, 최미희 (Choi Mi-hee)

Page 2
Contents
1. Introduction
2. Usage-based metrics
3. Genetic algorithm framework
4. Experiments
5. Conclusion

Page 3
1. Introduction
◆ Importance of protecting individual data
- explicitly identifying data (e.g., social security number)
- potentially identifying data (e.g., date of birth, gender, zip code)
◆ How to protect
- replace any explicitly identifying information with randomized data
- but this alone is not sufficient, because identity can easily be inferred
ex) a social security number can be recovered from zip code, date of birth, and gender by linking against some other data set
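To make the inference risk concrete, below is a minimal linkage-attack sketch in Python; the tables, column names, and values are hypothetical, assuming a public record source (e.g., a voter list) that shares the quasi-identifiers zip code, date of birth, and gender with the released table.

```python
# Minimal linkage-attack sketch (hypothetical data): even with explicit
# identifiers removed, quasi-identifiers can re-identify rows by joining
# the release with a public table.

# "Anonymized" release: explicit identifiers removed, quasi-identifiers kept.
released = [
    {"zip": "47677", "dob": "1977-05-03", "gender": "F", "diagnosis": "flu"},
    {"zip": "47602", "dob": "1982-11-20", "gender": "M", "diagnosis": "asthma"},
]

# Public data set that still carries names (e.g., a voter list).
public = [
    {"name": "Alice", "zip": "47677", "dob": "1977-05-03", "gender": "F"},
    {"name": "Bob", "zip": "47602", "dob": "1982-11-20", "gender": "M"},
]

quasi = ("zip", "dob", "gender")
index = {tuple(p[q] for q in quasi): p["name"] for p in public}

for row in released:
    key = tuple(row[q] for q in quasi)
    if key in index:  # unique match => re-identification
        print(f"{index[key]} has diagnosis {row['diagnosis']}")
```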

Page 4
1. Introduction (cont'd)
◆ Approach to solving the identity disclosure problem
- perturb the data
- our approach:
(1) generalization
(2) suppression
# generalization
ex) a full date of birth -> generalization -> 1977 (year only)
◆ Goal
: preserving the anonymity of individuals by generalizations and suppressions
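A minimal sketch of the two operations on a single record; the field names and the particular generalization choices (year-only dates, truncated zip codes) are illustrative assumptions, not the paper's exact scheme.

```python
# Illustrative generalization and suppression on one record (hypothetical fields).

def generalize_dob(dob: str) -> str:
    """Generalize a full date of birth 'YYYY-MM-DD' to the year alone."""
    return dob[:4]

def generalize_zip(zipcode: str, keep: int = 3) -> str:
    """Generalize a zip code by truncating trailing digits."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def suppress(_value: str) -> str:
    """Suppression: replace the value entirely."""
    return "*"

record = {"ssn": "123-45-6789", "dob": "1977-05-03", "zip": "47677"}
transformed = {
    "ssn": suppress(record["ssn"]),        # explicit identifier: suppress
    "dob": generalize_dob(record["dob"]),  # -> "1977"
    "zip": generalize_zip(record["zip"]),  # -> "476**"
}
print(transformed)
```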

Page 5
2. Usage-based metrics
(1) Background
◆ Flexible generalization
1. Categorical information
- e.g., zip code, race, marital status
- given the set of taxonomy-tree nodes S_A of attribute A, a leaf node Y, and a node P on the path from Y to the root
- Y is generalized in A to P
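A minimal sketch of tree-based generalization; the marital-status taxonomy below is a hypothetical example standing in for the attribute hierarchies the paper assumes.

```python
# Tree-based generalization for a categorical attribute (hypothetical taxonomy).
# Each value maps to its parent; generalizing a leaf Y to an ancestor P walks up.

parent = {
    "Married": "Any", "Not married": "Any",
    "Civ-spouse": "Married", "AF-spouse": "Married",
    "Never-married": "Not married", "Divorced": "Not married",
}

def generalize(leaf: str, target: str) -> str:
    """Replace leaf value Y by its ancestor P (walk up until P is reached)."""
    node = leaf
    while node != target:
        node = parent[node]  # KeyError => target is not an ancestor of leaf
    return node

print(generalize("Divorced", "Not married"))  # -> "Not married"
print(generalize("Civ-spouse", "Any"))        # -> "Any"
```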

Page 6
2. Usage-based metrics (cont'd)
(1) Background
◆ Flexible generalization
2. Numeric information
- e.g., age, education in years
- discretize values into a set of disjoint intervals
- alternatively, a representative numeric value for each interval (e.g., the median)
ex) age: {[0,20), [20,40), [40,60), [60,80), [80,∞)}
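A minimal sketch of interval-based generalization using the slide's age intervals:

```python
# Discretizing a numeric attribute into disjoint intervals (the slide's
# age example); bisect finds the interval containing a value.
import bisect

cut_points = [20, 40, 60, 80]  # -> [0,20), [20,40), [40,60), [60,80), [80,inf)

def generalize_age(age: int) -> str:
    i = bisect.bisect_right(cut_points, age)
    lo = 0 if i == 0 else cut_points[i - 1]
    hi = "inf" if i == len(cut_points) else cut_points[i]
    return f"[{lo},{hi})"

print(generalize_age(34))  # -> "[20,40)"
print(generalize_age(85))  # -> "[80,inf)"
```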

Page 7
2. Usage-based metrics (cont'd)
(2) Multiple uses
◆ Multiple usages, unknown usage
◆ Assumption
: all potentially identifying columns are equally important
◆ Consider "loss" (the LM metric)
1. Categorical information (Fig. 1)
- M : the total number of leaf nodes
- M_P : the number of leaf nodes in the subtree rooted at node P
- loss : (M_P - 1) / (M - 1) -> 2/7
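A worked check of the categorical loss; since Fig. 1 is not reproduced here, the counts M = 8 and M_P = 3 are assumptions chosen to reproduce the slide's value of 2/7.

```python
# Loss metric (LM) for a categorical value generalized to node P:
# loss = (M_P - 1) / (M - 1). The counts below are hypothetical, chosen so
# the result matches the slide's 2/7 (the paper's Fig. 1 is not shown here).
from fractions import Fraction

M = 8    # total leaf nodes in the attribute's taxonomy tree
M_P = 3  # leaf nodes in the subtree rooted at P

loss = Fraction(M_P - 1, M - 1)
print(loss)  # 2/7: an unchanged leaf gives 0, generalizing to the root gives 1
```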

Page 8
2. Usage-based metrics (cont'd)
(2) Multiple uses
2. Numeric information
- an entry is generalized to an interval i
- lower endpoint L_i and upper endpoint U_i of that interval
- lower bound L and upper bound U for the values in the column
- loss : (U_i - L_i) / (U - L) -> 2/15
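The same check for the numeric loss; the endpoints below are assumptions chosen to reproduce the slide's 2/15.

```python
# Loss metric (LM) for a numeric value generalized to interval [L_i, U_i):
# loss = (U_i - L_i) / (U - L). Endpoints are hypothetical, chosen so the
# result matches the slide's 2/15.
from fractions import Fraction

L, U = 0, 15     # column-wide bounds
L_i, U_i = 4, 6  # interval the entry was generalized to

loss = Fraction(U_i - L_i, U - L)
print(loss)  # 2/15: an exact value gives 0, the full range gives 1
```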

Page 9
2. Usage-based metrics (cont'd)
(3) Predictive modeling use
◆ The transformed table is used to build predictive models for some attributes.
ex) modeling the customers interested in a specific category of products
◆ data accuracy vs privacy protection
◆ data accuracy is preserved when all the rows in a generalized group G have the same class label
◆ definition of the classification metric CM
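The slide names CM without giving the formula; in the underlying paper (Iyengar, KDD 2002), CM is the fraction of rows that incur a penalty, where a row is penalized if it is suppressed or its class label differs from the majority label of its generalized group. A sketch under that reading:

```python
# Classification metric CM = (sum of row penalties) / N, where a row is
# penalized (penalty 1) if it is suppressed or its class label differs from
# the majority label of its generalized group.
from collections import Counter, defaultdict

def cm(rows):
    """rows: list of (quasi_identifier_tuple_or_None, class_label);
    None means the row was suppressed."""
    groups = defaultdict(list)
    penalties = 0
    for qi, label in rows:
        if qi is None:
            penalties += 1  # suppressed rows are always penalized
        else:
            groups[qi].append(label)
    for labels in groups.values():
        majority = Counter(labels).most_common(1)[0][1]
        penalties += len(labels) - majority  # minority rows are penalized
    return penalties / len(rows)

rows = [(("1977", "476**"), "yes"), (("1977", "476**"), "yes"),
        (("1977", "476**"), "no"), (None, "yes")]
print(cm(rows))  # 0.5: one minority row + one suppressed row, out of 4
```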

Page 10
3. Genetic algorithm framework
Solving an optimization problem
◆ applies the principles of natural evolution
◆ each rule is represented as a bit string (chromosome) -> generalization
◆ with attributes A1 and A2 and classes C1 and C2:
"IF A1 AND NOT A2 THEN C2" -> "100"
"IF NOT A1 AND NOT A2 THEN C1" -> "001"
◆ generating new rules
: crossover, mutation
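A minimal sketch of the two operators on bit-string chromosomes; the chromosome length, mutation rate, and example strings are illustrative, not the paper's configuration.

```python
# Single-point crossover and bit-flip mutation on bit-string chromosomes,
# the two operators named on the slide (parameters are illustrative).
import random

def crossover(a: str, b: str) -> tuple[str, str]:
    """Swap the tails of two parents at a random cut point."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome: str, rate: float = 0.1) -> str:
    """Flip each bit independently with probability `rate`."""
    return "".join(
        bit if random.random() > rate else str(1 - int(bit))
        for bit in chromosome
    )

random.seed(0)
child1, child2 = crossover("100", "001")
print(child1, child2)  # e.g. "101" and "000"
print(mutate("100"))   # occasionally flips a bit
```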

Page 11
4. Experiments
◆ records from the adult benchmark in the UCI repository
◆ 8 attributes: age, work class, education, marital status, occupation, race, gender, native country
[Experiment 1]
1. CM: little degradation as k grows from 10
2. a low CM value (around 0.18) indicates a good transformation
3. the solution for k = 250 generalizes away all the information in the attributes
4. LM: the algorithm didn't optimize the LM metric
5. solutions are targeted at one usage
* a higher value of k -> stricter privacy constraints
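For reference, k here is the anonymity requirement the transformation must satisfy: every combination of potentially identifying values must be shared by at least k rows. A minimal check, independent of the paper's search procedure:

```python
# k-anonymity check: every quasi-identifier combination in the transformed
# table must be shared by at least k rows (suppressed rows excluded here).
from collections import Counter

def is_k_anonymous(rows, k: int) -> bool:
    """rows: list of quasi-identifier tuples after generalization."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())

table = [("1977", "476**"), ("1977", "476**"), ("1982", "476**")]
print(is_k_anonymous(table, 2))  # False: ("1982", "476**") occurs only once
```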

Page 12
4. Experiments (cont'd)
[Experiment 2]
1. LM values range from 0.21 to 0.49 as k grows from 10
2. tradeoff: level of privacy (set by k) vs loss of information (LM)
3. CM values fall in the range from 0.3 to 0.4
[From experiments 1 and 2]
- the transformation must be tailored to the purpose for which the data is disseminated
- it is difficult to produce a truly multi-purpose data set

Page 13
5. Conclusion
◆ Data transformation is done by generalization and suppression of identifying content.
◆ We considered the information loss caused by the transformation by using metrics (LM, CM).
◆ Dual goals: usefulness vs privacy.
[Future works]
◆ wider data sets: more potentially identifying attributes ↑ -> disclosure risk ↑
◆ sensitive attributes: finding an adequate way of handling them
◆ additive noise, swapping: approaches to the inferential disclosure of sensitive attributes
◆ non-identifying attributes: considering them as well -> better solutions