Template-Based Privacy Preservation in Classification Problems, IEEE ICDM 2005. Benjamin C. M. Fung, Simon Fraser University, BC, Canada; Ke Wang, Simon Fraser University, BC, Canada; Philip S. Yu, IBM T.J. Watson Research Center.




Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Is Random Model Better? -On its accuracy and efficiency-
Anonymity for Continuous Data Publishing
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Associative Classification (AC) Mining for A Personnel Scheduling Problem Fadi Thabtah.
Anonymizing Sequential Releases ACM SIGKDD 2006 Benjamin C. M. Fung Simon Fraser University Ke Wang Simon Fraser University
Hani AbuSharkh Benjamin C. M. Fung fung (at) ciise.concordia.ca
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service Benjamin C.M. Fung Concordia University Montreal, QC, Canada
Privacy-Preserving Data Mashup Benjamin C.M. Fung Concordia University Montreal, QC, Canada Noman Mohammed Concordia University.
Machine Learning and Data Mining Course Summary. 2 Outline  Data Mining and Society  Discrimination, Privacy, and Security  Hype Curve  Future Directions.
Learning on Probabilistic Labels Peng Peng, Raymond Chi-wing Wong, Philip S. Yu CSE, HKUST 1.
Privacy Preserving Association Rule Mining in Vertically Partitioned Data Reporter : Ximeng Liu Supervisor: Rongxing Lu School of EEE, NTU
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Data Mining Association Rules Yao Meng Hongli Li Database II Fall 2002.
An architecture for Privacy Preserving Mining of Client Information Jaideep Vaidya Purdue University This is joint work with Murat.
Finding Personally Identifying Information Mark Shaneck CSCI 5707 May 6, 2004.
SAC’06 April 23-27, 2006, Dijon, France Towards Value Disclosure Analysis in Modeling General Databases Xintao Wu UNC Charlotte Songtao Guo UNC Charlotte.
Basic Data Mining Techniques Chapter Decision Trees.
Basic Data Mining Techniques
Privacy Preserving K-means Clustering on Vertically Partitioned Data Presented by: Jaideep Vaidya Joint work: Prof. Chris Clifton.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
CSE 634 Data Mining Techniques Association Rules Hiding (Not Mining) Prateek Duble ( ) Course Instructor: Prof. Anita Wasilewska State University.
Privacy-Preserving Data Mining Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA Published in: ACM SIGMOD.
An Experimental Study of Association Rule Hiding Techniques Emmanuel Pontikakis* Dept. of Computer Engineering and Informatics.
Differentially Private Data Release for Data Mining Benjamin C.M. Fung Concordia University Montreal, QC, Canada Noman Mohammed Concordia University Montreal,
Differentially Private Transit Data Publication: A Case Study on the Montreal Transportation System Rui Chen, Concordia University Benjamin C. M. Fung,
Abrar Fawaz AlAbed-AlHaq Kent State University October 28, 2011
1 Data Mining Books: 1.Data Mining, 1996 Pieter Adriaans and Dolf Zantinge Addison-Wesley 2.Discovering Data Mining, 1997 From Concept to Implementation.
Shiyuan Wang, Divyakant Agrawal, Amr El Abbadi Department of Computer Science UC Santa Barbara DBSec 2010.
Integrating Private Databases for Data Analysis IEEE ISI 2005 Benjamin C. M. Fung Simon Fraser University BC, Canada Ke Wang Simon Fraser.
Differentially Private Data Release for Data Mining Noman Mohammed*, Rui Chen*, Benjamin C. M. Fung*, Philip S. Yu + *Concordia University, Montreal, Canada.
Protecting Sensitive Labels in Social Network Data Anonymization.
Treatment Learning: Implementation and Application Ying Hu Electrical & Computer Engineering University of British Columbia.
Systems and Internet Infrastructure Security (SIIS) LaboratoryPage Systems and Internet Infrastructure Security Network and Security Research Center Department.
SFU Pushing Sensitive Transactions for Itemset Utility (IEEE ICDM 2008) Presenter: Yabo, Xu Authors: Yabo Xu, Benjam C.M. Fung, Ke Wang, Ada. W.C. Fu,
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
Data Anonymization (1). Outline  Problem  concepts  algorithms on domain generalization hierarchy  Algorithms on numerical data.
Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security.
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security.
Randomization in Privacy Preserving Data Mining Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00 the following slides include.
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, David W. Cheung, Ben Kao The University of Hong.
Privacy Preserving Data Mining Benjamin Fung bfung(at)cs.sfu.ca.
Privacy vs. Utility Xintao Wu University of North Carolina at Charlotte Nov 10, 2008.
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
Privacy-preserving data publishing
1/3/ A Framework for Privacy- Preserving Cluster Analysis IEEE ISI 2008 Benjamin C. M. Fung Concordia University Canada Lingyu.
1 Privacy Preserving Data Mining Introduction August 2 nd, 2013 Shaibal Chakrabarty.
Probabilistic km-anonymity (Efficient Anonymization of Large Set-valued Datasets) Gergely Acs (INRIA) Jagdish Achara (INRIA)
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta.
Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 3 Basic Data Mining Techniques Jason C. H. Chen, Ph.D. Professor of MIS School of Business.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
1 Maintaining Data Privacy in Association Rule Mining Speaker: Minghua ZHANG Oct. 11, 2002 Authors: Shariq J. Rizvi Jayant R. Haritsa VLDB 2002.
Privacy Issues in Graph Data Publishing Summer intern: Qing Zhang (from NC State University) Mentors: Graham Cormode and Divesh Srivastava.
ACHIEVING k-ANONYMITY PRIVACY PROTECTION USING GENERALIZATION AND SUPPRESSION International Journal on Uncertainty, Fuzziness and Knowledge-based Systems,
Rule Induction for Classification Using
Privacy Preserving Data Publishing
Differential Privacy in Practice
iSRD Spam Review Detection with Imbalanced Data Distributions
Anonymizing Sequential Releases
Presented by : SaiVenkatanikhil Nimmagadda
Walking in the Crowd: Anonymizing Trajectory Data for Pattern Analysis
Presentation transcript:

Template-Based Privacy Preservation in Classification Problems IEEE ICDM 2005 Benjamin C. M. Fung Simon Fraser University BC, Canada Ke Wang Simon Fraser University BC, Canada Philip S. Yu IBM T.J. Watson Research Center

2 Outline Privacy threats caused by data mining abilities Our method: Progressive Disclosure Algorithm Experimental evaluation Related works Conclusions

3 Privacy Concern Most previous work concerns the input of data mining tools, where private information is revealed directly by inspecting the data, without sophisticated analysis. Our privacy concern is with the output of data mining methods. –The aggregate patterns can be used to infer sensitive information about individuals.

Motivating Example: A data owner wants to release a table to a data mining firm for classification analysis on Rating, but does not want the firm to infer the bankruptcy state Discharged using the attributes Job and Country. This work aims at releasing data with dual goals: –Preserve information for wanted classification analysis. –Limit usefulness of unwanted sensitive inferences.

5 Motivating Example: A data owner wants to release a table to a data mining firm for classification analysis on Rating, but does not want the firm to infer the bankruptcy state Discharged using the attributes Job and Country. Inference: {Trader, UK} → Discharged, Confidence = 4/5 = 80%. An inference is sensitive if its confidence > threshold.
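To make the arithmetic above concrete, here is a minimal, hedged sketch (Python, with hypothetical toy records standing in for the motivating table) of how the confidence of an inference is computed: conf(ic → π) is the fraction of records matching ic that also carry the sensitive value π.

```python
from fractions import Fraction

# Hypothetical toy records (Job, Country, Discharged) standing in for the
# motivating table; 4 of the 5 {Trader, UK} records are discharged.
records = [
    ("Trader", "UK", "Yes"), ("Trader", "UK", "Yes"),
    ("Trader", "UK", "Yes"), ("Trader", "UK", "Yes"),
    ("Trader", "UK", "No"),
    ("Clerk", "Canada", "No"), ("Clerk", "Canada", "No"),
]

def confidence(rows, ic, pi):
    """conf(ic -> pi): among rows matching every (index, value) pair in ic,
    the fraction whose attribute pi[0] equals pi[1]."""
    matching = [r for r in rows if all(r[i] == v for i, v in ic.items())]
    if not matching:
        return Fraction(0)
    hits = sum(1 for r in matching if r[pi[0]] == pi[1])
    return Fraction(hits, len(matching))

print(confidence(records, {0: "Trader", 1: "UK"}, (2, "Yes")))  # 4/5, i.e. 80%
```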

6 Eliminate Low Support Inferences? In data mining, association or classification rules are used to capture general patterns of large populations. –A low support means a lack of statistical significance. In privacy protection, inference rules are used to infer sensitive information about individuals. –Eliminate sensitive inferences of any support. –In fact, a sensitive inference in a small group can pose an even greater threat, because individuals in a small group are more identifiable.

7 The Problem Consider a table T(M1, …, Mm, Π1, …, Πn, Θ). Classification goal: modeling the class attribute Θ. Privacy goal: limit sensitive inferences on the sensitive attributes Πi. –Specified by one or more templates ⟨IC → π, h⟩. –IC is a set of attributes containing some masking attributes Mj, e.g., IC = {Job, Country}. –π is a value from some Πi, e.g., π = Discharged. –h is a threshold on confidence. –ic → π is an inference, where ic contains values from IC. –T satisfies ⟨IC → π, h⟩ if every matching inference ic → π has a confidence Conf(ic → π) ≤ h.
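As a hedged illustration of the satisfaction condition (the names Template and satisfies below are mine, not the paper's), a template ⟨IC → π, h⟩ is checked by enumerating every combination of IC values occurring in T and comparing its confidence for π against h:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Template:
    ic: tuple          # inference channel, e.g. ("Job", "Country")
    pi: tuple          # sensitive (attribute, value), e.g. ("Discharged", "Yes")
    h: float           # confidence threshold

def satisfies(table, template):
    """True iff every inference ic -> pi matching the template has Conf <= h."""
    matched = defaultdict(int)   # records per combination of IC values
    inferred = defaultdict(int)  # of those, how many carry the sensitive value
    attr, val = template.pi
    for row in table:            # each row is a dict {attribute: value}
        key = tuple(row[a] for a in template.ic)
        matched[key] += 1
        if row[attr] == val:
            inferred[key] += 1
    return all(inferred[key] / n <= template.h for key, n in matched.items())

# Example: T satisfies <{Job, Country} -> Discharged=Yes, 50%> only if no
# (Job, Country) combination infers Discharged=Yes with confidence above 50%.
```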

8 Flexibility of Templates Selectively protecting certain values while not protecting other values. Specifying a different threshold h for a different template IC → π. Specifying multiple inference channels ICs (even for the same π). Specifying templates for multiple sensitive attributes. These flexibilities minimize unnecessary masking, i.e., minimize unnecessary information loss.

9 Achieve goals by suppressing some values on masking attributes M1, …, Mm. To eliminate {Trader, UK} → Discharged: –Suppress Trader and Clerk to ⊥Job. –Suppress UK and Canada to ⊥Country. –Reduced confidence = 5/10 = 50%.
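A minimal sketch of the suppression step itself (again with names of my own choosing; the string "⊥Job" stands in for the paper's special value ⊥Job): replacing the listed domain values by the suppressed value merges their records, which dilutes the confidence of any inference that relied on them.

```python
def suppress(table, attribute, values, bottom):
    """Return a copy of the table in which every occurrence of the given
    domain values of `attribute` is replaced by the suppressed value `bottom`."""
    return [
        {**row, attribute: bottom} if row[attribute] in values else dict(row)
        for row in table
    ]

# e.g. eliminate {Trader, UK} -> Discharged by suppressing on both attributes:
# masked = suppress(table, "Job", {"Trader", "Clerk"}, "⊥Job")
# masked = suppress(masked, "Country", {"UK", "Canada"}, "⊥Country")
# Re-checking the template on `masked` then yields the reduced 5/10 = 50%.
```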

10 Challenges Incorrect suppression may eliminate some desired classification structures for modeling Θ. Finding an optimal suppression is hard. –For a table with a total of q distinct values on masking attributes, there are 2^q possible suppressed tables. –We present an approximate solution based on a search that iteratively improves the solution and prunes the search whenever no better solution is possible.

11 The Algorithm Progressive Disclosure Algorithm (PDA) iteratively discloses domain values, starting from the most suppressed T in which each masking attribute Mj in ∪IC contains only ⊥j. Supj contains all currently suppressed values in Mj. In each iteration, disclose one suppressed value w. To disclose a value w from Supj, we replace ⊥j with w in all suppressed records that currently contain ⊥j and originally contained w before suppression. This process repeats until no disclosure is possible without violating the set of templates.

12 Progressive Disclosure Algorithm (PDA)
1: suppress every value of Mj to ⊥j where Mj ∈ ∪IC;
2: every Supj contains all domain values of Mj ∈ ∪IC;
3: while there is a candidate in ∪Supj do
4:   find the winner w of highest Score(w) from ∪Supj;
5:   disclose w on T and remove w from ∪Supj;
6:   update Score(x) and status for x in ∪Supj;
7: end while
8: output the suppressed T and ∪Supj;
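Below is a hedged Python sketch of this loop, under simplifying assumptions of my own: scores and template checks are recomputed from scratch every iteration instead of being updated incrementally with the paper's count statistics, and information gain / privacy loss are left as a pluggable `score` callable rather than the paper's exact definitions.

```python
def pda(table, masking_attrs, templates, score, bottom="⊥"):
    """Simplified Progressive Disclosure Algorithm.

    table: list of dict rows; masking_attrs: attributes in the union of all ICs;
    templates: list of callables check(table) -> bool enforcing Conf <= h;
    score: callable(table, attr, value) -> float (the InfoGain/PrivLoss trade-off).
    """
    # Lines 1-2: start from the most suppressed table; Sup_j holds all domain values.
    sup = {a: {row[a] for row in table} for a in masking_attrs}
    original = [dict(row) for row in table]                       # remember true values
    work = [{**row, **{a: bottom for a in masking_attrs}} for row in table]

    def disclosed(rows, attr, value):
        # Replace the suppressed value by `value` in rows that originally held it.
        return [
            {**w, attr: value} if w[attr] == bottom and o[attr] == value else dict(w)
            for w, o in zip(rows, original)
        ]

    while any(sup.values()):                                      # line 3
        # Line 4: among suppressed values whose disclosure keeps all templates
        # satisfied, pick the winner with the highest score.
        valid = [
            (a, v)
            for a, vs in sup.items() for v in vs
            if all(check(disclosed(work, a, v)) for check in templates)
        ]
        if not valid:
            break                                                 # no disclosure possible
        a, v = max(valid, key=lambda av: score(work, av[0], av[1]))
        work = disclosed(work, a, v)                              # line 5
        sup[a].discard(v)                                         # lines 5-6
    return work, sup                                              # line 8
```

The real algorithm's scalability comes from line 6's incremental update of Score and validity; this sketch trades that away for brevity.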

13 (Slide figure: Conf = 5/24 = 21%; Conf = 1/4 = 25%.)

14 Search Criteria: Score Disclosing a value v gains information and loses privacy. Score(v) measures the information gain per unit of privacy loss. InfoGain(v) measures the information gain of disclosing v.

15 Search Criteria: Score PrivLoss(v) measures the privacy loss of disclosing v, defined as the average increase of Conf(IC → π) over all affected templates IC → π, where Conf and Conf_v denote the confidence before and after disclosing v. The key to the scalability of our algorithm is incrementally updating Score(v) in each iteration for the candidates v in ∪Supj. (see paper for details)
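The formulas on these two slides did not survive the transcript; the following LaTeX is a reconstruction of their likely form (an assumption on my part, not a quotation from the paper): the score trades information gain against privacy loss, and the privacy loss averages the confidence increase over the affected templates.

```latex
% Assumed forms, reconstructed from the surrounding prose:
\mathrm{Score}(v)    = \frac{\mathrm{InfoGain}(v)}{\mathrm{PrivLoss}(v) + 1},
\qquad
\mathrm{PrivLoss}(v) = \operatorname{avg}\bigl\{\,
    \mathrm{Conf}_{v}(IC \rightarrow \pi) - \mathrm{Conf}(IC \rightarrow \pi)
\,\bigr\}
```

where the average ranges over all templates IC → π affected by disclosing v, and the +1 in the denominator is assumed here only to guard against a zero privacy loss.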

16 Cost Analysis At each iteration, the cost can be summarized as two operations. 1. Scan the partitions on Link[⊥w] for disclosing the winner w and maintaining some count statistics. 2. Make use of the count statistics to update the score and status of every affected candidate without accessing data records. Thus, each iteration accesses only the records suppressed to ⊥w. The number of iterations is bounded by the number of distinct values in the masking attributes.

17 Experimental Evaluation Data quality –Experiment with a broad range of templates. –Use C4.5 classifier. –Measure classification error before and after suppression. Efficiency and Scalability

18 Data sets Japanese Credit Screening (CRX) –Credit card applications –8 categorical attributes and 2 classes –465 recs. for training and 188 recs. for testing Adult –Census data –8 categorical attributes and 2 classes –30162 recs. for training and 15060 recs. for testing –Previously used in Bayardo et al. (2005), Fung et al. (2005), Iyengar (2002), and Wang et al. (2004).

Results on CRX TopN sensitive attributes Π1, …, ΠN; an IC includes the remaining masking attributes M1, …, Mm. Base Error (BE) for original data = 15.4% Suppression Error (SE) for suppressed data Removal Error (RE) for removed Π1, …, ΠN RE-SE measures benefits of suppression Took at most 2 seconds for each experiment

20 Results on Adult Base Error (BE) for original data = 17.6% Took at most 14 seconds for each experiment.

Scalability Replicate the Adult data set and substitute some random data. A time-consuming setting: –1 sensitive attribute –Remaining 7 as masking attributes –h = 90%

22 Related Works Iyengar (2002) proposed a genetic algorithm to address the problem of k-anonymity for classification. Bayardo et al. (2005) employed generalization and suppression to address a similar problem. Our work is concerned with the output of data mining methods, where the threats are caused by what data mining tools can discover.

23 Related Works Clifton (2000) suggested eliminating sensitive inferences by limiting the data size. Verykios et al. (2004) proposed several algorithms for hiding association rules in a transaction database with minimal modification to the data. –Hide one rule at a time by either decreasing its support or its confidence. –Achieved by removing items from transactions. –Our work considers the use of the data for classification analysis and eliminates all sensitive inferences, including those with a low support.

24 Related Works Cox (1980) proposed the k%-dominance rule, which suppresses a sensitive cell if the attribute values of two or three entities in the cell contribute more than k% of the corresponding SUM statistic. –Such “cell suppression” suppresses the count or other statistics stored in a cell of a statistical table. –Very different from the “value suppression” considered in our work.

25 Conclusions Formulate a template-based privacy preservation problem. Show that suppression is an effective way to eliminate sensitive inferences. Present an effective algorithm based on a search that iteratively improves the solution. Evaluate this method on real-life data sets.

26 References
1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD, 1993.
2. R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. of the 21st IEEE ICDE, 2005.
3. C. Clifton. Using sample size to limit exposure to data mining. Journal of Computer Security, 8(4), 2000.
4. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving data mining. SIGKDD Explorations, 4(2), 2002.
5. L. H. Cox. Suppression methodology and statistical disclosure control. Journal of the American Statistical Association, Theory and Method Section, 75, 1980.

27 References
6. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proc. of the 8th ACM SIGKDD, 2002.
7. C. Farkas and S. Jajodia. The inference problem: A survey. SIGKDD Explorations, 4(2):6-11, 2002.
8. B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In Proc. of the 21st IEEE ICDE, Tokyo, Japan, 2005.
9. S. Hettich and S. D. Bay. The UCI KDD Archive, 1999.
10. V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD, 2002.
11. M. Kantarcioglu, J. Jin, and C. Clifton. When do data mining results violate privacy? In Proc. of the 2004 ACM SIGKDD, 2004.

28 References
12. J. Kim and W. Winkler. Masking microdata files. In ASA Proc. of the Section on Survey Research Methods.
13. W. Kloesgen. Knowledge discovery in databases and data privacy. In IEEE Expert Symposium: Knowledge Discovery in Databases.
14. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
15. L. Sweeney. Datafly: A system for providing anonymity in medical data. In Proc. of the 11th International Conference on Database Security.
16. L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5), 2002.

29 References
17. V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association rule hiding. IEEE TKDE, 16(4), 2004.
18. K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE ICDM, 2004.
19. R. W. Yip and K. N. Levitt. The design and implementation of a data level database inference detection system. In Proc. of the 12th International Working Conference on Database Security XII, 1999.

30 FAQ Q: Inference rules with low support are insignificant anyway; why do we bother eliminating them? A: Keeping those low-support inferences is a relaxation of our current privacy requirement. In other words, the suppression error will be even lower (better) than in our current model. If the user prefers, she may introduce another threshold (minimum support). This will further improve the classification quality.