A Framework for Privacy-Preserving Cluster Analysis (IEEE ISI 2008). Benjamin C. M. Fung, Lingyu Wang, Mourad Debbabi (Concordia University, Canada); Ke Wang (Simon Fraser University, Canada)


A Framework for Privacy-Preserving Cluster Analysis. IEEE ISI 2008. Benjamin C. M. Fung, Lingyu Wang, Mourad Debbabi, Concordia University, Canada; Ke Wang, Simon Fraser University, Canada.

Agenda: Motivation; Problem Scope: Anonymity in Clustering; Proposed Method: Top-Down Specialization (TDS); Proposed Framework; Experimental Results; Related Work; Conclusion; Q & A.

Motivation. Corporations, agencies, governments, and individuals want to share valuable information, but are reluctant to do so due to privacy issues. The focus of this study is publishing data for the purpose of cluster analysis. How can we satisfy both the privacy goal and the clustering goal?

Motivation (cont.) Real-world scenario: a data owner wants to release a person-specific data table to another party (or the public) for the purpose of cluster analysis, without compromising the privacy of the individuals in the released data. (Diagram: the data owner releases person-specific data to the data recipients; an adversary may also access the released data.)

Privacy Threat. In the tables below, a description on (Education, Sex) is so specific that not many people match it; releasing such a table makes it possible to link a unique individual, or a small number of individuals, to their sensitive information. For example, joining the two tables on the description (Doctorate, F) reveals that Emily is the single HIV record.

Education   Sex   Age   Disease   # of Recs.
9th         F     30    Flu       3
10th        M     32    Heart     4
11th        F     35    Fever     5
12th        F     37    Fever     4
Bachelors   F     42    Flu       6
Bachelors   F     44    Heart     4
Masters     M     44    Flu       4
Masters     F     44    Flu       3
Doctorate   F     44    HIV       1
                        Total:    34

Name    Education   Sex   ...
Alice   Bachelors   F     ...
Bob     Bachelors   M     ...
Cathy   Masters     F     ...
Doug    Masters     F     ...
Emily   Doctorate   F     ...

Privacy Goal: k-Anonymity
- The privacy goal is specified by anonymity on a combination of attributes called a Quasi-Identifier (QID): each description on a QID must be shared by at least k records in the table [Sweeney and Samarati 1998].
- Anonymity requirement:
  - Consider QID_1, ..., QID_p. E.g., QID_1 = {Education, Sex}.
  - a(qid_i) denotes the number of data records in T that share the value qid_i on QID_i. E.g., qid_1 = (Doctorate, Female).
  - A(QID_i) denotes the smallest a(qid_i) over all values qid_i on QID_i.
  - A table T satisfies the anonymity requirement {<QID_1, h_1>, ..., <QID_p, h_p>} if A(QID_i) >= h_i for 1 <= i <= p, where h_i is the anonymity threshold on QID_i, specified by the data owner.
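As a minimal sketch of this definition (the helper names are illustrative, not the paper's code), the check A(QID) >= h can be written directly from the counts a(qid):

```python
from collections import Counter

def a_counts(records, qid_attrs):
    """a(qid): number of records sharing each value combination on the QID."""
    return Counter(tuple(r[attr] for attr in qid_attrs) for r in records)

def satisfies(records, qid_attrs, h):
    """A table satisfies <QID, h> iff A(QID) = min over qid of a(qid) >= h."""
    counts = a_counts(records, qid_attrs)
    return min(counts.values()) >= h

# Toy rows modeled on the slide's table (Education, Sex).
table = (
    [{"Education": "9th", "Sex": "F"}] * 3
    + [{"Education": "Masters", "Sex": "F"}] * 3
    + [{"Education": "Doctorate", "Sex": "F"}] * 1
)
print(satisfies(table, ["Education", "Sex"], 4))  # False: A(QID) = 1 < 4
```

With h = 4 the check fails because the (Doctorate, F) group contains only one record, matching the violation shown in the example below.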

Anonymity Requirement. Example: QID_1 = {Education, Sex}, h_1 = 4. The rightmost column gives the counts a(qid_1).

Education   Sex   Age   Class   a(qid_1)
9th         F     30    0G3B    3
10th        M     32    0G4B    4
11th        F     35    2G3B    5
12th        F     37    3G1B    4
Bachelors   F     42    4G2B    6
Bachelors   F     44    4G0B    4
Masters     M     44    4G0B    4
Masters     F     44    3G0B    3
Doctorate   F     44    1G0B    1
                        Total:  34

A(QID_1) = 1 (the (Doctorate, F) row), so this table does not satisfy the requirement h_1 = 4.

Generalization. Generalize values in UQID_j. Before:

Education   Sex   Age   Disease   # of Recs.
9th         F     30    Flu       3
10th        M     32    Heart     4
11th        F     35    Fever     5
12th        F     37    Fever     4
Bachelors   F     42    Flu       6
Bachelors   F     44    Heart     4
Masters     M     44    Flu       4
Masters     F     44    Flu       3
Doctorate   F     44    HIV       1

After generalizing Masters and Doctorate to Grad School (the merged rows (Masters, F) and (Doctorate, F) share Flu/HIV):

Education     Sex   Age   Disease   # of Recs.
9th           F     30    Flu       3
10th          M     32    Heart     4
11th          F     35    Fever     5
12th          F     37    Fever     4
Bachelors     F     42    Flu       6
Bachelors     F     44    Heart     4
Grad School   M     44    Flu       4
Grad School   F     44    Flu/HIV   4
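A sketch of this step (the taxonomy mapping here is a hypothetical fragment, not the paper's full taxonomy trees): generalizing replaces each specific value with its parent in the taxonomy, so that smaller QID groups merge into larger ones.

```python
# Hypothetical taxonomy fragment: Masters, Doctorate -> Grad School.
TAXONOMY = {"Masters": "Grad School", "Doctorate": "Grad School"}

def generalize(records, attr, taxonomy):
    """Replace each value of `attr` that has a parent in the taxonomy
    with that parent; values without a parent are left unchanged."""
    return [{**r, attr: taxonomy.get(r[attr], r[attr])} for r in records]

rows = [{"Education": "Masters", "Sex": "M"},
        {"Education": "Doctorate", "Sex": "F"},
        {"Education": "9th", "Sex": "F"}]
print(generalize(rows, "Education", TAXONOMY))
# The Masters and Doctorate rows now both read "Grad School".
```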

Problem Statement
- Anonymity in Cluster Analysis: given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in UQID_j, generalize T to satisfy the anonymity requirement while preserving as much information as possible (the cluster structure) for cluster analysis.
- We build on the existing k-anonymity algorithms available in the current literature [Sweeney 2002; Bayardo and Agrawal 2005; Fung et al. 2005, 2007; LeFevre et al. 2005].

Intuition
- The clustering goal and the privacy goal appear to conflict:
  - Privacy goal: mask sensitive information, usually the specific descriptions that identify individuals.
  - Clustering goal: group similar items together and extract general structures that capture trends and patterns.
- Generalization eliminates outliers, but general cluster structures can be preserved. If generalization is performed carefully, identifying information can be masked while trends and patterns are still preserved for clustering.

Challenges
- What exactly are the cluster structures? What information should we preserve?
- Our previous work [Fung et al. 2005] addressed the problem of anonymity for classification analysis.

Before:

Education   Sex   Age   Class   # of Recs.
9th         F     30    0G3B    3
10th        M     32    0G4B    4
11th        F     35    2G3B    5
12th        F     37    3G1B    4
Bachelors   F     42    4G2B    6
Bachelors   F     44    4G0B    4
Masters     M     44    4G0B    4
Masters     F     44    3G0B    3
Doctorate   F     44    1G0B    1

After generalizing Masters and Doctorate to Grad School:

Education     Sex   Age   Class   # of Recs.
9th           F     30    0G3B    3
10th          M     32    0G4B    4
11th          F     35    2G3B    5
12th          F     37    3G1B    4
Bachelors     F     42    4G2B    6
Bachelors     F     44    4G0B    4
Grad School   M     44    4G0B    4
Grad School   F     44    4G0B    4

The Framework: Convert the Problem
- Step 1 (data owner): apply a clustering algorithm to the raw table T and label each record with its cluster, producing the raw labeled table T_l (clustering & labeling).
- Step 2 (data owner): apply Top-Down Specialization (TDS) to T_l, producing the generalized labeled table (generalizing).
- Step 3 (data owner): apply the clustering algorithm to the generalized table and label it again, then compare the two cluster structures using the F-measure.
- Step 4: release the generalized table to the data users.
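Step 3 can be sketched as follows (a minimal pure-Python version of the standard cluster F-measure; the function name is illustrative, and the exact variant used in the paper may differ). The raw clustering plays the role of the "classes", and each class scores the best F1 it achieves against any cluster of the generalized data, weighted by class size:

```python
def f_measure(before, after):
    """Compare two cluster structures over the same records.
    `before` and `after` are lists of cluster labels, one per record."""
    n = len(before)
    total = 0.0
    for c in set(before):
        members = {i for i, b in enumerate(before) if b == c}
        best = 0.0
        for k in set(after):
            cluster = {i for i, a in enumerate(after) if a == k}
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            p = overlap / len(cluster)   # precision
            r = overlap / len(members)   # recall
            best = max(best, 2 * p * r / (p + r))
        total += len(members) / n * best
    return total

print(f_measure([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: structures are identical
```

A value near 1 means the generalized table still induces essentially the same clusters as the raw table.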

Algorithm: Top-Down Specialization (TDS)

Initialize every value in T to the topmost value.
Initialize Cut_i to include the topmost value.
while some x in UCut_i is valid do
    Find the Best specialization with the highest Score in UCut_i.
    Perform the Best specialization on T and update UCut_i.
    Update Score(x) and validity for x in UCut_i.
end while
return generalized T and UCut_i.

Example taxonomy for Age: ANY -> [1-99) -> { [1-37), [37-99) }.
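The loop above can be sketched as follows (a simplified skeleton: `children`, `is_valid`, and `score` are placeholder hooks, not the paper's exact definitions, and the real algorithm also rewrites the table at each step):

```python
def tds(start_cut, children, is_valid, score):
    """start_cut: initial solution cut (topmost values).
    children: taxonomy map from a value to its child values.
    Repeatedly perform the valid specialization with the highest score."""
    cut = list(start_cut)
    while True:
        candidates = [v for v in cut if v in children and is_valid(v)]
        if not candidates:          # no valid specialization left
            return cut
        best = max(candidates, key=score)
        cut.remove(best)            # specialize: replace parent by children
        cut.extend(children[best])

# Hypothetical Age taxonomy from the slide: ANY -> [1-99) -> [1-37), [37-99)
children = {"ANY": ["[1-99)"], "[1-99)": ["[1-37)", "[37-99)"]}
print(tds(["ANY"], children, lambda v: True, lambda v: 1))
# -> ['[1-37)', '[37-99)']
```

In the full algorithm, `is_valid` rejects specializations that would violate the anonymity requirement, and `score` is the heuristic of the next slide.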

Search Criteria: Score. Consider a specialization v -> child(v). To heuristically maximize the information of the generalized data while achieving a given anonymity, we favor the specialization on v that has the maximum information gain per unit of privacy loss:

Score(v) = InfoGain(v) / PrivLoss(v)
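A toy illustration of the heuristic (the candidate names and all numbers are made up): among the candidate specializations, the winner is the one with the best gain-to-loss ratio, not simply the largest gain.

```python
# Hypothetical candidates: name -> (InfoGain, PrivLoss).
candidates = {
    "Education: ANY -> {Secondary, University}": (0.30, 2.0),
    "Age: [1-99) -> {[1-37), [37-99)}":          (0.25, 1.0),
}

def score(gain_loss):
    gain, loss = gain_loss
    return gain / loss

best = max(candidates, key=lambda c: score(candidates[c]))
print(best)  # the Age split wins: 0.25/1.0 = 0.25 > 0.30/2.0 = 0.15
```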

Experimental Evaluation
- Objectives:
  - Evaluate the information loss (in terms of cluster quality) caused by generalization. This is the cost of achieving anonymity.
  - Evaluate the information gain (in terms of cluster quality) over existing k-anonymization algorithms that do not aim to preserve cluster structures. This is the benefit of using our method.
- Data set: the de facto benchmark Adult data set (US census data, 45,222 records; each record represents one US resident).

Experimental Evaluation (cont.) Cost = 1 - clusterFM (the loss in cluster structure). Benefit = clusterFM - distortFM.

Experimental Evaluation (cont.)

Related Work
- [Sweeney 2002] employed bottom-up generalization to achieve k-anonymity. Single QID; does not consider a specific use of the data.
- [Iyengar 2002] proposed a genetic algorithm (GA) to address the problem of anonymity for classification. Single QID; the GA needs 18 hours to generalize the records.
- [Fung et al. 2005] proposed an efficient top-down specialization method for the problem of anonymity for classification. TDS needs only 7 seconds to generalize the same set of records (with comparable classification accuracy).

Conclusion. Quality clustering and privacy preservation can coexist. We presented an effective top-down method that iteratively specializes the data, guided by maximizing information utility and minimizing privacy specificity. The approach has great applicability to both public and private sectors that share information for mutual benefit.