A Framework for Privacy-Preserving Cluster Analysis
IEEE ISI 2008
1/3/2016
Benjamin C. M. Fung, Concordia University, Canada, fung@ciise.concordia.ca
Lingyu Wang and Mourad Debbabi, Concordia University, Canada, {wang, debbabi}@ciise.concordia.ca
Ke Wang, Simon Fraser University, Canada, wangk@cs.sfu.ca

Agenda
- Motivation
- Problem Scope: Anonymity in Clustering
- Proposed Method: Top-Down Specialization (TDS)
- Proposed Framework
- Experimental Results
- Related Work
- Conclusion
- Q & A

Motivation
- Corporations, agencies, governments, and individuals want to share valuable information, but are reluctant to do so because of privacy concerns.
- The focus of this study is publishing data for the purpose of cluster analysis.
- How can the published data satisfy both the privacy goal and the clustering goal?

Motivation (cont.)
Real-world scenario: a data owner wants to release a person-specific data table to another party (or the public) for the purpose of cluster analysis without compromising the privacy of the individuals in the released data.
(Figure: data owner, person-specific data, data recipients, adversary.)

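Privacy Threat
In the tables below, some descriptions on (Education, Sex) are so specific that few people match them. Releasing such tables can link a unique individual, or a small number of individuals, to their sensitive information. For example, only one record matches (Doctorate, F), so Emily can be linked to HIV.

| Education | Sex | Age | Disease | # of Recs. |
|-----------|-----|-----|---------|------------|
| 9th       | F   | 30  | Flu     | 3          |
| 10th      | M   | 32  | Heart   | 4          |
| 11th      | F   | 35  | Fever   | 5          |
| 12th      | F   | 37  | Fever   | 4          |
| Bachelors | F   | 42  | Flu     | 6          |
| Bachelors | F   | 44  | Heart   | 4          |
| Masters   | M   | 44  | Flu     | 4          |
| Masters   | F   | 44  | Flu     | 3          |
| Doctorate | F   | 44  | HIV     | 1          |
| Total     |     |     |         | 34         |

| Name  | Education | Sex | … |
|-------|-----------|-----|---|
| Alice | Bachelors | F   | … |
| Bob   | Bachelors | M   | … |
| Cathy | Masters   | F   | … |
| Doug  | Masters   | F   | … |
| Emily | Doctorate | F   | … |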

Privacy Goal: k-Anonymity
- The privacy goal is specified by the anonymity on a combination of attributes called a Quasi-Identifier (QID), where each description on a QID is required to be shared by at least k records in the table [Sweeney and Samarati 1998].
- Anonymity requirement:
  - Consider QID_1, …, QID_p, e.g., QID = {Education, Sex}.
  - a(qid_i) denotes the number of data records in T that share the value qid_i on QID_i, e.g., qid = {Doctorate, Female}.
  - A(QID_i) denotes the smallest a(qid_i) for any value qid_i on QID_i.
  - A table T satisfies the anonymity requirement {⟨QID_1, h_1⟩, …, ⟨QID_p, h_p⟩} if A(QID_i) ≥ h_i for 1 ≤ i ≤ p, where h_i is the anonymity threshold on QID_i, specified by the data owner.

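Anonymity Requirement
Example: QID_1 = {Education, Sex}, h_1 = 4

| Education | Sex | Age | Class | # of Recs. | a(qid_1) |
|-----------|-----|-----|-------|------------|----------|
| 9th       | F   | 30  | 0G3B  | 3          | 3        |
| 10th      | M   | 32  | 0G4B  | 4          | 4        |
| 11th      | F   | 35  | 2G3B  | 5          | 5        |
| 12th      | F   | 37  | 3G1B  | 4          | 4        |
| Bachelors | F   | 42  | 4G2B  | 6          | 10       |
| Bachelors | F   | 44  | 4G0B  | 4          | 10       |
| Masters   | M   | 44  | 4G0B  | 4          | 4        |
| Masters   | F   | 44  | 3G0B  | 3          | 3        |
| Doctorate | F   | 44  | 1G0B  | 1          | 1        |
| Total     |     |     |       | 34         |          |

A(QID_1) = 1. Since A(QID_1) < h_1 = 4, the table does not satisfy the anonymity requirement.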
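The counts in this example can be checked with a short script. This is a minimal sketch, not the authors' code: the names `TABLE` and `anonymity_counts` are made up for illustration, and only the Education, Sex, and record-count columns of the example table are used.

```python
from collections import Counter

# (Education, Sex, number of records), copied from the example table.
TABLE = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def anonymity_counts(table):
    """Return a(qid) for every value qid on QID = {Education, Sex}."""
    counts = Counter()
    for education, sex, n in table:
        counts[(education, sex)] += n
    return counts

a = anonymity_counts(TABLE)
A = min(a.values())   # A(QID): the smallest a(qid)
h = 4                 # anonymity threshold chosen by the data owner

print(dict(a))
print("A(QID) =", A, "| requirement satisfied:", A >= h)
```

The two (Bachelors, F) rows differ only in Age, which is outside the QID, so they fall into the same group of 10 records.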

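Generalization
Generalize values in ∪QID_j, the union of the quasi-identifier attributes.

Before generalization:

| Education | Sex | Age | Disease | # of Recs. |
|-----------|-----|-----|---------|------------|
| 9th       | F   | 30  | Flu     | 3          |
| 10th      | M   | 32  | Heart   | 4          |
| 11th      | F   | 35  | Fever   | 5          |
| 12th      | F   | 37  | Fever   | 4          |
| Bachelors | F   | 42  | Flu     | 6          |
| Bachelors | F   | 44  | Heart   | 4          |
| Masters   | M   | 44  | Flu     | 4          |
| Masters   | F   | 44  | Flu     | 3          |
| Doctorate | F   | 44  | HIV     | 1          |

After generalizing Masters and Doctorate to Grad School:

| Education   | Sex | Age | Disease | # of Recs. |
|-------------|-----|-----|---------|------------|
| 9th         | F   | 30  | Flu     | 3          |
| 10th        | M   | 32  | Heart   | 4          |
| 11th        | F   | 35  | Fever   | 5          |
| 12th        | F   | 37  | Fever   | 4          |
| Bachelors   | F   | 42  | Flu     | 6          |
| Bachelors   | F   | 44  | Heart   | 4          |
| Grad School | M   | 44  | Flu     | 4          |
| Grad School | F   | 44  | Flu/HIV | 4          |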
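The effect of this single generalization step on the group sizes can be sketched as follows. The `PARENT` mapping covers only the two values the example generalizes, and the helper name `group_sizes` is hypothetical. Only Education and Sex are considered.

```python
from collections import Counter

# Taxonomy fragment used in the example: child value -> more general value.
PARENT = {"Masters": "Grad School", "Doctorate": "Grad School"}

# (Education, Sex, number of records), copied from the raw table.
TABLE = [
    ("9th", "F", 3), ("10th", "M", 4), ("11th", "F", 5), ("12th", "F", 4),
    ("Bachelors", "F", 6), ("Bachelors", "F", 4),
    ("Masters", "M", 4), ("Masters", "F", 3), ("Doctorate", "F", 1),
]

def group_sizes(table, parent=None):
    """Count records per (Education, Sex), optionally generalizing Education."""
    parent = parent or {}
    sizes = Counter()
    for education, sex, n in table:
        sizes[(parent.get(education, education), sex)] += n
    return sizes

raw = group_sizes(TABLE)
generalized = group_sizes(TABLE, PARENT)

print("smallest group before:", min(raw.values()))
print("smallest group after: ", min(generalized.values()))
```

Merging (Masters, F) and (Doctorate, F) removes the single-record group. The (9th, F) group still holds only 3 records, so a threshold of 4 on {Education, Sex} would require further generalization.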

Problem Statement
- Anonymity in cluster analysis: given a table T, an anonymity requirement, and a taxonomy tree for each categorical attribute in ∪QID_j, generalize T to satisfy the anonymity requirement while preserving as much information as possible (cluster structure) for cluster analysis.
- We use existing k-anonymity algorithms from the literature [Sweeney 2002; Bayardo and Agrawal 2005; Fung et al. 2005, 2007; LeFevre et al. 2005].

Intuition
- The clustering goal and the privacy goal concern different kinds of information, so they need not conflict:
  - Privacy goal: mask sensitive information, usually specific descriptions that identify individuals.
  - Clustering goal: group similar items together and extract general structures that capture trends and patterns.
- Generalization eliminates outliers, but general cluster structures can be preserved. If generalization is performed carefully, identifying information can be masked while trends and patterns are still preserved for clustering.

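Challenges
- What exactly are the cluster structures?
- What information should we preserve?
- Our previous work [Fung et al. 2005] addressed the problem of anonymity for classification analysis.

Before generalization:

| Education | Sex | Age | Class | # of Recs. |
|-----------|-----|-----|-------|------------|
| 9th       | F   | 30  | 0G3B  | 3          |
| 10th      | M   | 32  | 0G4B  | 4          |
| 11th      | F   | 35  | 2G3B  | 5          |
| 12th      | F   | 37  | 3G1B  | 4          |
| Bachelors | F   | 42  | 4G2B  | 6          |
| Bachelors | F   | 44  | 4G0B  | 4          |
| Masters   | M   | 44  | 4G0B  | 4          |
| Masters   | F   | 44  | 3G0B  | 3          |
| Doctorate | F   | 44  | 1G0B  | 1          |

After generalizing Masters and Doctorate to Grad School:

| Education   | Sex | Age | Class | # of Recs. |
|-------------|-----|-----|-------|------------|
| 9th         | F   | 30  | 0G3B  | 3          |
| 10th        | M   | 32  | 0G4B  | 4          |
| 11th        | F   | 35  | 2G3B  | 5          |
| 12th        | F   | 37  | 3G1B  | 4          |
| Bachelors   | F   | 42  | 4G2B  | 6          |
| Bachelors   | F   | 44  | 4G0B  | 4          |
| Grad School | M   | 44  | 4G0B  | 4          |
| Grad School | F   | 44  | 4G0B  | 4          |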

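The Framework: Convert the Problem
(Figure: workflow from the data owner's raw table to the table released to the data user.)
- Step 1, Clustering & Labeling: apply a clustering algorithm to the raw table to obtain the raw labeled table.
- Step 2, Generalizing: apply Top-Down Specialization (TDS) to obtain the generalized labeled table.
- Step 3, Clustering & Labeling, then Comparing Cluster Structures: apply a clustering algorithm to the generalized table and compare the resulting cluster structure with the raw one using the F-measure.
- Step 4, Release: the data owner releases the generalized table to the data user.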

Algorithm: Top-Down Specialization (TDS)
Initialize every value in T to the topmost value.
Initialize Cut_i to include the topmost value.
while some x ∈ ∪Cut_i is valid do
    Find the Best specialization, the one with the highest Score in ∪Cut_i.
    Perform the Best specialization on T and update ∪Cut_i.
    Update Score(x) and validity for x ∈ ∪Cut_i.
end while
return generalized T and ∪Cut_i.

Example taxonomy for Age: ANY [1-99) specializes into [1-37) and [37-99).
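The loop can be sketched in Python on the running example. This is an illustrative sketch under several assumptions, not the authors' implementation. The taxonomy trees in `TAXONOMY` are hypothetical, only one QID = {Education, Sex} is used (Age is left out), and the Class column is read as class counts (for instance, "2G3B" as 2 records of class G and 3 of class B). The score is assumed to be entropy-based information gain divided by (anonymity loss + 1). Scores are recomputed from scratch on every pass rather than updated incrementally.

```python
from collections import Counter
from math import log2

# Assumed taxonomy trees (parent -> children); not given in the slides.
TAXONOMY = {
    "ANY_EDU": ["Secondary", "University"],
    "Secondary": ["Junior Sec", "Senior Sec"],
    "Junior Sec": ["9th", "10th"],
    "Senior Sec": ["11th", "12th"],
    "University": ["Bachelors", "Grad School"],
    "Grad School": ["Masters", "Doctorate"],
    "ANY_SEX": ["M", "F"],
}
ROOTS = {"Education": "ANY_EDU", "Sex": "ANY_SEX"}
ATTRS = list(ROOTS)

# (Education, Sex, #G, #B) from the example table, expanded to one row per record.
TABLE = [("9th", "F", 0, 3), ("10th", "M", 0, 4), ("11th", "F", 2, 3),
         ("12th", "F", 3, 1), ("Bachelors", "F", 4, 2), ("Bachelors", "F", 4, 0),
         ("Masters", "M", 4, 0), ("Masters", "F", 3, 0), ("Doctorate", "F", 1, 0)]
ROWS = [{"Education": e, "Sex": s, "Class": c}
        for e, s, g, b in TABLE for c in "G" * g + "B" * b]

def leaves(v):
    """Raw values covered by taxonomy node v."""
    kids = TAXONOMY.get(v)
    return {v} if not kids else set().union(*(leaves(c) for c in kids))

def view(rows, cut):
    """Generalize every row to the current cut; returns (qid, class) pairs."""
    return [(tuple(next(v for v in cut[a] if r[a] in leaves(v)) for a in ATTRS),
             r["Class"]) for r in rows]

def anonymity(gen):
    """A(QID): size of the smallest group of rows sharing one qid value."""
    return min(Counter(q for q, _ in gen).values())

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def specialize(cut, attr, v):
    """Return a new cut in which v is replaced by its children."""
    new = {a: set(vals) for a, vals in cut.items()}
    new[attr] = (new[attr] - {v}) | set(TAXONOMY[v])
    return new

def score(rows, cut, attr, v):
    """Assumed score: information gain / (anonymity loss + 1)."""
    i = ATTRS.index(attr)
    before, after = view(rows, cut), view(rows, specialize(cut, attr, v))
    hit = [n for n, (q, _) in enumerate(before) if q[i] == v]
    if not hit:
        return 0.0
    groups = {}
    for n in hit:
        groups.setdefault(after[n][0][i], []).append(after[n][1])
    gain = entropy([before[n][1] for n in hit]) - sum(
        len(g) / len(hit) * entropy(g) for g in groups.values())
    return gain / (anonymity(before) - anonymity(after) + 1)

def tds(rows, k):
    cut = {a: {root} for a, root in ROOTS.items()}  # start fully generalized
    while True:
        # A specialization is valid if v has children and the result keeps A(QID) >= k.
        valid = [(score(rows, cut, a, v), a, v)
                 for a in ATTRS for v in cut[a]
                 if v in TAXONOMY
                 and anonymity(view(rows, specialize(cut, a, v))) >= k]
        if not valid:
            return cut
        _, a, v = max(valid)  # Best: the valid specialization with the highest score
        cut = specialize(cut, a, v)

CUT = tds(ROWS, k=4)
print(CUT, "A(QID) =", anonymity(view(ROWS, CUT)))
```

Because the search starts from the most general table and only accepts specializations that keep A(QID) at or above k, every intermediate table already satisfies the requirement, so the loop can stop at any point.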

Search Criteria: Score
- Consider a specialization v → child(v). To heuristically maximize the information of the generalized data while achieving a given anonymity, we favor the specialization on v that has the maximum information gain per unit of privacy loss.
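The transcript does not include the formula itself. In the TDS work cited in these slides [Fung et al. 2005], "information gain per unit of privacy loss" is expressed roughly as follows. The notation is adapted to the QID terminology used here, and it is an assumption that the slide showed exactly this form. $R_v$ denotes the records generalized to $v$, $I(\cdot)$ is the entropy of the class labels, and $A_v(QID_j)$ is the anonymity after specializing $v$.

```latex
\mathrm{Score}(v) = \frac{\mathrm{InfoGain}(v)}{\mathrm{AnonyLoss}(v) + 1}

\mathrm{InfoGain}(v) = I(R_v) - \sum_{c \in \mathrm{child}(v)} \frac{|R_c|}{|R_v|}\, I(R_c)

\mathrm{AnonyLoss}(v) = \operatorname{avg}_j \bigl\{ A(QID_j) - A_v(QID_j) \bigr\}
```

Adding 1 to the denominator keeps the score defined when a specialization causes no loss of anonymity.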

Experimental Evaluation
- Objectives:
  - Evaluate the information loss (in terms of cluster quality) caused by generalization. This is the cost of achieving anonymity.
  - Evaluate the information gain (in terms of cluster quality) compared to existing k-anonymization algorithms that do not focus on preserving cluster structures. This is the benefit of using our method.
- Data set: the de facto benchmark, the Adult data set
  - US census data
  - 45,222 records (each record represents one US resident)

Experimental Evaluation (cont.)
- Cost = 1 − clusterFM (the loss in cluster structure)
- Benefit = clusterFM − distortFM
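The two measures can be illustrated with a small script. This sketch assumes that clusterFM and distortFM are overall clustering F-measures computed against the clusters found on the raw data, using one common definition: each reference cluster is matched to its best-scoring candidate cluster and weighted by its size. The slides do not spell out the definition, and the three label vectors are made-up toy data.

```python
from collections import Counter

def cluster_f_measure(reference, candidate):
    """Overall F-measure of `candidate` cluster labels against `reference` labels."""
    n = len(reference)
    ref_sizes = Counter(reference)
    cand_sizes = Counter(candidate)
    overlap = Counter(zip(reference, candidate))
    total = 0.0
    for c, c_size in ref_sizes.items():
        best = 0.0
        for k, k_size in cand_sizes.items():
            shared = overlap[(c, k)]
            if shared:
                precision = shared / k_size
                recall = shared / c_size
                best = max(best, 2 * precision * recall / (precision + recall))
        total += c_size / n * best  # weight each reference cluster by its size
    return total

# Toy cluster labels for six records (made up for illustration).
raw = [0, 0, 0, 1, 1, 1]         # clusters found on the raw table
ours = [0, 0, 1, 1, 1, 1]        # clusters found after cluster-aware generalization
distortion = [0, 1, 0, 1, 0, 1]  # clusters found after a general-purpose k-anonymization

cluster_fm = cluster_f_measure(raw, ours)
distort_fm = cluster_f_measure(raw, distortion)
cost = 1 - cluster_fm
benefit = cluster_fm - distort_fm
print(f"clusterFM={cluster_fm:.3f} distortFM={distort_fm:.3f} "
      f"cost={cost:.3f} benefit={benefit:.3f}")
```

The measure depends only on how records are grouped, not on the label names, so it can compare clusterings whose cluster identifiers do not line up.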

Related Work
- [Sweeney 2002] employed bottom-up generalization to achieve k-anonymity.
  - Single QID. Does not consider the specific use of the data.
- [Iyengar 2002] proposed a genetic algorithm (GA) to address the problem of anonymity for classification.
  - Single QID.
  - GA needs 18 hours to generalize 45,000 records.
- [Fung et al. 2005] proposed an efficient top-down specialization method for the problem of anonymity for classification.
  - TDS needs only 7 seconds to generalize the same set of records (with comparable classification accuracy).

Conclusion
- Quality clustering and privacy preservation can coexist.
- We presented an effective top-down method that iteratively specializes the data, guided by maximizing information utility and minimizing privacy specificity.
- The method is applicable to both public and private sectors that share information for mutual benefit.
