Privacy-Preserving Data Mashup — Benjamin C.M. Fung and Noman Mohammed, Concordia University, Montreal, QC, Canada


Privacy-Preserving Data Mashup. Benjamin C.M. Fung, Concordia University, Montreal, QC, Canada; Noman Mohammed, Concordia University, Montreal, QC, Canada; Ke Wang, Simon Fraser University, Burnaby, BC, Canada; Patrick C. K. Hung, UOIT, Oshawa, ON, Canada. EDBT 2009.

Outline
- Problem: Private Data Mashup
- Our solution: Top-Down Specialization for multiple parties
- Experimental results
- Related works
- Conclusions

Motivation
Mashup is a web technology that combines information and services from more than one source into a single web application. Data mashup is a special type of mashup application that aims to integrate data from multiple data providers depending on the user's service request. This research problem was discovered in a collaborative project with a financial institute in Sweden.

Architecture of Data Mashup
The data mashup application dynamically determines the data providers, collects information from them, and then integrates the collected information to fulfill the service request.

Scenario
Suppose a loan company A and a bank B observe different sets of attributes about the same set of individuals, identified by the common key SSN, e.g.,
  T_A(SSN, Age, Sex, Balance)
  T_B(SSN, Job, Salary)
These companies want to integrate their data to support better decision making, such as loan or credit card limit approval. The integrated data can also be shared with a third-party company C (a credit card company).

Scenario
After integrating the two tables (by matching on the SSN field), the "female lawyer" record becomes unique and is therefore vulnerable to being linked to sensitive information such as Salary.

Problem: Private Data Mashup
Given two private tables for the same set of records on different sets of attributes, we want to produce an integrated table on all attributes for release to both parties. The integrated table must satisfy the following two requirements:
- Classification requirement: the integrated data is as useful as possible for classification analysis.
- Privacy requirement: given a specified subset of attributes called a quasi-identifier (QID), each value of the quasi-identifier must identify at least k records [5]. Moreover, at no time during the integration/generalization should either party learn more detailed information about the other party than what appears in the final integrated table.

Example: k-anonymity
QID_1 = {Sex, Job}, k_1 = 4

Sex  Job         Salary  Class   # of Recs.  a(qid_1)
M    Janitor     30K     0Y3N    3           3
M    Mover       32K     0Y4N    4           4
M    Carpenter   35K     2Y3N    5           5
F    Technician  37K     3Y1N    4           4
F    Manager     42K     4Y2N    6           9
F    Manager     44K     3Y0N    3
M    Accountant  44K     3Y0N    3           3
F    Accountant  44K     3Y0N    3           3
M    Lawyer      44K     2Y0N    2           2
F    Lawyer      44K     1Y0N    1           1
                         Total:  34

Minimum a(qid_1) = 1
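The a(qid_1) column can be recomputed by grouping on the QID attributes. A minimal sketch in Python, with the table's rows hard-coded and helper names of my own choosing:

```python
from collections import Counter

# (Sex, Job, # of Recs.) rows from the table above
rows = [
    ("M", "Janitor", 3), ("M", "Mover", 4), ("M", "Carpenter", 5),
    ("F", "Technician", 4), ("F", "Manager", 6), ("F", "Manager", 3),
    ("M", "Accountant", 3), ("F", "Accountant", 3),
    ("M", "Lawyer", 2), ("F", "Lawyer", 1),
]

def anonymity_counts(rows):
    """a(qid): total number of records sharing each (Sex, Job) value."""
    a = Counter()
    for sex, job, n in rows:
        a[(sex, job)] += n
    return a

a_qid = anonymity_counts(rows)
print(min(a_qid.values()))  # 1 -- the lone female lawyer violates k_1 = 4
```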

Generalization

Before generalization:
Sex  Job         …  Class  a(qid_1)
M    Janitor        0Y3N   3
M    Mover          0Y4N   4
M    Carpenter      2Y3N   5
F    Technician     3Y1N   4
F    Manager        4Y2N   9
F    Manager        3Y0N
M    Accountant     3Y0N   3
F    Accountant     3Y0N   3
M    Lawyer         2Y0N   2
F    Lawyer         1Y0N   1

After generalizing Accountant and Lawyer to Professional:
Sex  Job           …  Class  a(qid_1)
M    Janitor          0Y3N   3
M    Mover            0Y4N   4
M    Carpenter        2Y3N   5
F    Technician       3Y1N   4
F    Manager          4Y2N   9
F    Manager          3Y0N
M    Professional     5Y0N   5
F    Professional     4Y0N   4
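The merge above (Accountant and Lawyer rolled up to Professional) amounts to a one-level taxonomy lookup followed by regrouping. A minimal sketch, with illustrative names:

```python
from collections import Counter

# One-level taxonomy: both specific jobs generalize to Professional
TAXONOMY = {"Accountant": "Professional", "Lawyer": "Professional"}

# (Sex, Job, # of Recs.) rows affected by the generalization
rows = [
    ("M", "Accountant", 3), ("F", "Accountant", 3),
    ("M", "Lawyer", 2), ("F", "Lawyer", 1),
]

def generalize(rows, taxonomy):
    """Replace each job by its taxonomy parent, merging the groups."""
    merged = Counter()
    for sex, job, n in rows:
        merged[(sex, taxonomy.get(job, job))] += n
    return merged

merged = generalize(rows, TAXONOMY)
# (M, Professional) now covers 5 records and (F, Professional) covers 4,
# so both groups satisfy k_1 = 4
```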

Intuition
1. The classification goal and the privacy goal have no inherent conflict:
   - Privacy goal: mask sensitive information, usually specific descriptions that identify individuals.
   - Classification goal: extract general structures that capture trends and patterns.
2. A table contains multiple classification structures. Generalization destroys some classification structures, but other structures emerge to help. If generalization is performed carefully, identifying information can be masked while trends and patterns are still preserved for classification.

Two simple but incorrect approaches
1. Generalize-then-integrate: first generalize each table locally, then integrate the generalized tables.
   - Does not work for a QID that spans the two tables.
2. Integrate-then-generalize: first integrate the two tables, then generalize the integrated table using a single-table method, such as Iyengar's genetic algorithm [10] or Fung et al.'s Top-Down Specialization [8].
   - Any party holding the integrated table immediately knows all private information of both parties, which violates our privacy requirement.

Algorithm: Top-Down Specialization (TDS) for a Single Party

  Initialize every value in T to the topmost value.
  Initialize Cut_i to include the topmost value.
  while there is some candidate in ∪Cut_i do
      Find the Winner specialization of the highest Score.
      Perform the Winner specialization on T.
      Update Cut_i and Score(x) in ∪Cut_i.
  end while
  return Generalized T and ∪Cut_i.

Taxonomy tree for Salary: ANY [1-99), with children [1-37) and [37-99).
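The loop above can be sketched generically. The scoring and anonymity checks are passed in as callables because the slide only names them; this is a sketch, not the authors' implementation:

```python
def top_down_specialize(cut, children, score, keeps_anonymity):
    """Greedy TDS: start from the most general cut and repeatedly
    perform the highest-scoring specialization that keeps the table
    anonymous.  `cut` is the set of currently released values and
    `children[v]` lists v's child values in the taxonomy tree."""
    while True:
        candidates = [v for v in cut
                      if children.get(v) and keeps_anonymity(v)]
        if not candidates:
            return cut
        winner = max(candidates, key=score)
        cut.remove(winner)               # specialize: replace the parent
        cut |= set(children[winner])     # value by its child values

# Toy run on the Salary taxonomy from the slide:
children = {"ANY [1-99)": ["[1-37)", "[37-99)"]}
final_cut = top_down_specialize({"ANY [1-99)"}, children,
                                score=lambda v: 1.0,
                                keeps_anonymity=lambda v: True)
print(sorted(final_cut))  # ['[1-37)', '[37-99)']
```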

Algorithm: Privacy-Preserving Data Mashup for 2 Parties (PPMashup)

  Initialize every value in T_A to the topmost value.
  Initialize Cut_i to include the topmost value.
  while there is some candidate in ∪Cut_i do
      Find the local candidate x of the highest Score(x).
      Communicate Score(x) with Party B to find the winner.
      if the winner w is local then
          Specialize w on T_A.
          Instruct Party B to specialize w.
      else
          Wait for the instruction from Party B.
          Specialize w on T_A using the instruction.
      end if
      Update the local copy of Cut_i.
      Update Score(x) in ∪Cut_i.
  end while
  return Generalized T_A and ∪Cut_i.
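The key privacy property of the exchange above is that each party reveals only a candidate and its score, never raw records. A minimal model of one winner-selection round (function and variable names are my own):

```python
def select_winner(scores_a, scores_b):
    """One round of the winner selection in the protocol above: each
    party proposes only its best local (candidate, Score) pair; the
    globally higher score wins, and only the owning party performs
    the specialization, instructing the other how to partition."""
    best_a = max(scores_a.items(), key=lambda kv: kv[1])
    best_b = max(scores_b.items(), key=lambda kv: kv[1])
    return ("A",) + best_a if best_a[1] >= best_b[1] else ("B",) + best_b

# Party A's best local candidate is on Salary, B's is on Job:
owner, candidate, score = select_winner({"Sex": 0.2, "Salary": 0.5},
                                        {"Job": 0.9})
print(owner, candidate)  # B Job
```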

Search Criteria: Score
Consider a specialization v → child(v). To heuristically maximize the information of the generalized data while achieving a given anonymity, we favor the specialization on v that has the maximum information gain per unit of anonymity loss.
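The formula itself did not survive the transcript; in Fung et al.'s TDS paper [8], which this algorithm builds on, the score is defined as (reconstructed from that paper, so treat the exact form as an assumption):

```latex
\mathrm{Score}(v) \;=\; \frac{\mathrm{InfoGain}(v)}{\mathrm{AnonyLoss}(v) + 1}
```

where InfoGain(v) is the entropy reduction of splitting R_v into its child partitions, and AnonyLoss(v) is the average drop in anonymity over the QIDs containing the attribute of v.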

Search Criteria: Score
R_v denotes the set of records having value v before the specialization. R_c denotes the set of records having value c after the specialization, where c ∈ child(v). I(R_x) is the entropy of R_x, where freq(R_x, cls) is the number of records in R_x having the class cls. Intuitively, I(R_x) measures the impurity of classes for the data records in R_x. A good specialization reduces the impurity of classes.
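The entropy formula was an image in the original slide; the standard definition it describes, I(R_x) = -Σ_cls (freq(R_x, cls)/|R_x|) · log2(freq(R_x, cls)/|R_x|), can be sketched as:

```python
import math

def entropy(class_counts):
    """I(R_x): class impurity of a record set, computed from its
    per-class frequencies freq(R_x, cls)."""
    total = sum(class_counts)
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts if n > 0)

def info_gain(parent_counts, child_counts_list):
    """Entropy reduction when R_v is split into its child partitions:
    I(R_v) - sum over c in child(v) of |R_c|/|R_v| * I(R_c)."""
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(
        (sum(c) / total) * entropy(c) for c in child_counts_list)

print(entropy([5, 5]))                      # 1.0 -- maximally impure
print(info_gain([3, 3], [[3, 0], [0, 3]]))  # 1.0 -- a perfect split
```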

Perform the Winner Specialization
To perform the Winner specialization w → child(w), we need to retrieve R_w, the set of data records containing the value Winner. Taxonomy Indexed PartitionS (TIPS) is a tree structure in which each node represents a generalized record over ∪QID_j, and each child node represents a specialization of its parent node on exactly one attribute. Stored with each leaf node is the set of data records having the same generalized record.

Example with QID_1 = {Sex, Job}, QID_2 = {Job, Salary}, using the Salary taxonomy ANY [1-99) → [1-37), [37-99):

  ⟨ANY_Sex, ANY_Job, [1-99)⟩                  34 recs.
    ├─ ⟨ANY_Sex, ANY_Job, [1-37)⟩             12 recs. (IDs 1-12)
    │    └─ ⟨ANY_Sex, Blue-collar, [1-37)⟩    12 recs.
    └─ ⟨ANY_Sex, ANY_Job, [37-99)⟩            22 recs. (IDs 13-34)
         ├─ ⟨ANY_Sex, Blue-collar, [37-99)⟩    4 recs.
         └─ ⟨ANY_Sex, White-collar, [37-99)⟩  18 recs.

Links (e.g., Link_[37-99), Link_ANY_Sex) connect the leaf partitions sharing the same generalized value.
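A minimal data structure matching this description, assuming each node stores its generalized record and leaves keep record IDs (class and method names are my own):

```python
from dataclasses import dataclass, field

@dataclass
class TipsNode:
    """A node of the Taxonomy Indexed PartitionS tree: a generalized
    record over the union of the QIDs; leaf nodes hold the IDs of the
    raw records matching that generalized record."""
    values: dict                          # attribute -> generalized value
    record_ids: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def specialize(self, attr, partitions):
        """Split this leaf on one attribute; `partitions` maps each
        child value to the record IDs that fall under it."""
        for child_value, ids in partitions.items():
            self.children.append(
                TipsNode({**self.values, attr: child_value}, list(ids)))
        self.record_ids = []              # the records now live in leaves
        return self.children

# Root covers all 34 records; specialize Salary as in the figure:
root = TipsNode({"Sex": "ANY_Sex", "Job": "ANY_Job", "Salary": "[1-99)"},
                list(range(1, 35)))
low, high = root.specialize("Salary", {"[1-37)": range(1, 13),
                                       "[37-99)": range(13, 35)})
print(len(low.record_ids), len(high.record_ids))  # 12 22
```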

Practical Features of PPMashup
- Handling multiple QIDs
  - Treating all QIDs as a single QID leads to over-generalization.
  - QIDs may span the two parties.
- Handling both categorical and continuous attributes
  - Dynamically generate a taxonomy tree for each continuous attribute.
- Anytime solution
  - Determine a desired trade-off between privacy and accuracy.
  - Stop at any time and obtain a generalized table satisfying the anonymity requirement. A bottom-up approach does not support this feature.
- Scalable computation

Experimental Evaluation
Data quality & efficiency
- A broad range of anonymity requirements.
- Used the C4.5 classifier [11].
Adult data set
- Used in Iyengar [10].
- Census data.
- 6 continuous attributes.
- 8 categorical attributes.
- Two classes.
- Separate sets of records for training and for testing.

Data Quality
Include the TopN most important attributes in a single QID, which is more restrictive than breaking them into multiple QIDs.

Comparing with the Genetic Algorithm
Our method is comparable to the genetic algorithm in terms of data quality.

Efficiency and Scalability
Took at most 20 seconds for all previous experiments. To test scalability, we replicate the Adult data set and substitute some random data.

Related Works
- Sweeney: achieves k-anonymity by generalization [6, 7].
- Fung et al. [8], Wang et al. [9], Iyengar [10], and LeFevre et al. [12]: consider anonymity for classification on a single data source.
- Jiang and Clifton [13]: integrate two private databases, but without aiming to preserve classification analysis.

Conclusions
- We studied private data mashup of multiple databases for the purpose of a joint classification analysis.
- We formalized this problem as achieving k-anonymity on the integrated data without revealing more detailed information in the process.
- Quality classification and privacy preservation can coexist.
- Allows data sharing instead of only result sharing.
- Great applicability to both public and private sectors that share information for mutual benefits.

References
1. The House of Commons in Canada: The Personal Information Protection and Electronic Documents Act (2000)
2. Agrawal, R., Evfimievski, A., Srikant, R.: Information sharing across private databases. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California (2003)
3. Yao, A.C.: Protocols for secure computations. In: Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science (1982)
4. Liang, G., Chawathe, S.S.: Privacy-preserving inter-database operations. In: Proceedings of the 2nd Symposium on Intelligence and Security Informatics (2004)
5. Dalenius, T.: Finding a needle in a haystack - or identifying anonymous census records. Journal of Official Statistics 2 (1986)
6. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems 10 (2002)
7. Hundepool, A., Willenborg, L.: μ- and τ-argus: software for statistical disclosure control. In: Third International Seminar on Statistical Confidentiality, Bled (1996)
8. Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: Proceedings of the 21st IEEE International Conference on Data Engineering, Tokyo, Japan (2005)

References
9. Wang, K., Yu, P., Chakraborty, S.: Bottom-up generalization: a data mining solution to privacy protection. In: Proceedings of the 4th IEEE International Conference on Data Mining (2004)
10. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada (2002)
11. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
12. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Workload-aware anonymization. In: Proceedings of the 12th ACM SIGKDD, Philadelphia, PA (2006)
13. Jiang, W., Clifton, C.: A secure distributed framework for achieving k-anonymity. Very Large Data Bases Journal (VLDBJ) 15(4), November (2006)

Thank you!