Anonymizing Sequential Releases. ACM SIGKDD 2006. Benjamin C. M. Fung and Ke Wang, Simon Fraser University.

2 Motivation: Sequential Releases
Previous work addresses a single release only, but data are typically released sequentially, in multiple versions:
– New information becomes available over time.
– A view is tailored for each data-sharing purpose.
– Sensitive and identifying information are released separately.

3 T2: Previous Release
  Pid  Job       Disease
  1    Banker    Cancer
  2    Banker    Cancer
  3    Clerk     HIV
  4    Driver    Cancer
  5    Engineer  HIV

T1: Current Release
  Pid  Name   Job       Class
  1    Alice  Banker    c1
  2    Alice  Banker    c1
  3    Bob    Clerk     c2
  4    Bob    Driver    c3
  5    Cathy  Engineer  c4

The join on T1.Job = T2.Job
  Pid  Name   Job       Disease  Class
  1    Alice  Banker    Cancer   c1
  2    Alice  Banker    Cancer   c1
  3    Bob    Clerk     HIV      c2
  4    Bob    Driver    Cancer   c3
  5    Cathy  Engineer  HIV      c4
  -    Alice  Banker    Cancer   c1
  -    Alice  Banker    Cancer   c1

We do not want Name to be linked to Disease in the join of the two releases.

4 (Same T1, T2, and join as on slide 3.)
The join sharpens identification: {Bob, HIV} has group size 1.

5 (Same tables as slide 3.)
The join weakens identification: {Alice, Cancer} has group size 4. A lossy join combats the join attack.

6 (Same tables as slide 3.)
The join enables inferences across tables: Alice → Cancer holds with 100% confidence.
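The join attack of slides 3 through 6 can be reproduced in a few lines of pandas. This is a minimal illustrative sketch, not code from the paper; the column names mirror the example tables:

```python
# Sketch: reproducing the join attack from slides 3-6 with pandas.
# Illustration only -- not the paper's code.
import pandas as pd

t1 = pd.DataFrame({  # current release: identifying information
    "Name": ["Alice", "Alice", "Bob", "Bob", "Cathy"],
    "Job":  ["Banker", "Banker", "Clerk", "Driver", "Engineer"],
})
t2 = pd.DataFrame({  # previous release: sensitive information
    "Job":     ["Banker", "Banker", "Clerk", "Driver", "Engineer"],
    "Disease": ["Cancer", "Cancer", "HIV", "Cancer", "HIV"],
})

join = t1.merge(t2, on="Job")  # the attacker's join on Job

# {Bob, HIV}: group size 1 -- identification is sharpened.
print(len(join[(join.Name == "Bob") & (join.Disease == "HIV")]))  # 1

# Alice -> Cancer with 100% confidence: every joined Alice row has Cancer.
alice = join[join.Name == "Alice"]
print((alice.Disease == "Cancer").mean())  # 1.0
```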

7 Related Work
k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]
– Quasi-identifier (QID): e.g., {Job, Birth date, Zip}.
– The database is made anonymous with respect to its local QID.
– In sequential releases, the database must be made anonymous with respect to a global QID spanning the join of all releases thus far.
(Diagram: a table's attributes are partitioned into the explicit ID, which is removed; the QID, which is anonymized to groups of size ≥ k; and the sensitive attributes.)

8 Related Work
l-diversity [MGK06]
– Sensitive values are "well-represented" in each QID group (measured by entropy).
Confidence limiting [WFY05, WFY06]: qid → s with confidence < h, where qid is a QID group and s is a sensitive value.

9 Related Work
View releases
– T1 and T2 are two views in one release; both can be modified before the release.
– [MW04, DP05] measure the information disclosure of a view set with respect to a secret view.
– [YWJ05, KG06] detect privacy violations by a view set over a base table.
– These works detect, but do not eliminate, violations.

10 Sequential Release
Sequential release:
– Current release T1; previous release T2.
– T1 was unknown when T2 was released.
– T2 cannot be modified when T1 is released.
Solution #1: k-anonymize all attributes in T1 – excessive distortion.
Solution #2: generalize T1 based on T2 – distortion only accumulates in later releases.
Solution #3: anonymize a complete cohort of all potential releases at one time – must predict all future releases.

11 Intuition of Our Approach
A lossy join hides the true join relationship, crippling a global QID.
Generalize T1 so that the join with T2 becomes lossy enough to disorient the attacker.
Two general privacy notions: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.

12 (X,Y)-Privacy
k-anonymity: the number of distinct records in each QID group is ≥ k.
(X,Y)-anonymity: the number of distinct Y values in each X group is ≥ k.
(X,Y)-linkability: the maximum confidence of inferring a Y value from an X value is ≤ k.
These notions generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06]. (See the sketch below.)
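A minimal sketch of both checks on a flat table, assuming rows are dicts, X and Y are attribute-name lists, and k is the anonymity threshold (an integer) or the linkability cap (a fraction). This is an illustration, not the paper's implementation:

```python
# Sketch: checking (X,Y)-anonymity and (X,Y)-linkability on a flat table.
# Illustration only; the row encoding is an assumption.
from collections import defaultdict

def xy_anonymity(rows, X, Y, k):
    """Every X group must be linked to at least k distinct Y values."""
    groups = defaultdict(set)
    for r in rows:
        x = tuple(r[a] for a in X)
        groups[x].add(tuple(r[a] for a in Y))
    return all(len(ys) >= k for ys in groups.values())

def xy_linkability(rows, X, Y, k):
    """No Y value may be inferred from an X group with confidence > k."""
    x_count = defaultdict(int)
    xy_count = defaultdict(int)
    for r in rows:
        x = tuple(r[a] for a in X)
        y = tuple(r[a] for a in Y)
        x_count[x] += 1
        xy_count[(x, y)] += 1
    return all(c / x_count[x] <= k for (x, _), c in xy_count.items())
```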

13 Example: (X,Y)-Anonymity
  Pid  Job     Zip  PoB     Test
  1    Banker  123  Canada  HIV
  1    Banker  123  Canada  Diabetes
  1    Banker  123  Canada  Eye
  2    Clerk   456  Japan   HIV
  2    Clerk   456  Japan   Diabetes
  2    Clerk   456  Japan   Eye
  2    Clerk   456  Japan   Heart
k-anonymity counts records, so it fails to ensure k distinct patients.

14 Example: (X,Y)-Anonymity
Anonymity with respect to patients (instead of records):
– X = {Job, Zip, PoB} and Y = Pid.
– Each X group is linked to at least k distinct values of Pid.
Anonymity with respect to tests:
– X = {Job, Zip, PoB} and Y = Test.
– Each X group is linked to at least k distinct tests.

15 Example: (X,Y)-Linkability
  Pid  Job     Zip  PoB     Test
  1    Banker  123  Canada  HIV
  2    Banker  123  Canada  HIV
  3    Banker  123  Canada  HIV
  4    Banker  123  Canada  Diabetes
  5    Clerk   456  Japan   Diabetes
  6    Clerk   456  Japan   Diabetes
{Banker, 123, Canada} → HIV with 75% confidence. With Y = Test, (X,Y)-linkability states that no test can be inferred from an X group with confidence above a given threshold.
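Running the linkability check from the slide 12 sketch on this table reproduces the 75% figure; again an illustration, reusing xy_linkability from above:

```python
# The six rows of slide 15's table (Pid omitted; it is not in X or Y).
rows = (
    [{"Job": "Banker", "Zip": 123, "PoB": "Canada", "Test": "HIV"}] * 3
    + [{"Job": "Banker", "Zip": 123, "PoB": "Canada", "Test": "Diabetes"}]
    + [{"Job": "Clerk", "Zip": 456, "PoB": "Japan", "Test": "Diabetes"}] * 2
)
X, Y = ["Job", "Zip", "PoB"], ["Test"]

banker = [r for r in rows if r["Job"] == "Banker"]
conf = sum(r["Test"] == "HIV" for r in banker) / len(banker)
print(conf)  # 0.75: {Banker, 123, Canada} -> HIV

# The table as a whole still violates a 75% cap: the Clerk group
# links to Diabetes with 100% confidence.
print(xy_linkability(rows, X, Y, k=0.75))  # False
```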

16 Problem Statement
The data holder made a previous release T2 and now makes the current release T1, where T1 and T2 are projections of the same underlying table.
We want to ensure (X,Y)-privacy on the join of T1 and T2, where X and Y are attribute sets on the join.
Sequential anonymization: generalize T1 on X ∩ att(T1) so that the join satisfies (X,Y)-privacy while T1 remains as useful as possible.

17 Generalization / Specialization
Each generalization replaces all child values with their parent value.
– A cut contains exactly one value on every root-to-leaf path.
Alternatively, each specialization replaces the parent value with the consistent child value in each record.
Taxonomy for Job:
  ANY
    Professional
      Engineer
      Lawyer
    Admin
      Banker
      Clerk
(See the sketch below.)
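A minimal sketch of a taxonomy cut and one specialization step, using the Job taxonomy above; the class and function names are illustrative assumptions, not the paper's data structures:

```python
# Sketch: a taxonomy over Job and one specialization step on a cut.
# Illustration only.
class TaxNode:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)
        for c in self.children:
            c.parent = self

job = TaxNode("ANY", [
    TaxNode("Professional", [TaxNode("Engineer"), TaxNode("Lawyer")]),
    TaxNode("Admin", [TaxNode("Banker"), TaxNode("Clerk")]),
])

def specialize(cut, node):
    """Replace a parent value in the cut with all of its child values."""
    return (cut - {node}) | set(node.children)

cut = {job}                  # fully generalized: every Job value is ANY
cut = specialize(cut, job)   # cut is now {Professional, Admin}
print(sorted(n.value for n in cut))  # ['Admin', 'Professional']
```

Each cut still covers every root-to-leaf path exactly once, matching the definition above.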

18 Match Function
The attacker applies prior knowledge to match the records in T1 and T2, so the data holder applies the same prior knowledge during sequential anonymization.
We consider the following prior knowledge:
– the schema information of T1 and T2,
– the taxonomies for the attributes,
– the inclusion-exclusion principle.

19 Match Function
Let t1 ∈ T1 and t2 ∈ T2.
Inclusion predicate: t1.A matches t2.A if they are on the same generalization path for attribute A.
– e.g., Male matches Single Male.
Exclusion predicate: t1.A matches t2.B only if they are not semantically inconsistent (based on common sense).
– This excludes impossible matches.
– e.g., Male and Pregnant are semantically inconsistent, and so are Married Male and 6-Month Pregnant.
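A minimal sketch of this match function; the ancestor test and the inconsistency list are assumptions for illustration:

```python
# Sketch: the inclusion/exclusion match predicate. Illustration only.

def on_same_path(a, b, parent):
    """Inclusion: a matches b if one is an ancestor of the other in the
    attribute's taxonomy; parent maps a value to its parent value."""
    def ancestors(v):
        seen = {v}
        while v in parent:
            v = parent[v]
            seen.add(v)
        return seen
    return b in ancestors(a) or a in ancestors(b)

# Exclusion: an assumed common-sense list of inconsistent value pairs.
INCONSISTENT = {frozenset({"Male", "Pregnant"})}

def match(v1, v2, parent):
    if frozenset({v1, v2}) in INCONSISTENT:
        return False                          # exclusion predicate
    return on_same_path(v1, v2, parent)       # inclusion predicate

parent = {"Single Male": "Male", "Married Male": "Male", "Male": "ANY"}
print(match("Male", "Single Male", parent))  # True: same generalization path
print(match("Male", "Pregnant", parent))     # False: semantically inconsistent
```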

20 Algorithm Overview: Top-Down Specialization
Input: T1, T2, an (X,Y)-privacy requirement, and a taxonomy tree for each attribute in X1 = X ∩ att(T1).
Output: a generalized T1 satisfying the privacy requirement.
1. generalize every value of Aj to ANYj, where Aj ∈ X1;
2. while there is a valid candidate in ∪Cutj do
3.   find the winner w with the highest Score(w) from ∪Cutj;
4.   specialize w on T1 and remove w from ∪Cutj;
5.   update Score(v) and the valid status for all v in ∪Cutj;
6. end while
7. output the generalized T1 and ∪Cutj;
(A runnable sketch follows.)
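A runnable skeleton of this loop, reusing the taxonomy cut from the slide 17 sketch. Score and the validity test are stubbed out as assumptions; a faithful implementation would maintain the count statistics described on slide 24:

```python
# Sketch: the top-down specialization loop. The scoring and validity
# functions are stubs supplied by the caller. Illustration only.

def top_down_specialize(cut, score, is_valid):
    """cut: set of taxonomy nodes covering every root-to-leaf path once.
    score(v): estimated InfoGain-per-PrivLoss of specializing v.
    is_valid(v): True if specializing v keeps (X,Y)-privacy satisfied."""
    while True:
        candidates = [v for v in cut if v.children and is_valid(v)]
        if not candidates:
            return cut  # the final cut defines the generalized T1
        w = max(candidates, key=score)       # winner: highest Score(w)
        cut = (cut - {w}) | set(w.children)  # specialize w on the cut
```

Because (X,Y)-privacy is anti-monotone under the conditions of Theorem 2 (slide 22), stopping at the first invalid candidate set is safe: further specialization cannot restore privacy.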

21 Anti-Monotone Privacy
Theorem 1: On a single table, (X,Y)-privacy is anti-monotone with respect to specialization on X: once violated, it remains violated after any further specialization.
On the join of T1 and T2, however, (X,Y)-privacy is not anti-monotone with respect to specialization of T1.
– Specializing T1 may create dangling records; e.g., after specializing "CA" into "LA" and "San Francisco", the "LA" records in T1 no longer match the "San Francisco" records in T2.

22 Anti-Monotone Privacy
Theorem 2: If T1 and T2 are projections of the same underlying table, then (X,Y)-privacy on the join of T1 and T2 is anti-monotone with respect to specialization of T1 on X ∩ att(T1).

23 Score Metric
Each specialization gains some information and loses some privacy; we maximize the gain per unit of loss (see the formulation below).
– InfoGain(v) is measured on T1.
– PrivLoss(v) is measured on the join of T1 and T2.
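The formula itself did not survive in the transcript; a natural formulation consistent with "gain per loss" (an assumption, the paper's exact definition may differ) is:

```latex
% Assumed greedy criterion ("gain per unit of loss"); the +1 guards
% against division by zero when a specialization loses no privacy.
\mathrm{Score}(v) = \frac{\mathrm{InfoGain}(v)}{\mathrm{PrivLoss}(v) + 1}
```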

24 Challenges
Each specialization affects the matching of the join, Score(v), and the privacy check.
– Rejoining T1 and T2 for each specialization is too expensive.
– Materializing the join is impractical because a lossy join can be very large.
Our solution: incrementally maintain some count statistics without executing the join, extending Top-Down Specialization [FWY05, WFY05].

25 Empirical Study
The Adult data set. Categorical attributes only.

26 Schema for T1 and T2
  Department        Attribute            # of Leaves  # of Levels
  Taxation (T1)     Education (E)        16           5
                    Occupation (O)       14           3
                    Work-class (W)       8            5
  Common (T1 & T2)  Marital-status (M)   7            4
                    Relationship (Re)    6            3
                    Sex (S)              2            2
  Immigration (T2)  Native-country (Nc)  40           5
                    Race (Ra)            5            3
T1 also contains the Class attribute, Income level.

27 Empirical Study
Classification metric:
– classification error on the generalized testing set of T1.
Distortion metric [SS98]:
– 1 unit of distortion for the generalization of each value in each record,
– normalized by the number of records.
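A minimal sketch of this distortion metric; the per-value cost is assumed to count taxonomy levels climbed, which is an interpretation of "1 unit per generalization" rather than the paper's exact code:

```python
# Sketch: the [SS98]-style distortion metric. Illustration only.
# levels_up(orig, gen) is an assumed helper counting how many taxonomy
# levels separate a record's original value from its generalized value.

def distortion(original_rows, generalized_rows, levels_up):
    total = 0
    for orig, gen in zip(original_rows, generalized_rows):
        for attr in orig:
            total += levels_up(orig[attr], gen[attr])  # 1 unit per step
    return total / len(original_rows)  # normalized by number of records
```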

28 (X,Y)-Anonymity
TopN attributes: the attributes most important for classification. The join attributes are the Top3 attributes.
X contains:
– the TopN attributes in T1 (to ensure that generalization is performed on important attributes),
– all join attributes,
– all attributes in T2 (to ensure X is global).

29 Distortion of (X,Y)-anonymity
Ki denotes the key in Ti.
– XYD: our method with Y = K1.
– KAD: k-anonymization with QID = att(T1).

30 Classification error of (X,Y)-anonymity
– XYE: our method with Y = K1.
– XYE(row): our method with Y = {K1, K2}.
– BLE: the unmodified data.
– KAE: k-anonymization with QID = att(T1).
– RJE: removing all join attributes from T1.

31 (X,Y)-Linkability
Y contains the TopN attributes.
– If attributes are not important, simply remove them.
X contains the rest of the attributes in T1 and T2.
We focus on classification error because no previous work studies distortion for (X,Y)-linkability.

32 Classification error of (X,Y)-linkability
– XYE: our method with Y = TopN.
– BLE: the unmodified data.
– RJE: removing all join attributes from T1.
– RSE: removing all attributes in Y from T1.

33 Scalability
– (X,Y)-anonymity (k = 40)
– (X,Y)-linkability (k = 90%)

34 Conclusion
Previous k-anonymization work focused on a single release of data.
We studied the sequential anonymization problem, in which data are released sequentially and a global QID may span several releases.
We introduced the lossy join to hide the join relationship and weaken the global QID.
We addressed the challenges posed by the large size of the lossy join.
The method is extendable to more than two releases T2, ..., Tp.

35 References
[BA05] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In IEEE ICDE, 2005.
[DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005.
[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, April 2005.
[KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.

36 References
[LDR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, 2005.
[MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006.
[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.

37 References
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, November 2005.
[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006.
[WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.

38 References
[WLFW06] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006.
[YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.