Preserving Privacy in Published Data
Dimitris Sacharidis

outline
- introduction: motivation, definitions
- k-anonymity: incognito, mondrian, topdown
- l-diversity: anatomy
- other issues

motivation
- several agencies, institutions, bureaus, and organizations make (sensitive) data involving people publicly available
  - termed microdata (vs. aggregated macrodata)
  - used for analysis; often required and imposed by law
- to protect privacy, microdata are sanitized: explicit identifiers (SSN, name, phone #) are removed
- is this sufficient for preserving privacy? no!
  - susceptible to link attacks: publicly available databases (voter lists, city directories) can reveal the "hidden" identity

link attack example
- Sweeney [S02a] managed to re-identify the medical record of the governor of Massachusetts
  - MA collects and publishes sanitized medical data for state employees (microdata)
  - the voter registration list of MA is publicly available data
- looking for the governor's record, join the two tables: 6 people shared his birth date, 3 of those were men, and only 1 lived in his zipcode
- regarding the US 1990 census data, 87% of the population are unique based on (zipcode, gender, dob)
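The linking join above can be sketched as follows. This is a toy illustration, not the actual MA data: the records, column names, and diagnosis labels ("A", "B", "C") are hypothetical; only the quasi-identifier columns (zipcode, dob, sex) mirror the real attack.

```python
medical = [  # sanitized microdata: explicit identifiers removed
    {"zip": "02138", "dob": "7/31/45", "sex": "male", "diagnosis": "A"},
    {"zip": "02138", "dob": "7/31/45", "sex": "female", "diagnosis": "B"},
    {"zip": "02139", "dob": "2/14/61", "sex": "male", "diagnosis": "C"},
]
voters = [  # public voter list: names attached to the same quasi-identifiers
    {"name": "W. Weld", "zip": "02138", "dob": "7/31/45", "sex": "male"},
    {"name": "A. Smith", "zip": "02139", "dob": "2/14/61", "sex": "male"},
]

qi = ("zip", "dob", "sex")
# the link attack is just an equi-join on the quasi-identifiers
reidentified = [
    (v["name"], m["diagnosis"])
    for m in medical for v in voters
    if all(m[a] == v[a] for a in qi)
]
print(reidentified)  # [('W. Weld', 'A'), ('A. Smith', 'C')]
```

Whenever a quasi-identifier combination is unique in both tables, the join attaches a name to a "sanitized" record.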

privacy in microdata
- the role of attributes in microdata:
  - explicit identifiers: removed
  - quasi identifiers: can be used to re-identify individuals
  - sensitive attributes (may not exist!): carry sensitive information

  identifier | quasi identifiers           | sensitive
  Name       | Birthdate  Sex     Zipcode  | Disease
  Andre      | 21/1/79    male    53715    | Flu
  Beth       | 10/1/81    female  55410    | Hepatitis
  Carol      | 1/10/44    female  90210    | Bronchitis
  Dan        | 21/2/84    male    02174    | Sprained Ankle
  Ellen      | 19/4/72    female  02237    | AIDS

- goal of privacy preservation (rough definition): de-associate individuals from sensitive information

outline
- introduction: motivation, definitions
- k-anonymity: incognito, mondrian, topdown
- l-diversity: anatomy
- other issues

k-anonymity
- preserve privacy via k-anonymity, proposed by Sweeney and Samarati [S01, S02a, S02b]
- intuition: hide each individual among k-1 others
  - each QI set of values should appear at least k times in the released microdata
  - linking cannot be performed with confidence > 1/k
- sensitive attributes are not considered (we will revisit this...)
- how to achieve this? generalization and suppression
  - value perturbation is not considered (we should remain truthful to the original values)
- privacy vs. utility tradeoff: do not anonymize more than necessary
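The definition above is easy to check mechanically: group the released tuples by their quasi-identifier values and verify every group has size at least k. A minimal sketch (the toy rows and column names are illustrative):

```python
from collections import Counter

def is_k_anonymous(rows, qi_cols, k):
    """True iff every combination of quasi-identifier values occurs
    at least k times, so linking confidence is at most 1/k."""
    counts = Counter(tuple(row[c] for c in qi_cols) for row in rows)
    return all(n >= k for n in counts.values())

rows = [
    {"sex": "male", "zip": "537**"},
    {"sex": "male", "zip": "537**"},
    {"sex": "female", "zip": "554**"},
]
print(is_k_anonymous(rows, ["sex", "zip"], 2))  # False: the female row is alone in its group
```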

k-anonymity example
- tools for anonymization:
  - generalization: publish more general values, i.e., given a domain hierarchy, roll up
  - suppression: remove tuples, i.e., do not publish outliers; often the number of suppressed tuples is bounded

  original microdata             2-anonymous data
  Birthdate  Sex     Zipcode    Birthdate  Sex     Zipcode
  21/1/79    male    53715      */1/79     person  5****    (group 1)
  10/1/79    female  55410      */1/79     person  5****    (group 1)
  1/10/44    female  90210      suppressed
  21/2/83    male    02274      */*/8*     male    022**    (group 2)
  19/4/82    male    02237      */*/8*     male    022**    (group 2)
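Rolling up a zipcode along its domain hierarchy amounts to masking its trailing digits, one hierarchy level per digit. A minimal sketch of this generalization step (the helper name is ours):

```python
def generalize_zip(zipcode, level):
    """Roll up a zipcode by masking its last `level` digits,
    one hierarchy level per digit (53715 -> 5371* -> ... -> *****)."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

zips = ["53715", "55410", "90210"]
print([generalize_zip(z, 4) for z in zips])  # ['5****', '5****', '9****']
# '90210' stays alone even at this level, so a 2-anonymous release
# would suppress that tuple rather than generalize everything further.
```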

generalization lattice
- assume domain hierarchies exist for all QI attributes (sex, zipcode, birthdate)
- objective: find the minimum generalization that satisfies k-anonymity
  - i.e., maximize utility by finding the minimum-distance vector in the lattice that is k-anonymous
- construct the generalization lattice for the entire QI set; nodes higher in the lattice are more general, lower nodes less general

generalization lattice
- incognito [LDR05] exploits monotonicity properties regarding the frequency of tuples in the lattice, reminiscent of OLAP hierarchies and frequent itemset mining
- (I) generalization property (~rollup): if k-anonymity holds at some node, then it also holds at any ancestor node
  - e.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>
- (II) subset property (~apriori): if k-anonymity does not hold for a set of QI attributes, then it does not hold for any of its supersets
  - e.g., <S0, Z0> is not k-anonymous and, thus, <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous
  - note: the entire lattice, which includes the three dimensions <S, Z, B>, is too complex to show
- incognito considers sets of QI attributes of increasing cardinality (~apriori) and prunes nodes in the lattice using the two properties above
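The subset property can be sketched in a few lines. This is a deliberate simplification: it checks raw attribute subsets at a single generalization level, whereas incognito walks a generalization lattice per subset; the point is only the apriori-style pruning.

```python
from collections import Counter
from itertools import combinations

def k_anonymous(rows, attrs, k):
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return all(n >= k for n in counts.values())

def failing_qi_sets(rows, qi, k):
    """Scan QI subsets in order of increasing cardinality; by the
    subset property, any superset of a failing set is pruned
    without ever being counted."""
    failing = []
    for size in range(1, len(qi) + 1):
        for subset in combinations(qi, size):
            if any(set(f) <= set(subset) for f in failing):
                continue  # pruned: it contains a set already known to fail
            if not k_anonymous(rows, subset, k):
                failing.append(subset)
    return failing

rows = [{"sex": "m", "zip": "10001"}, {"sex": "m", "zip": "10002"},
        {"sex": "f", "zip": "10003"}, {"sex": "f", "zip": "10004"}]
print(failing_qi_sets(rows, ["sex", "zip"], 2))  # [('zip',)] -- ('sex', 'zip') was pruned
```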

seen in the domain space
- consider the multi-dimensional domain space: QI attributes are the dimensions, tuples are points in this space
- attribute hierarchies partition the dimensions (zipcode hierarchy, sex hierarchy)

seen in the domain space
- incognito example: 2 QI attributes, 7 tuples, hierarchies shown with bold lines
- (figure: the initial partitioning is not 2-anonymous; after rolling up zipcode and then sex along their hierarchies, the resulting partitioning is 2-anonymous)

seen in the domain space
- taxonomy [LDR05, LDR06]: generalization schemes classified according to the groupings they allow
  - single-dimensional global recoding: incognito [LDR05]
  - multi-dimensional global recoding: mondrian [LDR06]
  - multi-dimensional local recoding: topdown [XWP+06]
- (the figure orders these along an axis of increasing generalization strength)

mondrian [LDR06]
- define a utility measure, the discernibility metric (DM): penalize each tuple with the size of the group it belongs to
  - intuitively, the ideal grouping is the one in which all groups have size k
- mondrian tries to construct groups of roughly equal size k
- what else (besides the painting style of Mondrian) does this remind you of? it is reminiscent of the kd-tree: cycle among dimensions, median splits
- (figure: a 2-anonymous mondrian partitioning)
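The kd-tree analogy can be sketched directly. This is a simplified sketch, not the full algorithm (mondrian proper also chooses the widest dimension and checks whether a cut is allowable); here we just cycle dimensions and median-split, refusing any split that would leave a side with fewer than k points:

```python
def mondrian(points, k, dims, depth=0):
    """Recursively median-split points kd-tree style; a split is
    rejected if either side would hold fewer than k points."""
    dim = dims[depth % len(dims)]              # cycle among dimensions
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2                      # median split
    left, right = points[:mid], points[mid:]
    if len(left) < k or len(right) < k:
        return [points]                         # cannot split: emit one group
    return (mondrian(left, k, dims, depth + 1)
            + mondrian(right, k, dims, depth + 1))

pts = [(1, 2), (2, 9), (3, 4), (7, 1), (8, 8), (9, 3)]
groups = mondrian(pts, 2, dims=[0, 1])
print([len(g) for g in groups])  # every group has at least k = 2 points
```

Each returned group would then be generalized to its bounding box before release.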

measuring group quality
- DM depends only on the cardinality of the group; it gives no measure of how tight the group is
- a good group is one that contains tuples with similar QI values
- define a new metric [XWP+06], the normalized certainty penalty (NCP): it measures the (normalized) perimeter of the group's bounding box
- bad generalization: long boxes; good generalization: square-like boxes
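For numeric QI attributes, NCP can be sketched as the group's per-dimension extent divided by the full domain range, summed over dimensions (half the normalized perimeter of its box). A minimal sketch with made-up domains:

```python
def ncp(group, domain_ranges):
    """Normalized certainty penalty of one group: for each numeric QI
    dimension, the group's extent divided by the full domain range,
    summed over dimensions."""
    total = 0.0
    for dim, (lo, hi) in enumerate(domain_ranges):
        values = [p[dim] for p in group]
        total += (max(values) - min(values)) / (hi - lo)
    return total

domain = [(0, 100), (0, 100)]
square = [(10, 10), (20, 20)]      # 10 x 10 box
long_box = [(10, 10), (80, 12)]    # 70 x 2 box
print(ncp(square, domain), ncp(long_box, domain))
```

The long, thin box gets the larger penalty, matching the intuition that square-like groups generalize better.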

topdown [XWP+06]
- start with the entire data set and iteratively split it in two, reminiscent of the R-tree quadratic split
- continue until left with groups that contain at most 2k-1 tuples (such groups cannot be split into two groups of size ≥ k)
- split algorithm:
  - find seeds: the 2 points that are furthest away (found heuristically, not by a complete quadratic search); the seeds become the 2 split groups
  - examine the remaining points in random order (unlike quadratic split) and assign each point to the group whose NCP will increase the least
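One such split can be sketched as below. Assumptions to note: the seed search here is exhaustive for clarity (the paper uses a heuristic), and an un-normalized bounding-box extent stands in for NCP.

```python
import random

def extent_sum(group):
    """Sum of per-dimension extents of the group's bounding box
    (an un-normalized stand-in for NCP)."""
    return sum(max(p[d] for p in group) - min(p[d] for p in group)
               for d in range(len(group[0])))

def split(points, seed=0):
    """One topdown-style split: take the two farthest points as seeds,
    then hand out the rest in random order, each to the group whose
    bounding box grows the least."""
    rng = random.Random(seed)
    a, b = max(((p, q) for p in points for q in points),
               key=lambda pq: sum((x - y) ** 2 for x, y in zip(*pq)))
    g1, g2 = [a], [b]
    rest = [p for p in points if p is not a and p is not b]
    rng.shuffle(rest)  # random examination order, unlike quadratic split
    for p in rest:
        grow1 = extent_sum(g1 + [p]) - extent_sum(g1)
        grow2 = extent_sum(g2 + [p]) - extent_sum(g2)
        (g1 if grow1 <= grow2 else g2).append(p)
    return g1, g2

pts = [(0, 0), (1, 1), (9, 9), (10, 10)]
g1, g2 = split(pts)
print(sorted(g1), sorted(g2))  # [(0, 0), (1, 1)] [(9, 9), (10, 10)]
```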

boosting privacy with external data
- external databases (e.g., voter lists) are used by attackers; can we use them to our benefit?
- yes: try to improve the utility of the anonymized data
- join k-anonymity (JKA) [SMP]: join the microdata with the public data, then anonymize the joined table
- (figure: a 3-anonymous release via plain k-anonymity vs. the 3-anonymous joined microdata produced by JKA)

outline
- introduction: motivation, definitions
- k-anonymity: incognito, mondrian, topdown
- l-diversity: anatomy
- other issues

k-anonymity problems
- homogeneity attack: in the last group, everyone has cancer
- background knowledge: in the first group, Japanese have a low chance of heart disease
- conclusion: we need to consider the sensitive values

  microdata:
  id  Zipcode  Age  Nationality  Disease
  1   13053    28   Russian      Heart Disease
  2   13068    29   American     Heart Disease
  3   13068    21   Japanese     Viral Infection
  4   13053    23   American     Viral Infection
  5   14853    50   Indian       Cancer
  6   14853    55   Russian      Heart Disease
  7   14850    47   American     Viral Infection
  8   14850    49   American     Viral Infection
  9   13053    31   American     Cancer
  10  13053    37   Indian       Cancer
  11  13068    36   Japanese     Cancer
  12  13068    35   American     Cancer

  4-anonymous data:
  id  Zipcode  Age  Nationality  Disease
  1-4   130**  <30  *            Heart Disease (2), Viral Infection (2)
  5-8   1485*  ≥40  *            Cancer (1), Heart Disease (1), Viral Infection (2)
  9-12  130**  3*   *            Cancer (4)

l-diversity [MGK+06]
- make sure each group contains well-represented sensitive values
  - protects from homogeneity attacks
  - protects from background knowledge
- l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value appears in at most a 1/l fraction of the group
- the l-diversity definition has monotonicity properties similar to k-anonymity: simply adapt the existing algorithms to also count frequencies of sensitive values
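The simplified definition above is a direct frequency check per group. A minimal sketch (groups are given as lists of sensitive values):

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Simplified l-diversity: in every group, the most frequent
    sensitive value accounts for at most a 1/l fraction of the tuples."""
    for sensitive_values in groups:
        top = Counter(sensitive_values).most_common(1)[0][1]
        if top > len(sensitive_values) / l:
            return False
    return True

print(is_l_diverse([["cancer", "flu", "hepatitis"]], 3))         # True
print(is_l_diverse([["cancer", "cancer", "cancer", "flu"]], 2))  # False: homogeneity risk
```

The all-cancer group from the previous slide would fail this check for any l > 1.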

anatomy [XT06]
- a fast l-diversity algorithm
- anatomy is not generalization: it separates the sensitive values from the tuples and shuffles the sensitive values among groups
- algorithm: assign sensitive values to buckets; create groups by drawing from the l largest buckets

  quasi-identifier table QIT (group IDs not recoverable from this transcript):
  id  Age  Sex  Zipcode
  1   23   M    11000
  2   27   M    13000
  3   35   M    59000
  4   59   M    12000
  5   61   F    54000
  6   65   F    25000
  7        F
  8   70   F    30000

  sensitive table ST(Group-ID, Disease, Count); surviving entries: (1, dyspepsia, 2), plus pneumonia, bronchitis, flu, gastritis (group IDs and counts lost)
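The bucketization step can be sketched as follows. This is a simplified sketch of anatomy's grouping, with hypothetical names and toy data; leftover tuples that cannot form an l-diverse group are simply dropped here rather than handled as in the paper.

```python
from collections import Counter

def anatomize(tuples, sensitive, qi_cols, l):
    """Anatomy-style bucketization: bucket tuples by sensitive value,
    then repeatedly form a group by drawing one tuple from each of the
    l currently largest buckets.  Returns the quasi-identifier table
    (QI values + group id) and the sensitive table (group id, value, count)."""
    buckets = {}
    for t in tuples:
        buckets.setdefault(t[sensitive], []).append(t)
    qit, st = [], Counter()
    gid = 0
    while len(buckets) >= l:           # need l distinct values for a group
        gid += 1
        largest = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        for value in largest:
            t = buckets[value].pop()
            if not buckets[value]:
                del buckets[value]
            qit.append(tuple(t[c] for c in qi_cols) + (gid,))
            st[(gid, value)] += 1
    return qit, sorted(st.items())

people = [{"age": 23, "disease": "flu"}, {"age": 27, "disease": "flu"},
          {"age": 35, "disease": "dyspepsia"}, {"age": 59, "disease": "gastritis"}]
qit, st = anatomize(people, "disease", ["age"], 2)
print(qit)  # each QI row carries only a group id, never its own disease
print(st)   # per group: which diseases occur, and how often
```

Because each group draws from l different buckets, every group contains l distinct sensitive values, so the release is l-diverse without generalizing any QI value.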

outline
- introduction: motivation, definitions
- k-anonymity: incognito, mondrian, topdown
- l-diversity: anatomy
- other issues

other issues
- how to preserve privacy when re-publishing dynamic data [XT07]
- protection against the minimality attack [WFWP07]
- anonymity in location-based services: spatial cloaking

references
[LDR05] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-Domain k-Anonymity. SIGMOD, 2005.
[LDR06] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional k-Anonymity. ICDE, 2006.
[MGK+06] Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M. l-Diversity: Privacy Beyond k-Anonymity. ICDE, 2006.
[S01] Samarati, P. Protecting Respondents' Identities in Microdata Release. IEEE TKDE, 13(6):1010-1027, 2001.
[S02a] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[S02b] Sweeney, L. Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[SMP] Sacharidis, D., Mouratidis, K., Papadias, D. k-Anonymity in the Presence of External Databases (submitted).
[WFWP07] Wong, R.C., Fu, A.W., Wang, K., Pei, J. Minimality Attack in Privacy Preserving Data Publishing. VLDB, 2007.
[XT06] Xiao, X., Tao, Y. Anatomy: Simple and Effective Privacy Preservation. VLDB, 2006.
[XT07] Xiao, X., Tao, Y. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. SIGMOD, 2007.
[XWP+06] Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A. Utility-Based Anonymization Using Local Recoding. SIGKDD, 2006.