Refined privacy models

Presentation transcript:

Data Anonymization: Refined Privacy Models

Outline
- The k-anonymity specification alone is not sufficient
- Enhancing privacy: l-diversity
- t-closeness
- MaxEnt analysis

Linking the dots: countering privacy attacks
- k-anonymity addresses one type of attack: the linking attack.
- What about other types of attacks?

k-anonymity
- k-anonymity counters this type of privacy attack: individual -> quasi-identifier -> sensitive attributes.
- Quasi-identifier: the attributes through which the attacker can find the link using other public data.
- Example: in a 4-anonymized table, at least 4 records share the same quasi-identifier.
- Typical method: domain-specific generalization (a rough sketch follows).
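The slide does not include code; as a purely hypothetical illustration of domain-specific generalization (the zip/age columns, the generalization steps, and the data below are all made up, not taken from the presentation), a small Python sketch that coarsens the quasi-identifier until every group has at least k records:

```python
from collections import Counter

# Hypothetical records: (zip, age, disease); zip and age form the quasi-identifier.
records = [
    ("13053", 28, "Heart disease"), ("13068", 29, "Heart disease"),
    ("13068", 21, "Viral infection"), ("13053", 23, "Viral infection"),
    ("14853", 50, "Cancer"), ("14853", 55, "Heart disease"),
    ("14850", 47, "Viral infection"), ("14850", 49, "Cancer"),
]

def generalize(zip_code, age, level):
    """Coarsen the quasi-identifier: mask trailing zip digits, widen age ranges."""
    z = zip_code[: max(0, 5 - level)] + "*" * min(5, level)
    width = 10 * level if level > 0 else 1
    lo = (age // width) * width if level > 0 else age
    return (z, f"{lo}-{lo + width - 1}" if level > 0 else str(age))

def k_anonymize(rows, k):
    """Raise the generalization level until every QI group holds >= k rows."""
    for level in range(6):
        groups = Counter(generalize(z, a, level) for z, a, _ in rows)
        if min(groups.values()) >= k:
            return [(*generalize(z, a, level), d) for z, a, d in rows]
    return None

for row in k_anonymize(records, 4):
    print(row)
```

With this toy data, level-2 generalization ("130**"/"148**", 20-year age ranges) is the first level at which every group reaches size 4.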

More privacy problems
- All existing k-anonymity approaches assume that privacy is protected as long as the k-anonymity specification is satisfied.
- But there are other problems: homogeneity in the sensitive attribute, background knowledge about individuals, ...

Problem 1: homogeneity attack
- We know Bob is in the table.
- If Bob lives in zip code 13053 and is 31 years old -> Bob surely has cancer, because every record in his anonymized block carries the same sensitive value.

Problem 2: background knowledge attack
- Japanese have an extremely low incidence of heart disease.
- Umeko is Japanese, lives in zip code 13068, and is 21 years old -> Umeko has a viral infection with high probability.

The cause of these two problems
- The sensitive-attribute values in some blocks do not have sufficient diversity.
- Problem 1: no diversity at all.
- Problem 2: background knowledge helps reduce the effective diversity.

Major contributions of l-diversity
- Formally analyze the privacy of k-anonymity using the Bayes-optimal privacy model.
- Basic idea: increase the diversity of sensitive-attribute values within each anonymized block.
- Instantiations and implementations of the l-diversity concept: entropy l-diversity, recursive (c,l)-diversity, and more.

Modeling the attacks
- What is a privacy attack? Guessing the sensitive values (probabilistically).
- Prior belief: what can we guess without seeing the table? S: sensitive attribute, Q: quasi-identifier, prior: P(S=s | Q=q). Example: Japanese vs. heart disease.
- Observed belief: once the anonymized table T* is published, the belief changes: P(S=s | Q=q, T*).
- Effective privacy attacks: the table T* changes the belief a lot.
  - Prior is small, observed belief is large -> positive disclosure.
  - Prior is large, observed belief is small -> negative disclosure.

The definition of observed belief
- A q*-block: a k-anonymized group with q* as its (generalized) quasi-identifier, containing n(q*) records.
- n(q*, s): the number of records in the q*-block with S = s; n(q*, s) / n(q*) is the proportion of the block with that value.
- f(s|q): the background knowledge, i.e. the prior P(S=s | Q=q); f(s|q*) is the corresponding prior for the generalized value q*.
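The formula itself did not survive the transcript; reconstructed from the l-diversity paper's definition of the observed belief, it combines the quantities above as:

```latex
\beta_{(q,s,T^*)} \;=\;
\frac{\,n(q^*,s)\,\dfrac{f(s\mid q)}{f(s\mid q^*)}\,}
     {\sum_{s'} n(q^*,s')\,\dfrac{f(s'\mid q)}{f(s'\mid q^*)}}
```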

Interpreting the privacy problem of k-anonymity
- Derived from the relationship between the observed belief and privacy disclosure.
- Extreme situation (positive disclosure): β(q,s,T*) -> 1.
  - Possibility 1: n(q*, s') << n(q*, s) for the other values s' => lack of diversity.
  - Possibility 2: strong background knowledge eliminates the other values. Knowledge: except for one s, the other s' are unlikely when Q=q, i.e. f(s'|q) -> 0, which minimizes the contribution of the other values and drives β(q,s,T*) -> 1.

Negative disclosure: β(q,s,T*) -> 0
- β -> 0 requires either n(q*, s) -> 0 or f(s|q) -> 0.
- The Umeko example.

How to address the problems?
- Ensure that n(q*, s') << n(q*, s) does not hold, so the attacker needs extra knowledge to rule out the other values.
- "Damaging instance-level knowledge" is what drives f(s'|q) -> 0.
- If there are L distinct sensitive values in the q*-block, the attacker needs L-1 pieces of damaging knowledge to rule out the L-1 other possible sensitive values.
- This is the principle of l-diversity.

l-diversity: how to evaluate it?
- Entropy l-diversity: every q*-block satisfies
  -Σ_s p(q*, s) log p(q*, s) ≥ log(L), where p(q*, s) = n(q*, s) / n(q*),
  i.e. the entropy of the sensitive values in the q*-block is at least the entropy of L uniformly distributed distinct values.
- We like a uniform distribution of sensitive values over each block.
- This guarantees every q*-block has at least L distinct sensitive values.
- A sketch of the check follows.
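A minimal sketch of the entropy l-diversity check described above; the example sensitive values are hypothetical:

```python
from collections import Counter
from math import log

def entropy_l_diverse(block_sensitive_values, l):
    """Check entropy l-diversity: the entropy of the sensitive values in a
    q*-block must be at least log(l)."""
    counts = Counter(block_sensitive_values)
    n = sum(counts.values())
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy >= log(l)

# Four evenly distributed values give entropy log(4) >= log(3): satisfied.
print(entropy_l_diverse(["cancer", "flu", "hepatitis", "asthma"] * 2, 3))   # True
# A heavily skewed block has entropy well below log(3): violated.
print(entropy_l_diverse(["cancer"] * 5 + ["flu", "hepatitis"], 3))          # False
```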

Other extensions
- Entropy l-diversity is too restrictive:
  - Some positive disclosures are acceptable; in practice some sensitive values have very high frequency and are not actually sensitive, e.g. "normal" as a disease symptom.
  - The log(L) bound cannot be satisfied in some cases.
- Principle for relaxing the strong condition: a uniform distribution of sensitive values is ideal; when we cannot achieve it, we make the value frequencies as close as possible, especially for the most frequent value.
- Recursive (c,l)-diversity: control the gap between the most frequent value and the least frequent values. With the block's value counts sorted in descending order r_1 ≥ r_2 ≥ ... ≥ r_m, require r_1 < c (r_l + r_{l+1} + ... + r_m). A check is sketched below.
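A corresponding sketch of the recursive (c,l)-diversity check, again on made-up counts:

```python
from collections import Counter

def recursive_cl_diverse(block_sensitive_values, c, l):
    """Check recursive (c, l)-diversity: with value counts sorted in descending
    order r_1 >= r_2 >= ..., require r_1 < c * (r_l + r_{l+1} + ... + r_m)."""
    r = sorted(Counter(block_sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False  # fewer than l distinct values cannot qualify
    return r[0] < c * sum(r[l - 1:])

# The most frequent value may dominate only up to c times the tail mass.
print(recursive_cl_diverse(["flu"] * 4 + ["cancer"] * 2 + ["hepatitis"], c=3, l=2))   # True:  4 < 3*(2+1)
print(recursive_cl_diverse(["flu"] * 10 + ["cancer"] * 2 + ["hepatitis"], c=3, l=2))  # False: 10 >= 3*(2+1)
```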

Implementation of l-diversity
- Build an algorithm with a structure similar to k-anonymity algorithms: use a domain generalization hierarchy, but check l-diversity instead of k-anonymity.
- The Anatomy approach (without anonymization of the QIs).

Discussion: problems not addressed
- Skewed data: a common problem for both l-diversity and k-anonymity; it makes l-diversity very inefficient.
- Balance between utility and privacy: the entropy l-diversity and (c,l)-diversity methods do not guarantee good data utility; the Anatomy method is much better.

t-closeness
Addresses two types of attacks: the skewness attack and the similarity attack.

Skewness attack
- The probability of cancer in the original table is low.
- The probability of cancer within a block of the anonymized table can be much higher than this global probability.

Semantic similarity attack
- All salaries in a block are low -> the individual's salary is low.
- All diseases in a block are stomach-related -> the individual has some kind of stomach disease.

The root of these two problems
- The difference between the global distribution of sensitive values and the local distribution within some block.

The proposal of t-closeness
- Make the global and local distributions as similar as possible.
- Evaluating distribution similarity must account for semantic similarity and density: {3k, 4k, 5k} is denser than {6k, 8k, 11k}.
- The Earth Mover's Distance (EMD) is used as the similarity measure (a sketch for ordered values follows).
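A minimal sketch of a t-closeness check for an ordered numerical sensitive attribute, using the ordered-distance form of the Earth Mover's Distance (the sum of absolute cumulative differences, normalized by m-1); the salary values and the threshold t below are hypothetical:

```python
from collections import Counter
from itertools import accumulate

def ordered_emd(block_values, table_values, domain):
    """EMD between the block's and the whole table's distributions over an
    ordered domain: sum of absolute cumulative differences, divided by m-1."""
    m = len(domain)
    p = Counter(block_values)
    q = Counter(table_values)
    p_dist = [p[v] / len(block_values) for v in domain]
    q_dist = [q[v] / len(table_values) for v in domain]
    diffs = [pi - qi for pi, qi in zip(p_dist, q_dist)]
    return sum(abs(c) for c in accumulate(diffs)) / (m - 1)

# Hypothetical salaries (in thousands); the domain is sorted in increasing order.
domain = [3, 4, 5, 6, 8, 11]
table = [3, 4, 5, 6, 8, 11, 6, 8, 11]   # global distribution
low_block = [3, 4, 5]                    # semantically "all low" block
mixed_block = [3, 8, 11]                 # block spread over the domain

t = 0.2  # hypothetical closeness threshold
print(ordered_emd(low_block, table, domain) <= t)    # False: EMD = 0.4, the block skews low
print(ordered_emd(mixed_block, table, domain) <= t)  # True:  EMD ~ 0.13, the block tracks the global
```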

Privacy MaxEnt
- Quantify privacy under background knowledge attacks, so that we know how vulnerable an anonymized dataset is under different assumptions about the attacker's knowledge.

All attacks are based on the attacker's background knowledge
- Knowledge from the table: the local/global distributions of sensitive values can always be calculated.
- Common knowledge: useful common knowledge should be consistent with the knowledge from the table.
- An attack is an estimate of P(S|Q): find Q -> S associations with high confidence. Both a higher and a lower P(S|Q) than common knowledge suggests reveal information.

Quantifying privacy
- Need to estimate the conditional probability P(S|Q), where B denotes a bucket (a QI group in the published table).
- The most interesting quantity is P(S|Q,B): the probability of a sensitive value given the quasi-identifier within a bucket.

Without background knowledge
- P(S|Q,B) is estimated as the proportion of S in bucket B.
With background knowledge
- It becomes complicated. The paper proposes a Maximum Entropy based method to estimate P(S|Q,B) under different assumptions about the attacker's background knowledge, modeling that knowledge as constraints.

Types of background knowledge
- Rule-based knowledge: P(s | q) = 1, P(s | q) = 0.
- Probability-based knowledge: P(s | q) = 0.2, P(s | Alice) = 0.2.
- Vague background knowledge: 0.3 ≤ P(s | q) ≤ 0.5.
- Miscellaneous types: P(s | q1) + P(s | q2) = 0.7; one of Alice and Bob has "Lung Cancer".

Maximum Entropy Estimation
- MaxEnt principle: if you don't know the distribution, assume it is uniform; if you know part of the distribution, model the remaining part as uniform.
- Uniform distribution <=> maximum entropy, so maximizing entropy pushes the distribution toward uniform.

MaxEnt for privacy analysis
- Maximize the conditional entropy H(S|Q,B) = H(Q,S,B) - H(Q,B); since H(Q,B) is determined by the published table, this is equivalent to maximizing H(Q,S,B).
- H(Q,S,B) is maximized when P(Q,S,B) is uniform.
- The problem: given a published table D', find an assignment of P(Q,S,B) over all (Q,S,B) combinations that maximizes H(Q,S,B) while satisfying a list of constraints (including the background knowledge). A sketch of this optimization follows.
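A minimal sketch of the optimization step, assuming the joint distribution P(Q,S,B) is flattened into a vector and each piece of knowledge has already been turned into a linear constraint; the variable layout, the constraint values, and the use of scipy's SLSQP solver are my assumptions, not details from the slides or the paper:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical setup: 2 QI values x 2 sensitive values x 1 bucket = 4 cells.
# x[i] = P(q, s, b), laid out as (q1,s1), (q1,s2), (q2,s1), (q2,s2).
n_cells = 4

def neg_entropy(x):
    """Negative of H(Q,S,B); minimizing it maximizes the entropy."""
    x = np.clip(x, 1e-12, 1.0)  # avoid log(0)
    return float(np.sum(x * np.log(x)))

constraints = [
    # Probabilities sum to 1.
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},
    # Invariant from the published data (hypothetical): P(q1) = 0.6.
    {"type": "eq", "fun": lambda x: x[0] + x[1] - 0.6},
    # Background knowledge (hypothetical): P(s1 | q1) = 0.2, i.e. P(q1,s1) = 0.2 * P(q1).
    {"type": "eq", "fun": lambda x: x[0] - 0.2 * (x[0] + x[1])},
]

x0 = np.full(n_cells, 1.0 / n_cells)  # start from the uniform distribution
res = minimize(neg_entropy, x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * n_cells, constraints=constraints)

p = res.x
print("P(Q,S,B) =", np.round(p, 3))
print("Estimated P(s1 | q1, b) =", round(p[0] / (p[0] + p[1]), 3))
```

The unconstrained cells end up as close to uniform as the constraints allow, which is exactly the MaxEnt behavior the slide describes.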

Finding constraints
- Knowledge about the data distribution
- Constraints from the data
- Knowledge about individuals

Modeling background knowledge about distributions
- The knowledge has the form P(S|Qv), where Qv is a part of Q, e.g. P(Breast cancer | male) = 0.
- It becomes a joint constraint P(Qv, S) = P(S|Qv) * P(Qv), summed over the remaining attributes Q- = Q - Qv and over all buckets.
- In the previous example, if P(flu | male) = 0.3:
  P({male, college}, Flu, 1) + P({male, highschool}, Flu, 1) + P({male, college}, Flu, 3) + P({male, graduate}, Flu, 3)
  = 0.3 * P(male) = 0.3 * 6/10 = 0.18
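In general form (my own rendering of the constraint that the example instantiates, not a formula quoted from the slides):

```latex
\sum_{b}\;\sum_{q \,:\, q \supseteq Q_v} P(q,\, S = s,\, b) \;=\; P(s \mid Q_v)\, P(Q_v)
```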

Constraints from the data
- Identify invariants from the disguised (published) data:
  - QI-invariant equation
  - SA-invariant equation
  - Zero-invariant equation: P(q, s, b) = 0 if q is not in bucket b or s is not in bucket b.
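The QI- and SA-invariant equations themselves were not captured in the transcript; a plausible reading, based on the idea that the published table reveals each bucket's quasi-identifier and sensitive-value marginals, is:

```latex
% QI-invariant: the QI marginal of each bucket is visible in the published table.
\sum_{s \in b} P(q, s, b) \;=\; P(q, b) \qquad \text{for every } q \in b
% SA-invariant: the sensitive-value marginal of each bucket is also visible.
\sum_{q \in b} P(q, s, b) \;=\; P(s, b) \qquad \text{for every } s \in b
```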

Knowledge about individuals
- Can be modeled with similar methods. Example individuals: Alice (i1, q1), Bob (i4, q2), Charlie (i9, q5).
- Knowledge 1: Alice has either s1 or s4. Constraint: (sketched below)
- Knowledge 2: Two people among Alice, Bob, and Charlie have s4. Constraint: (sketched below)
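The constraint formulas did not survive the transcript; one plausible way to write them, treating the individuals' sensitive-value probabilities as variables of the MaxEnt program, is:

```latex
% Knowledge 1: Alice has either s1 or s4.
P(s_1 \mid \text{Alice}) + P(s_4 \mid \text{Alice}) = 1
% Knowledge 2: two people among Alice, Bob, and Charlie have s4.
P(s_4 \mid \text{Alice}) + P(s_4 \mid \text{Bob}) + P(s_4 \mid \text{Charlie}) = 2
```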

Summary
- Attack analysis is an important part of anonymization.
- Background knowledge modeling is the key to attack analysis.
- But there is no way to enumerate all background knowledge...