1
Data Publishing against Realistic Adversaries
Johannes Gehrke, Cornell University, Ithaca, NY
Michaela Götz, Cornell University, Ithaca, NY
Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA
Amedeo D'Ascanio, University of Bologna, Italy
2
Outline
Introduction
ε-privacy
Adversary knowledge
Adversary classes
Applying ε-privacy to generalization
Experimental evaluation
Conclusion
3
Introduction
Many reasons to publish data. Requirements:
Preserve aggregate information about the population
Preserve the privacy of sensitive information
Privacy: how much information can an adversary deduce from the released data?
4
Example
Alice knows that Rachel is 35 and lives in ZIP code 13058
Alice knows that Rachel is 20 and has a very low probability of heart disease
5
Previous definitions
l-diversity: the adversary knows l−2 pieces of information about the sensitive attribute, and the remaining values are equally likely
t-closeness: Alice knows the distribution of sensitive values; Rachel's chances of having a disease follow the same odds
Differential privacy: Alice knows the exact disease of every patient except Rachel
But consider: "It's flu season, a lot of elderly people will be in the hospital with flu symptoms."
How do we model such background knowledge with l-diversity or t-closeness? Does Alice really know everything about 1 billion patients? Unrealistic assumptions!
6
ε-privacy
A flexible language to define information about each individual
Privacy as the difference in the adversary's belief between the table published with and without the "victim"
Different classes of adversary (either realistic or unrealistic), modeled according to their knowledge
7
Modeling sensitive information
Positive disclosure: Alice learns that Rachel has flu
Negative disclosure: Alice learns that Rachel does not have flu
Sensitive information is expressed using positive disclosures on a set of sensitive predicates Φ
8
Modeling sensitive information — Example
A negative disclosure takes the form t[S] ∈ dom(S) \ {s}, where dom(S) is the domain of the sensitive attribute
Rachel is protected against any kind of disclosure for flu, cancer, and any stomach disease if the privacy condition holds for each subset of these values
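A minimal sketch of how the predicate set Φ could be represented. The domain and the protected set below are illustrative assumptions (the slide mentions flu, cancer, and stomach diseases but gives no concrete domain); protecting against every disclosure means guarding every non-empty subset of the protected values.

```python
# Sketch: sensitive predicates over a sensitive attribute S.
# The domain and protected values are illustrative, not from the paper.
from itertools import combinations

dom_S = {"flu", "cancer", "gastritis", "ulcer", "healthy"}
protected = {"flu", "cancer", "gastritis", "ulcer"}

def positive_predicate(value):
    """Predicate 't[S] = value' (positive disclosure)."""
    return lambda t: t == value

def negative_predicate(excluded):
    """Predicate 't[S] in dom(S) \\ excluded' (negative disclosure)."""
    return lambda t: t not in excluded

# Guarding against any disclosure about the protected set means
# considering every non-empty subset of it:
phi = []
for r in range(1, len(protected) + 1):
    for subset in combinations(sorted(protected), r):
        phi.append(set(subset))  # predicate: t[S] in subset

print(len(phi))  # 2^4 - 1 = 15 non-empty subsets
```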
9
Adversary knowledge
Knowledge from other sources, usually modeled as a joint distribution P over the nonsensitive attributes N and the sensitive attribute S
If the adversary has no preference for any individual i, the same prior p_i applies to each
10
Adversary knowledge — two problems
Where does the adversary learn their knowledge?
If 10% of the population has cancer (s_i = s/10), then for each i, p_i = s_i/s = 0.1
But what if T_pub has only 10 entries?
Can the adversary change their prior?
The probability that a woman has cancer is p_i = 0.5, based on a sample of 100 women
The adversary then reads another table with 20k tuples where s_i is 2k (so that p_i = 0.1)
If the prior is not strong, p_i will change accordingly
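The prior-strength example above can be reproduced as a standard Beta–Bernoulli update; the counts (a sample of 100 women with p_i = 0.5, then 20k new tuples with 2k positives) are from the slide, while the Beta parametrization itself is a conventional modeling choice, not something the paper prescribes.

```python
# Beta-Bernoulli update: a weak prior moves toward new evidence.
def beta_update(alpha, beta, positives, negatives):
    """Posterior Beta parameters after observing new counts."""
    return alpha + positives, beta + negatives

# Prior p_i = 0.5 from a sample of 100 women -> Beta(50, 50)
alpha, beta = 50, 50
# New table: 20k tuples, 2k of them with cancer
alpha, beta = beta_update(alpha, beta, 2_000, 18_000)
posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 4))  # 0.102 -- the weak prior moved close to 0.1
```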
11
Adversary knowledge
To model adversaries we assume that:
the adversary may entertain more than one prior
the tuples are not independent of each other
Exchangeability: a sequence of random variables X_1, X_2, …, X_n is exchangeable if every finite permutation of these random variables has the same joint probability distribution
If H is healthy and S is sick, the probability of seeing the sequence SSHSH is the same as the probability of HHSSS
According to de Finetti's representation theorem, an exchangeable sequence of random variables is mathematically equivalent to:
choosing a data-generating distribution θ at random
creating the data by independently sampling from this chosen distribution θ
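The SSHSH/HHSSS claim can be checked directly: under any mixture of i.i.d. distributions (the de Finetti form), the probability of a sequence depends only on its counts. The mixture components below are illustrative values, not from the paper.

```python
# Exchangeability via de Finetti: mix over a latent parameter theta,
# then sample i.i.d. Sequence probability depends only on the counts.
def seq_prob(seq, thetas, weights):
    """P(seq) under a mixture of i.i.d. Bernoulli('S') distributions."""
    total = 0.0
    for theta, w in zip(thetas, weights):
        p = 1.0
        for c in seq:
            p *= theta if c == "S" else 1 - theta
        total += w * p
    return total

thetas, weights = [0.2, 0.8], [0.5, 0.5]  # illustrative mixture
p1 = seq_prob("SSHSH", thetas, weights)
p2 = seq_prob("HHSSS", thetas, weights)
assert abs(p1 - p2) < 1e-12  # same counts => same probability
```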
12
Adversary knowledge — Example
Assume two populations of equal size: Ω_1 with only healthy people and Ω_2 with only sick people. Table T is drawn entirely from either Ω_1 or Ω_2.
If the adversary doesn't know which population was chosen: Pr[t = H] = 0.5
If the adversary learns that just one tuple t is healthy, then every other tuple is healthy with probability 1
What if the tuples were independent of each other? Then still Pr[t = H] = 0.5
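The two-population example is a one-line application of Bayes' rule, sketched below: observing a single healthy tuple rules out Ω_2, so the belief about every other tuple jumps from 0.5 to 1.

```python
# Two populations: Omega1 all healthy, Omega2 all sick, chosen with prob 0.5.
prior = {"Omega1": 0.5, "Omega2": 0.5}
lik_healthy = {"Omega1": 1.0, "Omega2": 0.0}  # P(tuple is H | population)

# Prior belief that a tuple is healthy:
p_healthy = sum(prior[w] * lik_healthy[w] for w in prior)
print(p_healthy)  # 0.5

# Posterior over populations after observing one healthy tuple:
z = sum(prior[w] * lik_healthy[w] for w in prior)
posterior = {w: prior[w] * lik_healthy[w] / z for w in prior}
# Belief that a *different* tuple is healthy, given that observation:
p_other = sum(posterior[w] * lik_healthy[w] for w in posterior)
print(p_other)  # 1.0 -- dependence; with independent tuples it would stay 0.5
```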
13
Dirichlet distribution
More generally, T (of size n) is generated in two steps:
a probability vector p is drawn from a distribution D
then n elements are drawn i.i.d. according to p
D encodes the adversary's knowledge:
if the adversary has no prior, every p drawn from D is equally likely
if the adversary knows that 999 people out of 1,000 have cancer, D should be modeled so as to draw p_no(cancer) = 0.001 and p_yes(cancer) = 0.999
The Dirichlet distribution is used to model the prior over p
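The two-step generative process can be sketched with the standard library alone, using the normalized-Gamma construction of a Dirichlet sample. The parameter values are illustrative (a strong prior matching the slide's 999-in-1,000 cancer example); this is a sketch of the generative model, not the paper's code.

```python
import random

def dirichlet_sample(alphas, rng=random):
    """Draw a probability vector from Dirichlet(alphas) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def generate_table(alphas, n, values, rng=random):
    """Step 1: draw p from D. Step 2: draw n records i.i.d. from p."""
    p = dirichlet_sample(alphas, rng)
    return [rng.choices(values, weights=p)[0] for _ in range(n)]

rng = random.Random(42)
# Strong prior: ~999 of every 1,000 have cancer (scaled up for concentration)
alphas = [999_000, 1_000]  # (yes, no) -- illustrative strength
p = dirichlet_sample(alphas, rng)
assert abs(p[0] - 0.999) < 0.001  # drawn p concentrates near the known shape
table = generate_table(alphas, 100, ["yes", "no"], rng)
assert len(table) == 100
```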
14
Dirichlet distribution
D(σ_1, …, σ_k) expresses the belief that the probabilities of k rival events are x_i, given that each event has been observed σ_i − 1 times
Adversary without knowledge: D(σ_1, …, σ_k) = D(1, …, 1)
After reading a dataset with counts (σ_1 − 1, …, σ_k − 1), the adversary may update the prior to D(σ_1, …, σ_k). In this case not all probability vectors are equally likely
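The prior update described above is the standard Dirichlet–multinomial conjugate update: observed counts are simply added to the parameters. The counts below match the 12k/18k dataset used later in the adversary-class examples.

```python
def dirichlet_posterior(sigmas, counts):
    """Conjugate update: add observed counts to the Dirichlet parameters."""
    return [s + c for s, c in zip(sigmas, counts)]

# Adversary with no knowledge: D(1, 1)
prior = [1, 1]
# After reading a dataset with counts (sigma_1 - 1, sigma_2 - 1):
posterior = dirichlet_posterior(prior, [11_999, 17_999])
print(posterior)  # [12000, 18000] -- shapes are no longer all equally likely
```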
15
Dirichlet distribution
The probability vector with the maximum likelihood (the mode) has components x_i = (σ_i − 1)/(σ − k), where σ = Σ_j σ_j
As σ increases, this vector becomes more and more likely
In the limit σ → ∞ it is the only possible probability distribution
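A small sketch of the Dirichlet mode, using the standard formula x_i = (σ_i − 1)/(σ − k), valid when every σ_i > 1 (this is the textbook expression; the slide's own formula did not survive extraction). Scaling the parameters up preserves the mode while concentrating the distribution around it.

```python
def dirichlet_mode(sigmas):
    """Mode of Dirichlet(sigmas); defined when every sigma_i > 1."""
    k = len(sigmas)
    total = sum(sigmas)
    assert all(s > 1 for s in sigmas), "mode undefined otherwise"
    return [(s - 1) / (total - k) for s in sigmas]

print(dirichlet_mode([3, 3]))        # [0.5, 0.5]
print(dirichlet_mode([1201, 1801]))  # [0.4, 0.6]: same shape, larger sigma
```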
16
Other adversary knowledge
Knowledge from individuals inside the published table
Full knowledge about a subset B of the tuples in T
17
Definition
After T_pub is published, the adversary's belief in a sensitive predicate about an individual u in T is p_in
If the individual u is removed from T, the belief becomes p_out
18
Definition
p_in should not be much greater than p_out
The greater it is, the more information about an individual's sensitive predicate the adversary learns
A table does not respect ε-privacy if this gain of belief exceeds the bound set by ε
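The exact breach condition on the slide did not survive extraction, so the check below uses a simple multiplicative test, p_in > ε · p_out, purely as an assumed stand-in for the paper's actual formula. The numbers come from the later adversary-class example (p_in(flu) = .9).

```python
# Placeholder sketch of an epsilon-privacy check. The test p_in > eps * p_out
# is an assumption standing in for the paper's actual condition.
def violates(p_in, p_out, epsilon):
    """True if the belief gain from u's presence exceeds the assumed bound."""
    return p_in > epsilon * p_out

assert not violates(0.9, 0.6, 2.0)  # Class I p_out, generous bound: OK
assert violates(0.9, 0.36002, 2.0)  # Class II p_out: bound exceeded
```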
19
Adversary classes
Defined by the prior built over the distribution of sensitive values (stubbornness σ and shape of the distribution):
Class I: known σ, known shape
Class II: known σ, arbitrary shape
Class III: arbitrary σ, known shape
Class IV: arbitrary σ, arbitrary shape
20
Adversary classes — Examples
Suppose we have another dataset with 30,000 tuples: 12,000 with flu and 18,000 with cancer
Class I: σ = 30k, D(12k, 18k)
Class II: σ = 30k, arbitrary shape
Class III: arbitrary σ, distribution (.4, .6)
Class IV: arbitrary prior
Rachel is in the table. p_in(flu) = .9 for all adversaries (it depends only on the published table); p_out(flu) changes for each adversary
21
Adversary classes — Examples
Class I: p_out(flu) = (18k + 12k)/(20k + 30k) = .6
Class II: p_out(flu) = (18k + 1)/(20k + 30k) = .36002
Class III: p_out(flu) = .4
Class IV: any value
So Rachel is granted .4, 6.4, 6 and no privacy against Class I, II, III and IV adversaries, respectively
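The Class I and Class II values can be reproduced directly from the slide's arithmetic. One assumption: a published table of 20,000 tuples of which 18,000 have flu, which is consistent with both p_in(flu) = .9 and the (20k + 30k) denominator on the slide.

```python
# Reproducing p_out for Class I and II adversaries from the slide's counts.
table_n, table_flu = 20_000, 18_000  # published table (assumed: p_in = .9)
prior_n, prior_flu = 30_000, 12_000  # auxiliary dataset from the slide

p_in = table_flu / table_n
assert p_in == 0.9

# Class I: full prior D(12k, 18k) combined with the table's flu count
p_out_I = (table_flu + prior_flu) / (table_n + prior_n)
print(p_out_I)  # 0.6

# Class II: same stubbornness, worst-case shape; effectively a single
# pseudo-count lands on flu, matching the slide's (18k + 1) numerator
p_out_II = (table_flu + 1) / (table_n + prior_n)
print(p_out_II)  # 0.36002
```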
22
Generalization and ε-privacy
A set of sensitive predicates is given for each individual u
We can define a set of constraints that have to be checked during the generalization process
23
Check for Class I
Constraints R1 and R2 have to be respected
A combination of anonymity and closeness
24
Check for Class II
Constraints R1 and R2 have to be respected
A combination of anonymity and diversity
25
Check for Class III
Constraints R1 and R2 have to be respected
Only closeness
Note: ε-privacy doesn't guarantee privacy against Class IV adversaries
26
Monotonicity
For generalizations T1 and T2 of T such that T2 is coarser than T1: if T1 satisfies ε-privacy, then T2 also satisfies ε-privacy
Useful for algorithms such as Incognito, Mondrian, and the PET algorithm
All the checks shown before have time complexity O(N)
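The O(N) claim rests on the fact that every per-class check only needs the counts of sensitive values within each generalized group, which a single pass over the table provides. The sketch below is illustrative, not the paper's implementation.

```python
from collections import defaultdict

def group_counts(table):
    """One O(N) pass: per-group counts of each sensitive value."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for group_id, sensitive in table:
        counts[group_id][sensitive] += 1
        totals[group_id] += 1
    return counts, totals

# Tiny illustrative table: (generalized group id, sensitive value)
table = [(1, "flu"), (1, "flu"), (1, "cancer"), (2, "flu")]
counts, totals = group_counts(table)
print(counts[1]["flu"] / totals[1])  # per-group frequency fed into each check
```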
27
Choosing the parameters
The choice is application dependent (e.g. the US Census):
Stubbornness σ: number of individuals
Shape: distribution of sensitive values
Epsilon: between 10 and 100. Why?
28
Experimental results
The more stubbornness we have, the greater the ε needed to achieve privacy
With small values of σ the cost function is better
The average group size increases with σ
Data from the Minnesota Population Center, with nearly 3M tuples
29
Embedding prior work
ε-privacy can cover some instantiations of:
Recursive (c, 2)-diversity
Differential privacy
t-closeness
30
Conclusions
Definition of ε-privacy
Definition of realistic adversaries
Covers scenarios not taken into account by previous work
ε-privacy in the generalization process
Future work:
considering correlations between sensitive and non-sensitive values
applying ε-privacy to other algorithms