1
Data Anonymization: Refined Privacy Models
2
Outline
The k-anonymity specification is not sufficient
Enhancing privacy: l-diversity, t-closeness, MaxEnt analysis
3
Linking the dots… Countering the privacy attack
k-anonymity addresses one type of attack, the linking attack. What about other types of attacks?
4
K-anonymity
K-anonymity tries to counter this type of privacy attack: individual -> quasi-identifier -> sensitive attributes.
Example: in a 4-anonymized table, at least 4 records share the same quasi-identifier.
Quasi-identifier: the attributes through which the attacker can find the link using other public data.
Typical method: domain-specific generalization.
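As a rough illustration of what the specification requires, here is a minimal Python sketch (toy table and column names, not data from the slides): group records by their quasi-identifier and check that every group has at least k members.

```python
# Minimal k-anonymity check on a list-of-dicts table (hypothetical data).
from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """True if every combination of QI values appears at least k times."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(count >= k for count in groups.values())

# Toy data: after generalization, 4 records share the same QI combination.
records = [
    {"zip": "130**", "age": "30-40", "disease": "cancer"},
    {"zip": "130**", "age": "30-40", "disease": "cancer"},
    {"zip": "130**", "age": "30-40", "disease": "flu"},
    {"zip": "130**", "age": "30-40", "disease": "cancer"},
]
print(is_k_anonymous(records, ["zip", "age"], k=4))  # True
```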
5
More Privacy Problems
All existing k-anonymity approaches assume that privacy is protected as long as the k-anonymity specification is satisfied. But there are other problems:
Homogeneity in sensitive attributes
Background knowledge about individuals
…
6
Problem1: Homogeneity Attack
We know Bob is in the table. If Bob lives in zip code 13053 and is 31 years old, then Bob surely has cancer: every record in his block has the same sensitive value.
7
Problem 2: Background knowledge attack
Japanese have an extremely low incidence of heart disease. Umeko is Japanese, lives in zip code 13068, and is 21 years old, so Umeko has a viral infection with high probability.
8
The cause of these two problems
The sensitive-attribute values in some blocks do not have sufficient diversity.
Problem 1: no diversity at all.
Problem 2: background knowledge helps the attacker reduce the diversity.
9
Major contributions of l-diversity
Formally analyze the privacy model of k-anonymity with the Bayes-optimal privacy model.
Basic idea: increase the diversity of sensitive attribute values in each anonymized block.
Instantiation and implementation of the l-diversity concept:
Entropy l-diversity
Recursive (c,l)-diversity
More…
10
Modeling the attacks
What is a privacy attack? Guessing the sensitive values (a probability estimate).
Prior belief: without seeing the table, what can we guess? S: sensitive attribute, Q: quasi-identifier; prior: P(S=s|Q=q). Example: Japanese vs. heart disease.
Observed belief: after observing the table, our belief changes. T*: anonymized table; observed belief: P(S=s | Q=q, T* is known).
Effective privacy attacks: the table T* should change the belief a lot.
Prior is small, observed belief is large -> positive disclosure.
Prior is large, observed belief is small -> negative disclosure.
11
The definition of observed belief
A q*-block is a k-anonymized group whose records share q* as their quasi-identifier. n(q*) is the number of records in the block, n(q*, s) is the number of those records with S=s, and f(s|q*) = n(q*, s)/n(q*) is the in-block proportion of s. The background knowledge is the prior p(S=s|Q=q). The observed belief weights each candidate value s' by its block count and the attacker's prior, roughly β(q,s,T*) = n(q*,s)·f(s|q) / Σ_s' n(q*,s')·f(s'|q).
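A small sketch of how an attacker could compute this observed belief for one q*-block, using the simplified weighting above; the block contents and the prior values are invented for illustration.

```python
# Observed belief for one q*-block: weight each sensitive value by its block
# count n(q*, s') times the attacker's prior f(s'|q), then normalize.
def observed_belief(block_values, prior, s):
    weights = {v: block_values.count(v) * prior[v] for v in set(block_values)}
    return weights[s] / sum(weights.values())

block = ["heart disease", "heart disease", "viral infection", "cancer"]
prior = {"heart disease": 0.05, "viral infection": 0.60, "cancer": 0.35}  # made-up prior for q
print(round(observed_belief(block, prior, "viral infection"), 2))  # belief rises well above 1/4
```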
12
Interpreting the privacy problem of k-anonymity
Derived from the relationship between the observed belief and privacy disclosure.
(Positive) extreme situation: β(q,s,T*) -> 1 => positive disclosure.
Possibility 1: n(q*, s') << n(q*, s) for the other values s' => lack of diversity.
Possibility 2: strong background knowledge helps to eliminate the other items. Knowledge: except one s, the other values s' are not likely true when Q=q, i.e. f(s'|q) -> 0. This minimizes the contribution of the other items and drives Σ_{s'≠s} n(q*, s')·f(s'|q) -> 0.
13
Negative disclosure: β(q,s,T*) -> 0
β(q,s,T*) -> 0 when either n(q*, s) -> 0 or f(s|q) -> 0. The Umeko example: her prior for heart disease is nearly 0, so the attacker learns she does not have it.
14
How to address the problems?
Ensure that n(q*, s') << n(q*, s) does not hold, so the attacker needs more knowledge to get rid of the other items: "damaging instance-level knowledge" that drives f(s'|q) -> 0.
If there are L distinct sensitive values in the q*-block, the attacker needs L-1 pieces of damaging knowledge to rule out the L-1 other possible sensitive values.
This is the principle of l-diversity.
15
L-diversity: how to evaluate it?
Entropy l-diversity: every q*-block satisfies the condition -Σ_s f(s|q*) log f(s|q*) ≥ log(L), i.e. the entropy of the sensitive values in the q*-block is at least the entropy of L uniformly distributed distinct sensitive values.
*We like a uniform distribution of sensitive values over each block!
**Guarantees every q*-block has at least L distinct sensitive values.
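A minimal sketch of the entropy l-diversity check on a single block, assuming the entropy ≥ log(L) condition stated above; the example blocks are made up.

```python
# Entropy l-diversity check for one q*-block.
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l):
    """True if -sum f(s|q*) log f(s|q*) >= log(l), with f(s|q*) the in-block frequency."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

print(is_entropy_l_diverse(["flu", "cancer", "flu", "hepatitis"], l=3))  # False: 3 values but not uniform enough
print(is_entropy_l_diverse(["flu", "cancer", "hepatitis"], l=3))         # True: 3 values, uniform
```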
16
Other extensions
Entropy l-diversity is too restrictive:
Some positive disclosures are acceptable; in practice some sensitive values may have very high frequency and are not actually sensitive, e.g. "normal" in a disease symptom attribute.
The log(L) threshold cannot be satisfied in some cases.
Principle for relaxing the strong condition: a uniform distribution of sensitive values is good; when we cannot achieve it, we make most value frequencies as close as possible, especially the most frequent value.
Recursive (c,l)-diversity is proposed: control the gap between the most frequent value and the least frequent values, r1 < c(r_l + r_{l+1} + … + r_m), where r_i is the i-th largest sensitive-value frequency in the block.
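The frequency condition can be checked directly; the sketch below assumes r_i denotes the i-th largest in-block frequency and uses toy data.

```python
# Recursive (c, l)-diversity check: r1 < c * (r_l + r_{l+1} + ... + r_m).
from collections import Counter

def is_recursive_cl_diverse(sensitive_values, c, l):
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:                      # fewer than l distinct values can never satisfy it
        return False
    return r[0] < c * sum(r[l - 1:])

block = ["flu", "flu", "flu", "cancer", "hepatitis", "hepatitis"]
print(is_recursive_cl_diverse(block, c=2, l=2))   # r = [3, 2, 1]; 3 < 2*(2+1) -> True
```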
17
Implementation of l-diversity
Build an algorithm with a structure similar to k-anonymity algorithms: use a domain generalization hierarchy, but check l-diversity instead of k-anonymity.
The Anatomy approach: publish the data without generalizing the quasi-identifiers.
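A rough sketch of the Anatomy idea (toy records; how groups are chosen is omitted): quasi-identifiers are published exactly, but sensitive values are linked only to a group id rather than to individual records.

```python
# Anatomy-style publication: one table with exact QIs, one with sensitive values,
# joined only through a group id.
records = [
    ("13053", 31, "cancer"), ("13068", 21, "flu"),
    ("13053", 34, "hepatitis"), ("13068", 24, "cancer"),
]
groups = {1: records[:2], 2: records[2:]}   # in practice each group is built to be l-diverse

qi_table = [(gid, z, age) for gid, rows in groups.items() for (z, age, _) in rows]
sa_table = [(gid, disease) for gid, rows in groups.items() for (_, _, disease) in rows]
print(qi_table)   # QIs kept exact, linked to a group
print(sa_table)   # sensitive values linked to the group, not to a record
```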
18
Discussion: problems not addressed
Skewed data: a common problem for l-diversity and k-anonymity; it makes l-diversity very inefficient.
Balance between utility and privacy: the entropy l-diversity and (c,l)-diversity methods do not guarantee good data utility; the Anatomy method is much better.
19
t-closeness
Addresses two types of attacks: the skewness attack and the similarity attack.
20
Skewness attack
The probability of cancer in the original table is low, but the probability of cancer in a block of the anonymized table can be much higher than the global probability.
21
Semantic similarity attack
All sensitive values in the block are semantically similar: the attacker learns that the salary is low and that the person has some kind of stomach disease, even without learning the exact value.
22
The root of these two problems
The difference between the global distribution of sensitive values and the local distribution within some block.
23
The proposal of t-closeness
Make the global and local distributions as similar as possible.
Evaluating distribution similarity must account for semantic similarity and density: {3k, 4k, 5k} is denser than {6k, 8k, 11k}.
The Earth Mover's Distance is used as the similarity measure.
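For a numeric sensitive attribute, the ordered-distance Earth Mover's Distance has a simple closed form (sum of absolute cumulative differences over the ordered domain, divided by m-1); the sketch below applies it to the {3k, 4k, 5k} example under that assumption.

```python
# EMD between a block's distribution and the whole-table distribution for an
# ordered (numeric) sensitive attribute.
from collections import Counter
from itertools import accumulate

def emd_ordered(block_values, table_values):
    support = sorted(set(table_values))            # ordered domain of the sensitive attribute
    m = len(support)
    p, q = Counter(block_values), Counter(table_values)
    diffs = [p[v] / len(block_values) - q[v] / len(table_values) for v in support]
    return sum(abs(c) for c in accumulate(diffs)) / (m - 1)

salaries = [3, 4, 5, 6, 7, 8, 9, 10, 11]           # whole-table salaries (in $1000s)
low_block = [3, 4, 5]                               # a block containing only low salaries
print(round(emd_ordered(low_block, salaries), 3))   # 0.375: far from the global distribution
```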
24
Privacy MaxEnt
Quantify the privacy under background knowledge attacks, so that we know how vulnerable an anonymized dataset is under different assumptions about the attacker's knowledge.
25
All attacks are based on the attacker's background knowledge
Knowledge from the table: the local/global distribution of sensitive values can always be calculated.
Common knowledge: useful common knowledge should be consistent with the knowledge from the table.
An attack is an estimate of P(S|Q): find Q -> S associations with high confidence. Both a higher and a lower P(S|Q) than common knowledge reveal information.
26
Quantifying privacy
We need to estimate the conditional probability P(S|Q, B), where B is a bucket (an anonymized group); this is the most interesting quantity for the attacker.
27
Without background knowledge
P(S|Q,B) is estimated with the proportion of S in the bucket B.
With background knowledge it is more complicated. The paper proposes a Maximum Entropy based method to estimate P(S|Q,B) under different kinds of attacker background knowledge, modeling the background knowledge as constraints.
28
Types of background knowledge
Rule-based knowledge: P(s | q) = 1, P(s | q) = 0.
Probability-based knowledge: P(s | q) = 0.2, P(s | Alice) = 0.2.
Vague background knowledge: 0.3 ≤ P(s | q) ≤ 0.5.
Miscellaneous types: P(s | q1) + P(s | q2) = 0.7; one of Alice and Bob has "Lung Cancer".
29
Maximum Entropy Estimation
The MaxEnt principle: if you don't know the distribution, assume it is uniform; if you know part of the distribution, still model the remaining part as uniformly as possible.
A uniform distribution has maximum entropy, so maximizing entropy pushes the distribution toward uniform.
30
MaxEnt for privacy analysis
Maximize the entropy H(S|Q,B). Since H(S|Q,B) = H(Q,S,B) - H(Q,B) and H(Q,B) is determined by the published table, this is equivalent to maximizing H(Q,S,B); H(Q,S,B) is maximized when P(Q,S,B) is uniform.
The problem: given a table D', solve the following optimization: find an assignment of P(Q,S,B) for all (Q,S,B) combinations that maximizes H(Q,S,B) while satisfying a list of constraints (including the background knowledge).
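A minimal sketch of this optimization using scipy (a toy domain and invented constraints, not the paper's implementation): maximize the entropy of P(Q,S,B) subject to linear equality constraints that stand in for the table and the background knowledge.

```python
# MaxEnt estimate of P(Q, S, B): minimize the negative entropy under constraints.
import numpy as np
from scipy.optimize import minimize

cells = [(q, s, b) for q in ("q1", "q2") for s in ("s1", "s2") for b in (1,)]  # tiny toy domain

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.7},     # e.g. P(Q=q1, B=1) = 0.7 from the table
    {"type": "eq", "fun": lambda p: p[0] - 0.2},            # e.g. background knowledge P(q1, s1, 1) = 0.2
]
x0 = np.full(len(cells), 1.0 / len(cells))
res = minimize(neg_entropy, x0, bounds=[(0.0, 1.0)] * len(cells), constraints=constraints)
print(dict(zip(cells, np.round(res.x, 3))))  # the unconstrained cells come out uniform
```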
31
Find constraints
Constraints come from three sources: knowledge about the data distribution, constraints from the data itself, and knowledge about individuals.
32
Modeling background knowledge about distributions
The knowledge has the form P(S|Qv), where Qv is a part of Q, e.g. P(Breast cancer | male) = 0. It becomes a constraint via P(Qv, S) = P(S|Qv)·P(Qv), summed over the remaining attributes Q- = Q - Qv and over all buckets.
In the previous example, if P(flu | male) = 0.3:
P({male, college}, flu, 1) + P({male, highschool}, flu, 1) + P({male, college}, flu, 3) + P({male, graduate}, flu, 3) = 0.3·P(male) = 0.3·6/10 = 0.18.
33
Constraints from the Data
Identify invariants from the disguised data: the QI-invariant equation, the SA-invariant equation, and the zero-invariant equation: P(q, s, b) = 0 if q is not in bucket b or s is not in bucket b.
34
Knowledge about individuals
Can be modeled with similar methods. Alice: (i1, q1), Bob: (i4, q2), Charlie: (i9, q5).
Knowledge 1: Alice has either s1 or s4. Constraint: P(s1 | Alice) + P(s4 | Alice) = 1.
Knowledge 2: Two people among Alice, Bob, and Charlie have s4. Constraint: P(s4 | Alice) + P(s4 | Bob) + P(s4 | Charlie) = 2.
35
Summary
Attack analysis is an important part of anonymization. Background knowledge modeling is the key to attack analysis, but there is no way to enumerate all possible background knowledge…