2. Attacks on Anonymized Social Networks
Setting
A social network
Edges may be private
–E.g., a “communication graph”
The study of social structure through social networks
–E.g., the small-world phenomenon
–Requires data
Common practice: anonymization
–“A rose by any other name would smell as sweet”
–An anonymized network has the same connectivity, clusterability, etc.
[Figure: network whose nodes are relabeled with random identifiers such as R3579X, Y5873T, D2893L]
Main Contribution
Raising a privacy concern
–Data is never released in a void
Proving the concern by presenting attacks
One cannot rely on anonymization
Thus, highlighting the need for mathematical rigor
–(But isn’t DP + a calibrated noise mechanism rigorous enough?)
Key Idea
Goal: given a single anonymized network, de-anonymize 2 nodes and learn whether they are connected
What is the challenge?
–Compare to breaking the anonymity of the Netflix dataset
What special kind of auxiliary data can be used?
–Hint: active attacks in cryptography
Solution
–“Steganography”
Outline
Attacks on anonymized networks: high-level description
The “Walk-Based” active attack
–Description
–Analysis
–Experiments
Passive attack
Kinds of Attacks
Active attack
Passive attack
Hybrid attack
Active Attacks - Challenges
Let G be the network and H the subgraph planted by the attacker
With high probability, H must be:
Uniquely identifiable in G
–For any G
Efficiently locatable
–A tractable instance of subgraph isomorphism
But undetectable
–From the point of view of the data curator
Active Attacks - Approaches
Basic idea: H is randomly generated
–Start with k nodes, add edges independently at random
Two variants:
–k = Θ(log n) de-anonymizes Θ(log^2 n) users
–k = Θ(√log n) de-anonymizes Θ(√log n) users
H needs to be “more unique”
Achieved by “thin” attachment of H to G
The “Walk-based” attack: better in practice
The “Cut-based” attack: matches the theoretical bound
The Walk-Based Attack - Simplified Version
Construction:
–Pick target users W = {w_1, …, w_k}
–Create new users X = {x_1, …, x_k} and a random subgraph G[X] = H
–Add edges (x_i, w_i)
Recovery:
–Find H in G ↔ requires that no other subgraph of G is isomorphic to H
–Label H as x_1, …, x_k ↔ requires that H has no nontrivial automorphisms
–Find w_1, …, w_k
The Walk-Based Attack - Full Version
Construction:
–Pick target users W = {w_1, …, w_b}
–Create new users X = {x_1, …, x_k} and H
–Connect each w_i to a unique subset N_i of X
–Between H and G - H: add Δ_i edges from x_i, where d_0 ≤ Δ_i ≤ d_1 = O(log n)
–Inside H: add edges (x_i, x_{i+1}) to help find H
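Construction of H
[Figure: X holds k = (2+δ) log n new nodes x_1, x_2, x_3, …; W holds b = O(log^2 n) targets w_1, …, w_4, …; each w_i attaches to its subset N_i; x_i has Δ_i edges into G; the total degree of x_i is Δ'_i]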
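The construction can be sketched in code. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are not on the slides: the graph is a dict mapping each node to a set of neighbours, non-path edges inside H are added with probability 1/2, each N_i has between 1 and `max_subset` members, and the names `plant_walk_based`, `outsiders`, and `max_subset` are hypothetical.

```python
import itertools
import random

def plant_walk_based(adj, targets, k, d0, d1, max_subset=3, seed=None):
    """Plant H on k new nodes in the undirected graph adj (node -> set of neighbours).

    Returns (xs, n_sets): the new nodes in order, and for every target the
    tuple of indices into xs that forms its subset N_i.
    """
    rng = random.Random(seed)
    outsiders = [v for v in adj if v not in targets]  # nodes of G that are not targets
    xs = [("x", i) for i in range(k)]                 # hypothetical labels for the new users
    for x in xs:
        adj[x] = set()

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    # Inside H: path edges (x_i, x_{i+1}) always; any other pair with
    # probability 1/2 (the value 1/2 is an assumption of this sketch).
    for i, j in itertools.combinations(range(k), 2):
        if j == i + 1 or rng.random() < 0.5:
            add_edge(xs[i], xs[j])

    # Connect each target w_i to its own distinct subset N_i of X.
    subsets = [c for r in range(1, max_subset + 1)
               for c in itertools.combinations(range(k), r)]
    if len(subsets) < len(targets):
        raise ValueError("not enough distinct subsets for the targets")
    n_sets = dict(zip(targets, rng.sample(subsets, len(targets))))
    for w, n_i in n_sets.items():
        for i in n_i:
            add_edge(xs[i], w)

    # Pad each x_i with edges to random non-target nodes so that its
    # external degree Delta_i lies in [d0, d1].
    x_set = set(xs)
    for x in xs:
        ext = len(adj[x] - x_set)
        if ext > d1:
            raise ValueError("subset assignment exceeds d1; retry with another seed")
        delta = rng.randint(max(d0, ext), d1)
        for v in rng.sample(outsiders, delta - ext):
            add_edge(x, v)
    return xs, n_sets
```

The attacker keeps `xs`, `n_sets`, and the final degrees of the x_i; these are the only facts needed later to relocate H in the anonymized release.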
Recovering H
Search G based on:
–Degrees Δ'_i
–Internal structure of H
[Figure: search tree T with a root and nodes α_1, …, α_l, each mapped to a node f(α_1), …, f(α_l) of G]
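A degree-and-structure search of this kind can be sketched as a depth-first walk. The code below is an illustration under assumptions, not the authors' algorithm verbatim: the graph is a dict of neighbour sets, indices are 0-based, and the names `find_h`, `degrees`, `h_edges`, and `tree_size` are hypothetical. Each accepted extension corresponds to one node of the search tree T.

```python
def find_h(adj, degrees, h_edges):
    """Search adj (node -> set of neighbours) for node lists that look like x_0..x_{k-1}.

    degrees[i] is the total degree of x_i. h_edges holds index pairs (i, j)
    with i < j for every edge inside H, including the path edges (i, i + 1).
    Returns (matches, tree_size).
    """
    k = len(degrees)
    matches = []
    tree_size = 0

    def extend(path):
        nonlocal tree_size
        j = len(path)
        if j == k:
            matches.append(list(path))
            return
        # Children of the root may be any node of G; deeper children must be
        # neighbours of the last path node, because (x_{j-1}, x_j) is an edge.
        candidates = adj[path[-1]] if path else adj
        for v in candidates:
            if v in path or len(adj[v]) != degrees[j]:
                continue  # wrong degree, or node already used on this path
            # Adjacency to every earlier path node must agree with H.
            if all(((i, j) in h_edges) == (path[i] in adj[v]) for i in range(j)):
                tree_size += 1
                path.append(v)
                extend(path)
                path.pop()

    extend([])
    return matches, tree_size
```

If the planted H is unique in G, `matches` holds exactly one list, and the targets can then be read off as the nodes outside that list whose neighbours within it form one of the recorded subsets N_i.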
Analysis
Theorem 1 [Correctness]: With high probability, H is unique in G. Formally:
–H is a random subgraph
–G is arbitrary
–Edges between H and G - H are arbitrary
–There are edges (x_i, x_{i+1})
Then WHP no other subgraph of G is isomorphic to H.
Theorem 2 [Efficiency]: The search tree T does not grow too large. Formally:
–For every ε > 0, WHP the size of T is O(n^(1+ε))
Theorem 1 [Correctness]
H is unique in G. Two cases:
–For no S disjoint from X is G[S] isomorphic to H
–For no S overlapping X is G[S] isomorphic to H
Case 1:
–S = (s_1, …, s_k): nodes in G - H
–ε_S: the event that s_i ↔ x_i is an isomorphism
–By the union bound over all such S, WHP no ε_S occurs [formulas not preserved in the source]
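The formulas for Case 1 did not survive on this slide. A hedged reconstruction of the standard argument follows, assuming that every non-path pair inside H is an edge independently with probability 1/2, that logarithms are base 2, and that k = (2+δ) log n; these assumptions are not stated on the slide itself.

```latex
\begin{align*}
% Only the \binom{k}{2} - (k-1) non-path pairs of H are random.
\Pr[\varepsilon_S] &\le 2^{-\left(\binom{k}{2} - (k-1)\right)} = 2^{-\binom{k-1}{2}} \\
% Union bound over at most n^k ordered choices S = (s_1, \dots, s_k).
\Pr\Big[\bigcup_S \varepsilon_S\Big] &\le n^k \cdot 2^{-\binom{k-1}{2}}
  = 2^{\,k \log n - \frac{(k-1)(k-2)}{2}} \\
% Substitute k \log n = k^2 / (2+\delta).
k \log n - \frac{(k-1)(k-2)}{2} &= -\frac{\delta}{2(2+\delta)}\,k^2 + \frac{3k}{2} - 1
  \;\longrightarrow\; -\infty
\end{align*}
```

The quadratic term dominates, so the probability of a disjoint second copy of H vanishes as n grows.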
Theorem 1 continued
Case 2: S and X overlap.
Observation: H does not have much internal symmetry
Claim (a): WHP, there are no disjoint isomorphic subgraphs of size c_1 log k in H. Assume this from now on.
Claim (b): Most of A goes to B (except c_1 log k nodes), and most of Y is fixed under f (except c_2 log k nodes)
[Figure: G containing X = B ∪ Y and the overlapping copy A ∪ Y, with f mapping A ∪ Y to B ∪ Y]
Theorem 1 - Proof
What is the probability of an overlapping second copy of H in G?
f_ABCD : A ∪ Y → B ∪ Y = X
Let j = |A| = |B| = |C|
ε_ABCD: the event that f_ABCD is an isomorphism
#random edges inside C ≥ j(j-1)/2 - (j-1)
#random edges between C and Y' ≥ |Y'|·j - 2j
Probability that the random edges match those of A:
Pr[ε_ABCD] ≤ 2^(-#random edges)
[Figure: the sets A, B, C, D and Y' in relation to X]
Theorem 2 [Efficiency]
Claim: The size of the search tree T is near-linear.
The proof uses similar methods:
–Define a random variable Γ = #nodes in T
–Γ = Γ' + Γ'' = #paths in G - H + #paths passing through H
–This time we bound E(Γ') [and similarly E(Γ'')]
–The number of paths of length j with max degree d_1 is bounded
–The probability that such a path has the correct internal structure is bounded
E(Γ') ≤ #paths × Pr[correct internal structure]
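One way to instantiate this bound is sketched below. It is a rough illustration rather than the proof from the talk, and it assumes that a path of j nodes is counted by its start node and at most d_1 choices per step, that non-path pairs of H are random with probability 1/2, and that logarithms are base 2.

```latex
\begin{align*}
% At most n start nodes and d_1 choices per step; \binom{j-1}{2} random pairs must match.
\mathbb{E}(\Gamma') &\le \sum_{j=1}^{k} n\, d_1^{\,j-1} \cdot 2^{-\binom{j-1}{2}}
  = n \sum_{j=1}^{k} 2^{\,(j-1)\log d_1 - \binom{j-1}{2}} \\
% The exponent is maximized near j - 1 \approx \log d_1.
&\le n \cdot k \cdot 2^{O(\log^2 d_1)}
  = n \cdot 2^{O((\log\log n)^2)} \quad \text{for } d_1, k = O(\log n)
\end{align*}
```

Since 2^{O((log log n)^2)} grows more slowly than n^ε for every fixed ε > 0, this is consistent with the O(n^(1+ε)) size claimed in Theorem 2.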
Experiments
Data: network of friends on LiveJournal
–4.4·10^6 nodes, 77·10^6 edges
Uniqueness: with 7 nodes, an average of 70 nodes can be de-anonymized
–Although log(4.4·10^6) ≈ 15
Efficiency: |T| is typically ~9·10^4
Detectability:
–Only 7 nodes
–Many subgraphs of 7 nodes in G are dense and well-connected
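The figures on this slide can be sanity-checked with a few lines of arithmetic. The snippet below is only a check, assuming that the slide's log is the natural logarithm (the base-2 value is noticeably larger) and using 2^k - 1 as a simple upper limit on the number of distinct non-empty subsets N_i available to k attacker nodes; the variable names are hypothetical.

```python
import math

n = 4.4e6  # LiveJournal nodes, from the slide
k = 7      # attacker nodes, from the slide

ln_n = math.log(n)             # natural logarithm of n
log2_n = math.log2(n)          # base-2 logarithm of n
nonempty_subsets = 2 ** k - 1  # distinct non-empty subsets of k nodes

print(f"ln n = {ln_n:.1f}, log2 n = {log2_n:.1f}, subsets = {nonempty_subsets}")
```

Seven nodes is well below either logarithm of n, yet seven nodes still offer more distinct subsets than the roughly 70 targets reported.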
Probability that H is Unique
Passive Attack
H is a coalition of existing users, recovered by the same search algorithm
Nothing is guaranteed, but it works in practice
Summary & Open Questions
One cannot rely on anonymization of social networks
Major open problem: what (if anything) can be done in the non-interactive model?
–The released object must answer many questions accurately while preserving privacy
–Noise must increase with the number of questions [DN03]
Novel models
Any Questions? Thank you
Passive Attack - Results