Published by Mervin Montgomery. Modified over 9 years ago.
1
Resisting Structural Re-identification in Anonymized Social Networks
Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Philipp Weis
University of Massachusetts Amherst
Session: Privacy & Authentication, VLDB 2008
2011-01-21, presented by Yongjin Kwon
2
Copyright 2011 by CEBT
Outline
Introduction
Adversary Knowledge Models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
Disclosure in Real Networks
Anonymity in Random Graphs
Graph Generalization for Anonymization
Conclusion
3
Introduction
Large amounts of data are held in various kinds of storage:
– Supermarket transactions
– Web server logs
– Sensor data
– Interactions in social networks (email, Twitter, …)
Data owners publish sensitive information to facilitate research.
The goal is to reveal as much important information as possible while preserving the privacy of the individuals in the data.
Analysts may find valuable information in personal data.
4
Introduction (Cont’d)
A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006]
AOL collected 20 million Web search queries and published them.
Although the company naïvely anonymized the data, the identity of AOL user “No. 4417749” was revealed:
“Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.”
A serious privacy risk!
5
Introduction (Cont’d)
Potential privacy risks in network data
Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs [Sexually Transmitted Infections, 2002]
– A social network, representing a set of individuals related by sexual contacts and shared drug injections, was published in order to analyze how HIV spreads.
Enron Email Dataset (http://www.cs.cmu.edu/~enron/)
– The email collection was released for investigation.
– Because of privacy issues, it is the only “real” email collection.
6
Introduction (Cont’d)
Attacks on naïvely anonymized network data
Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography [WWW 2007]
Active Attack
– An adversary chooses a set of targets, creates a small number of fake nodes with edges to these targets, and constructs a highly identifiable pattern of links among the new nodes.
– After the network is released, the adversary can recognize the pattern and the fake nodes, and reveal sensitive information about the targets.
Passive Attack
– Most vertices in network data belong to a small, uniquely identifiable subgraph.
– An adversary may collude with friends to form a coalition and identify additional nodes connected to distinct subsets of the coalition.
7
Introduction (Cont’d)
An adversary with some (structural) background knowledge may compromise the privacy of some victims.
Naïve anonymization is NOT sufficient!
A new way is needed to resist malicious attempts to re-identify individuals in published network data.
Points to consider
– Types of adversary knowledge
– A theoretical treatment of privacy risks
– A way of preserving privacy while maintaining high utility of the data
8
Adversary Knowledge Models
The adversary’s background knowledge is modeled as the “correct” answer to a restricted knowledge query about the target.
The adversary uses the query answer to refine the set of feasible candidates for the target.
Three knowledge models
– Vertex Refinement Queries
– Subgraph Queries
– Hub Fingerprint Queries
9
Vertex Refinement Queries
These queries report on the local structure of the graph around the “target” node.
[Figure: target node B, annotated with the degree of B and the degrees of the neighbors of B]
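The two quantities named on this slide (the degree of the target, then the degrees of its neighbors) can be sketched as a recursive query. This is a minimal illustration, not the authors' code: the function name `vertex_refinement`, the level numbering (level 1 is the degree, level 2 the sorted neighbor degrees) and the five-node example graph are assumptions made here, and the graph is not the one drawn in the slide's figure.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def vertex_refinement(graph: Graph, node: str, level: int):
    """Answer a vertex refinement query of the given level for `node`.

    Assumed convention: level 0 carries no information, level 1 is the
    node's degree, and level i > 1 is the sorted collection of level i-1
    answers of the node's neighbors.
    """
    if level == 0:
        return ()
    if level == 1:
        return len(graph[node])
    return tuple(sorted(vertex_refinement(graph, nb, level - 1)
                        for nb in graph[node]))


# Hypothetical example graph with edges A-B, B-C, B-D, D-E.
example: Graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

degree_of_b = vertex_refinement(example, "B", 1)        # degree of B
neighbor_degrees = vertex_refinement(example, "B", 2)   # degrees of B's neighbors
```

Each higher level only refines the previous one, so an adversary who knows a higher-level answer can never be left with more candidates than at a lower level.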
10
Vertex Refinement Queries (Cont’d)
Relative Equivalence
If the adversary knows the answer of [query missing in source], then G can be quickly re-identified in the anonymized graph!
[Figure: example graph with nodes A, B, C, D, E, F, G, H]
11
Subgraph Queries
Two drawbacks of vertex refinement queries
– They always return “correct” information.
– The amount of knowledge they describe depends on the degree of the target node.
Subgraph queries assert the existence of a subgraph around the “target” node.
Assume that the adversary knows a given number of edge facts around the target node.
[Figure: subgraphs around target node B with 3, 4, and 5 edge facts]
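A subgraph query can be pictured as a small pattern with one distinguished node standing for the target; every node of the published graph that can play that role is a candidate. The sketch below checks this by brute force over all placements of the pattern, which is only workable for tiny examples. The names `matches` and `candidates`, the edge-list encoding of the query, and the example graph are assumptions made for illustration.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]
Edge = Tuple[str, str]


def matches(graph: Graph, query: List[Edge], root: str, target: str) -> bool:
    """True if the query pattern embeds in `graph` with `root` mapped to `target`.

    Every query edge (an "edge fact") must exist in the graph; edges of the
    graph that the query does not mention are allowed.
    """
    others = list({u for edge in query for u in edge} - {root})
    pool = [v for v in graph if v != target]
    for image in permutations(pool, len(others)):
        mapping = dict(zip(others, image))
        mapping[root] = target
        if all(mapping[v] in graph[mapping[u]] for u, v in query):
            return True
    return False


def candidates(graph: Graph, query: List[Edge], root: str) -> Set[str]:
    """All nodes that could be the target, given the query."""
    return {x for x in graph if matches(graph, query, root, x)}


# Hypothetical example graph with edges A-B, B-C, B-D, D-E.
example: Graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

# Two edge facts: the target has at least two neighbors.
two_facts = [("t", "x"), ("t", "y")]
# Three edge facts: the target has at least three neighbors.
three_facts = [("t", "x"), ("t", "y"), ("t", "z")]

cands_two = candidates(example, two_facts, "t")
cands_three = candidates(example, three_facts, "t")
```

In this toy graph the third edge fact shrinks the candidate set from two nodes to one, which is the effect the edge-fact count is meant to capture.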
12
Hub Fingerprint Queries
A hub is a node with high degree and high betweenness centrality.
Hubs are easily re-identified by an adversary.
A hub fingerprint for a node is the vector of its distances to the hubs, limited to observable hub connections.
Closed world: a missing hub connection means the hub is not reachable within distance 1.
Open world: a missing hub connection means the adversary's knowledge is incomplete.
[Figure: example graph with nodes A, B, C, D, E, F, G, H and the hubs marked]
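One way to make the fingerprint and the two interpretations concrete is a breadth-first search cut off at the distance limit. This is a sketch under stated assumptions: the names `hub_fingerprint` and `candidates`, the use of `None` for a hub that is not observed within the limit, and the example graph with its chosen hubs are all illustrative and not taken from the slides.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]
Fingerprint = Tuple[Optional[int], ...]


def hub_fingerprint(graph: Graph, node: str, hubs: List[str], limit: int) -> Fingerprint:
    """Distances from `node` to each hub, or None if farther than `limit`."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:          # do not explore beyond the limit
            continue
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return tuple(dist.get(h) for h in hubs)


def candidates(graph: Graph, hubs: List[str], limit: int,
               observed: Fingerprint, closed_world: bool) -> Set[str]:
    """Nodes whose fingerprint is consistent with the observed one.

    Closed world: None means "definitely not within the limit", so the
    fingerprints must be equal. Open world: None means "unknown", so only
    the known entries have to agree.
    """
    result = set()
    for y in graph:
        fp = hub_fingerprint(graph, y, hubs, limit)
        if closed_world:
            ok = fp == observed
        else:
            ok = all(o is None or o == f for o, f in zip(observed, fp))
        if ok:
            result.add(y)
    return result


# Hypothetical example graph with edges A-B, B-C, B-D, D-E; hubs chosen by degree.
example: Graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
hubs = ["B", "D"]

fp_a = hub_fingerprint(example, "A", hubs, 1)
closed = candidates(example, hubs, 1, fp_a, closed_world=True)
opened = candidates(example, hubs, 1, fp_a, closed_world=False)
```

The open-world candidate set is always at least as large as the closed-world one, because unknown entries rule out nothing.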
13
Disclosure in Real Networks
Experiments on the impact of external information
Three network data sets
– Hep-Th: co-author graph, taken from the arXiv archive
– Enron: “real” email dataset, collected by the CALO Project
– Net-trace: IP-level network trace collected at a major university
Consider each node in turn as a target.
Compute the candidate set for the target.
– Smaller candidate set: more vulnerable!
Characterize how many nodes are protected and how many are re-identifiable.
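The procedure on this slide (take each node as the target, compute its candidate set, then count how many nodes are exposed) can be sketched by grouping nodes that give the same query answer. The function names, the threshold parameter `k`, and the toy graph are assumptions for illustration; the real experiments use the three data sets listed above.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Set

Graph = Dict[str, Set[str]]
Query = Callable[[Graph, str], Hashable]


def degree_query(graph: Graph, node: str) -> int:
    """Knowledge: the degree of the target."""
    return len(graph[node])


def neighbor_degree_query(graph: Graph, node: str) -> tuple:
    """Knowledge: the degrees of the target's neighbors."""
    return tuple(sorted(len(graph[nb]) for nb in graph[node]))


def candidate_set_sizes(graph: Graph, query: Query) -> Dict[str, int]:
    """For every node taken as target, the number of nodes with the same answer."""
    answers = {v: query(graph, v) for v in graph}
    class_sizes = Counter(answers.values())
    return {v: class_sizes[answers[v]] for v in graph}


def vulnerable_nodes(graph: Graph, query: Query, k: int = 1) -> Set[str]:
    """Nodes whose candidate set has at most k members (k is a chosen threshold)."""
    sizes = candidate_set_sizes(graph, query)
    return {v for v, size in sizes.items() if size <= k}


# Hypothetical example graph with edges A-B, B-C, B-D, D-E.
example: Graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

exposed_by_degree = vulnerable_nodes(example, degree_query)
exposed_by_neighbors = vulnerable_nodes(example, neighbor_degree_query)
```

In the toy graph the richer query exposes one more node than the degree alone, mirroring the slide's point that smaller candidate sets mean more vulnerable targets.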
14
Disclosure in Real Networks (Cont’d)
Vertex Refinement Queries
15
Disclosure in Real Networks (Cont’d)
Subgraph Queries
Two strategies to build subgraphs
– Sampled Subgraph
– Degree Subgraph
16
Disclosure in Real Networks (Cont’d)
Hub Fingerprint Queries
Hubs: the five highest-degree nodes (Enron), the ten highest-degree nodes (Hep-Th, Net-trace)
17
Anonymity in Random Graphs
Theoretical approach to privacy risk, using random graphs
Erdős-Rényi model (ER model) with n nodes and edge connection probability p
– Asymptotic analysis of robustness against knowledge attacks
Sparse ER graphs: robust against [query symbol missing in source] for any [index missing in source]
Dense ER graphs: robust against [query symbol missing in source], but vulnerable against [query symbol missing in source]
Super-dense ER graphs: vulnerable against [query symbol missing in source]
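The asymptotic results themselves cannot be reproduced from this transcript, but the ER model is simple enough to experiment with. The sketch below generates a graph with n nodes and edge probability p and measures what fraction of nodes is uniquely identified by a given query. It is an empirical illustration only, not the analysis from the talk; the function names, the two queries, and the values n = 200, p = 0.05 and the seed are arbitrary choices made here.

```python
import random
from collections import Counter
from typing import Callable, Dict, Hashable, Set

Graph = Dict[int, Set[int]]


def er_graph(n: int, p: float, seed: int) -> Graph:
    """Erdős-Rényi graph: each of the n*(n-1)/2 possible edges appears with probability p."""
    rng = random.Random(seed)
    graph: Graph = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def degree_query(graph: Graph, node: int) -> int:
    return len(graph[node])


def neighbor_degree_query(graph: Graph, node: int) -> tuple:
    return tuple(sorted(len(graph[nb]) for nb in graph[node]))


def unique_fraction(graph: Graph, query: Callable[[Graph, int], Hashable]) -> float:
    """Fraction of nodes whose query answer is shared with no other node."""
    answers = [query(graph, v) for v in graph]
    counts = Counter(answers)
    return sum(1 for a in answers if counts[a] == 1) / len(answers)


# Illustrative parameters (assumptions, not values from the talk).
sample = er_graph(n=200, p=0.05, seed=0)
unique_by_degree = unique_fraction(sample, degree_query)
unique_by_neighbors = unique_fraction(sample, neighbor_degree_query)
```

Because the neighbor-degree answer determines the degree, a node that is unique by degree is also unique by neighbor degrees, so the second fraction can never be smaller than the first. Varying p while holding n fixed gives a rough feel for how density changes exposure.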
18
Anonymity in Random Graphs (Cont’d)
Anonymity Against Subgraph Queries
Depends on the number of nodes in the largest clique
If [condition missing in source] for a subgraph query, then [bound missing in source]
The clique number is a useful lower bound on the disclosure.
Random Graphs with Attributes
19
Graph Generalization for Anonymization
Generalize a naïvely anonymized graph by partitioning its nodes into supernodes.
The generalized graph carries much uncertainty (measured by the number of possible worlds).
Find the partitioning that maximizes the likelihood, subject to every supernode containing at least k nodes.
Apply simulated annealing to find the partitioning.
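The uncertainty measure can be made concrete by counting the graphs that are consistent with a generalized graph. The sketch assumes the generalized graph records, for a given partition, each supernode's size, the number of edges inside it, and the number of edges between each pair of supernodes; under that assumption the count is a product of binomial coefficients. The function names and the example are illustrative, and the simulated annealing search over partitions is not shown.

```python
from math import comb
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]


def generalize(graph: Graph, partition: List[Set[str]]):
    """Summarize a simple undirected graph by supernode sizes and edge counts."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    sizes = [len(block) for block in partition]
    internal = [0] * len(partition)
    cross: Dict[Tuple[int, int], int] = {}
    edges = {frozenset((u, v)) for u in graph for v in graph[u]}
    for edge in edges:
        u, v = tuple(edge)
        i, j = sorted((block_of[u], block_of[v]))
        if i == j:
            internal[i] += 1
        else:
            cross[(i, j)] = cross.get((i, j), 0) + 1
    return sizes, internal, cross


def possible_worlds(sizes, internal, cross) -> int:
    """Number of graphs consistent with the summary (product of binomials)."""
    total = 1
    for size, count in zip(sizes, internal):
        total *= comb(size * (size - 1) // 2, count)   # edges inside a supernode
    for (i, j), count in cross.items():
        total *= comb(sizes[i] * sizes[j], count)      # edges between two supernodes
    return total


# Hypothetical example graph with edges A-B, B-C, B-D, D-E.
example: Graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

summary = generalize(example, [{"A", "B", "C"}, {"D", "E"}])
worlds = possible_worlds(*summary)
```

If every possible world is taken to be equally likely, the likelihood of the original graph is the reciprocal of this count, so fewer possible worlds means a better fit and less protection. A partition into singletons yields exactly one possible world.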
20
Graph Generalization for Anonymization (Cont’d)
How can the generalized graph be analyzed?
Construct a synthetic graph using the information tagged on the generalized graph.
Perform standard graph analysis on this synthetic graph.
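Constructing a synthetic graph can be sketched as drawing one graph at random that agrees with the tagged information. The sketch assumes the tags are supernode sizes, edge counts inside each supernode, and edge counts between supernode pairs, and that edges are placed uniformly at random among the allowed node pairs; the function name `sample_world`, the node labels, and the example summary are illustrative.

```python
import random
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]  # (supernode index, position inside the supernode)


def sample_world(sizes: List[int], internal: List[int],
                 cross: Dict[Tuple[int, int], int], seed: int) -> Dict[Node, Set[Node]]:
    """Draw one synthetic graph consistent with the generalized graph."""
    rng = random.Random(seed)
    blocks = [[(i, k) for k in range(size)] for i, size in enumerate(sizes)]
    graph: Dict[Node, Set[Node]] = {v: set() for block in blocks for v in block}

    def add_edges(pairs: List[Tuple[Node, Node]], count: int) -> None:
        # Choose `count` distinct node pairs uniformly at random.
        for u, v in rng.sample(pairs, count):
            graph[u].add(v)
            graph[v].add(u)

    for i, block in enumerate(blocks):
        add_edges(list(combinations(block, 2)), internal[i])
    for (i, j), count in cross.items():
        add_edges(list(product(blocks[i], blocks[j])), count)
    return graph


# Hypothetical summary: supernodes of sizes 3 and 2, with 2 and 1 internal
# edges, and 1 edge between them.
synthetic = sample_world([3, 2], [2, 1], {(0, 1): 1}, seed=7)
edge_count = sum(len(nbrs) for nbrs in synthetic.values()) // 2
```

Standard measurements such as degree or path length can then be run on `synthetic` as on any ordinary graph, and repeating the draw with different seeds gives a set of synthetic graphs to average over.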
21
Graph Generalization for Anonymization (Cont’d)
How does graph generalization affect network properties?
Examine five properties on the three real-world networks.
– Degree
– Path Length
– Transitivity (Clustering Coefficient)
– Network Resilience
– Infectiousness
Perform the experiments on 200 synthetic graphs.
Repeat for each [value missing in source].
22
Graph Generalization for Anonymization (Cont’d)
23
Conclusion
Three contributions
– Formalize models of adversary knowledge.
– Provide a starting point for the theoretical study of privacy risks in network data.
– Introduce a new anonymization technique that generalizes the original graph.