De-anonymizing social networks Arvind Narayanan, Vitaly Shmatikov.



Bob has accounts on both Facebook and Myspace.

[Figure: overlap between the Facebook graph and the Myspace graph]

Goal
Given auxiliary information, namely the nodes and edges of one social network (the auxiliary network) and a small number of members of the target network, the attacker wants to learn sensitive information about other members of the target network.

Goal
[Figure: Facebook graph and Myspace graph]

A social network is modeled as a graph.
[Figure: nodes A, B, C, annotated with name and location; edges annotated with relationships such as friends and advisor/student]

Algorithm: two stages
1. Seed identification
2. Propagation

Seed identification
Identify a small number of seed nodes that are present in both the target graph and the auxiliary graph, and map them to each other.
What is a seed? Seeds belong to a distinctive structure, such as a clique.
How small should the seed set be?

Seed identification
Input:
1. A target graph
2. k seed nodes in the auxiliary graph
3. Information about the k nodes, such as their degrees and the common-neighbor count of each pair
4. An error parameter ε
Method: brute-force search of the target graph for nodes whose degrees and common-neighbor counts match those of the seeds to within a factor of 1 ± ε.

Seed identification
Output: the mapping between the seeds in the auxiliary graph and the corresponding nodes in the target graph.
This method does not guarantee a unique k-clique in the target graph. In addition, the running time is exponential in k. To reduce the running time, the search stops as soon as a matching clique is found in the target graph.
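The brute-force seed search described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`within`, `find_seed_mapping`), the adjacency-set representation of the graph, and the toy data are all hypothetical. The sketch tries every ordered k-tuple of target nodes, so it is exponential in k, and it returns the first match rather than checking uniqueness.

```python
from itertools import permutations


def within(actual, expected, eps):
    """True if actual lies within a factor of 1 +/- eps of expected."""
    return (1 - eps) * expected <= actual <= (1 + eps) * expected


def find_seed_mapping(target, seeds, seed_degree, seed_common, eps):
    """Return {seed: target node} for the first matching k-clique, else None.

    target      -- dict: node -> set of neighbours (undirected target graph)
    seeds       -- list of k seed labels from the auxiliary graph
    seed_degree -- dict: seed -> degree in the auxiliary graph
    seed_common -- dict: (seed_a, seed_b) -> common-neighbour count,
                   assumed to contain every pair of seeds exactly once
    """
    for candidate in permutations(target, len(seeds)):
        assign = dict(zip(seeds, candidate))
        # Degrees must match within the error factor.
        if not all(within(len(target[assign[s]]), seed_degree[s], eps)
                   for s in seeds):
            continue
        # Every pair must be adjacent (clique) with a matching
        # common-neighbour count.
        if all(assign[b] in target[assign[a]]
               and within(len(target[assign[a]] & target[assign[b]]),
                          expected, eps)
               for (a, b), expected in seed_common.items()):
            return assign  # stop at the first match
    return None


# Hypothetical toy target graph containing the clique {1, 2, 3}.
target = {
    1: {2, 3, 4},
    2: {1, 3, 4, 5},
    3: {1, 2, 5, 6},
    4: {1, 2},
    5: {2, 3},
    6: {3},
}
seeds = ["a", "b", "c"]
seed_degree = {"a": 3, "b": 4, "c": 4}
seed_common = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 2}

seed_mapping = find_seed_mapping(target, seeds, seed_degree, seed_common, 0.1)
print(seed_mapping)
```

A practical version would prune candidates by degree before enumerating tuples; the exhaustive loop here is kept only to mirror the "brute-force" wording of the slide.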

Propagation
The propagation stage is a self-reinforcing process: the seed mapping is extended to new nodes, and each new mapping is fed back into the algorithm.
Input:
1. G1(V1, E1)
2. G2(V2, E2)
3. A "seed" mapping between the two
It does not matter which of the two graphs is the auxiliary graph and which is the target graph.
Output: new mappings between nodes in G1 and G2, in addition to the seed mappings.

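Propagation: method
[Figure: graph G1 with nodes A, B, C, D, E and graph G2 with nodes X, Y, Z, W]
The score of a candidate pair (u, v) is the number of neighbors of u in G1 that are already mapped to neighbors of v in G2.
Mapping list = {(B, Y), (C, Z), (D, W)}
Score(A, X) = 2, Score(E, Z) = 0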
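The scoring rule can be illustrated with a short sketch. The edges of G1 and G2 below are an assumption: the slide's figure did not survive in the transcript, so the graphs are chosen only so that they reproduce the two scores quoted on the slide. The function name `score` and the adjacency-set representation are likewise hypothetical.

```python
def score(u, v, g1, g2, mapping):
    """Count neighbours of u in G1 whose mapped image is a neighbour of v in G2."""
    return sum(1 for n in g1[u] if mapping.get(n) in g2[v])


# Assumed edges (the original figure is not available).
g1 = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": {"E"}, "E": {"D"}}
g2 = {"X": {"Y", "Z"}, "Y": {"X"}, "Z": {"X"}, "W": set()}
mapping = {"B": "Y", "C": "Z", "D": "W"}

print(score("A", "X", g1, g2, mapping))  # B->Y and C->Z are both neighbours of X
print(score("E", "Z", g1, g2, mapping))  # D->W is not a neighbour of Z
```

Unmapped neighbours contribute nothing, because `mapping.get` returns `None` for them and `None` is never a member of an adjacency set.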

Propagation
We obtain many scores, such as Score(A, P), Score(A, Q), Score(A, R) and so on, where P, Q and R are nodes in G2 and A is a node in G1. Which mapping should we keep?
Additional details:
1. Edge directionality. Score = incoming-edge score + outgoing-edge score.
2. Node degrees. Score(u, v_i) = Score(u, v_i) / √(degree of v_i).
3. Eccentricity. It measures how much an item in a set X "stands out" from the rest. If the eccentricity of the scores exceeds θ, the best-scoring mapping is kept; otherwise it is rejected. θ is a parameter.
4. Reverse match. Switch the input graphs; if v gets mapped back to u, the mapping is retained; otherwise it is rejected.
5. Revisiting nodes. As the number of mapped nodes increases, already mapped nodes need to be revisited.
6. Iterate until convergence.
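Steps 2 and 3 above can be sketched as below. The slide does not give the eccentricity formula, so this sketch assumes the definition used in the original paper, (largest value − second-largest value) / standard deviation; using the population standard deviation is a further assumption. The names `eccentricity` and `pick_match` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import pstdev


def eccentricity(values):
    """(max - second max) / standard deviation; 0.0 when undefined."""
    if len(values) < 2:
        return 0.0
    ordered = sorted(values, reverse=True)
    sd = pstdev(values)  # assumption: population standard deviation
    return 0.0 if sd == 0 else (ordered[0] - ordered[1]) / sd


def pick_match(raw_scores, degree2, theta):
    """Return the best candidate in G2 for a node u, or None if rejected.

    raw_scores -- dict: candidate v -> raw Score(u, v)
    degree2    -- dict: candidate v -> degree of v in G2
    theta      -- eccentricity threshold
    """
    # Step 2: divide each score by the square root of the candidate's degree.
    normalised = {v: s / sqrt(degree2[v])
                  for v, s in raw_scores.items() if degree2[v] > 0}
    if not normalised:
        return None
    # Step 3: keep the best candidate only if it stands out enough.
    if eccentricity(list(normalised.values())) <= theta:
        return None
    return max(normalised, key=normalised.get)


# Hypothetical scores for one node A against candidates P, Q, R.
raw = {"P": 4, "Q": 1, "R": 1}
deg = {"P": 4, "Q": 1, "R": 4}
print(pick_match(raw, deg, theta=0.5))  # P stands out
print(pick_match(raw, deg, theta=2.0))  # threshold too strict, rejected
```

Step 4 could reuse the same function with the roles of G1 and G2 swapped, retaining (u, v) only if the reverse call for v returns u.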

Complexity
Overall: O((|E1| + |E2|) * d1 * d2), where d1 is a bound on the degree of nodes in G1 and d2 is a bound on the degree of nodes in G2.
Without revisiting nodes and without reverse matches: O(|E1| * d2)
Without reverse matches: O(|E1| * d1 * d2)

Experiment
Relations used: "follow" on Twitter, "contact" on Flickr, "friend" on LiveJournal.

Network      Nodes   Edges   Av. Degree
Twitter      224K    8.5M    37.7
Flickr       3.3M    53M     32.2
LiveJournal  5.3M    77M     29.3

Seed identification
Tested on LiveJournal.

Propagation
The number of seeds determines whether the propagation step dies out. The graph has over 100,000 nodes.
Node overlap: 25%
Edge overlap: 50%

Propagation
Imprecision in the auxiliary information decreases the percentage of correctly re-identified nodes.
Node overlap: 25%
Number of seeds: 50

Propagation
Auxiliary graph: Flickr. Target graph: Twitter.
The seed mapping consists of 150 pairs of nodes, with the constraint that each seed node has degree at least 80 in the auxiliary graph. We have around 27,000 mappings.

Accuracy (acc) on mappings
acc = [formula missing]
error = [formula missing]
[Figure: graphs G1 and G2 with nodes v and u]

Accuracy results
Of the roughly 27,000 mappings, 30.8% were re-identified correctly, 12.1% were identified incorrectly, and 57% were not identified.
41% of the incorrectly identified mappings were mapped to nodes at distance 1 from the true mapping.
55% of the incorrectly identified mappings were mapped to nodes that reported the same location.
These two categories overlap; 27% of the incorrect mappings fall into neither and are completely erroneous.
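The size of the overlap between the two categories is not stated, but it follows from inclusion-exclusion if "completely erroneous" is read as "in neither category" (an assumption about the slide's wording). Writing A for incorrect mappings at distance 1 from the true node and B for incorrect mappings with the same reported location, both as fractions of all incorrect mappings:

```latex
\begin{align*}
|A \cup B| &= 100\% - 27\% = 73\% \\
|A \cap B| &= |A| + |B| - |A \cup B| = 41\% + 55\% - 73\% = 23\%
\end{align*}
```

Under that reading, roughly 23% of the incorrect mappings are both adjacent to the true node and share its reported location.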