Privacy and Spectral Analysis on Social Network Randomization
Xiaowei Ying, Leting Wu, Xintao Wu
University of North Carolina at Charlotte
Framework
Background & Motivation
Privacy in Randomized Graph
-- Link privacy (3 methods to quantify link privacy)
-- Node privacy
Feature Preserving Randomization
-- Spectrum preserving randomization
-- General feature preserving randomization (Markov chain based)
-- Attacks to feature preserving randomization
Reconstruction from Randomized Graphs
Spectrum Based Fraud Detection
-- A spectral framework to quantify non-randomness of social networks
-- Spectrum based fraud detection
Future Work
Background & Motivation
Social Network
Friendship in Karate club [Zachary, 77]
Biological association network of dolphins [Lusseau et al., 03]
Collaboration network of scientists [Newman, 06]
Network of US political books (105 nodes, 441 edges): books about US politics sold by Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. Nodes are colored blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".
Background & Motivation
Publish/outsource data for mining/analysis: the data owner releases the original graph data to the public, a third party, or a research institution.
Data miner: discover patterns/features of the data (utility) -- find central nodes, community partition, link prediction
Attacker: breach sensitive information in the data (privacy) -- identity of nodes (and sensitive attributes), sensitive relation between two individuals
Background & Motivation
Privacy issues in publishing social network data: anonymization alone is not enough to protect privacy. Active/passive attacks [1], subgraph attacks [2].
[1] L. Backstrom et al., Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. WWW07
[2] M. Hay et al., Resisting Structural Reidentification in Anonymized Social Networks. VLDB08
Privacy Preserving Social Network Publishing
Node anonymization cannot guarantee identity/link privacy because of subgraph queries.
K-anonymity generalization: for every node, the released graph has at least k nodes with the same degree/subgraph/neighborhood [Liu&Terzi SIGMOD08, Zhou&Pei ICDE08, Chen VLDB09]
Graph (edge) randomization: Random Add/Del & Random Switch
Utility preserving randomization
Super graph generalization: generalize nodes into super nodes, and edges into super edges
Background & Motivation
Graph Randomization/Perturbation:
1. Random Add/Del edges (no. of edges unchanged)
2. Random Switch edges (nodes' degrees unchanged)
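The two perturbation strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the edge-set representation (sorted node pairs), and the `max_tries` guard are assumptions made for the example.

```python
import random

def degree_sequence(edges, n):
    """Degree of every node 0..n-1 in an undirected edge set."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def random_add_del(edges, n, k, rng):
    """Delete k true edges and add k false edges; the edge count is unchanged."""
    edges = set(edges)
    non_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if (i, j) not in edges]
    removed = set(rng.sample(sorted(edges), k))
    added = set(rng.sample(non_edges, k))
    return (edges - removed) | added

def random_switch(edges, k, rng, max_tries=10000):
    """Apply up to k switches (t,w),(u,v) -> (t,v),(u,w); degrees are unchanged."""
    edges = set(edges)
    done = 0
    for _ in range(max_tries):
        if done == k:
            break
        (t, w), (u, v) = rng.sample(sorted(edges), 2)
        e1, e2 = tuple(sorted((t, v))), tuple(sorted((u, w)))
        # Reject switches that would create a self-loop or a duplicate edge.
        if len({t, w, u, v}) < 4 or e1 in edges or e2 in edges:
            continue
        edges = (edges - {(t, w), (u, v)}) | {e1, e2}
        done += 1
    return edges

rng = random.Random(0)
ring = {tuple(sorted((i, (i + 1) % 8))) for i in range(8)}  # toy 8-node cycle
perturbed = random_add_del(ring, 8, 3, rng)
switched = random_switch(ring, 5, rng)
```

On the toy cycle, `perturbed` keeps the original number of edges and `switched` keeps every node's degree, which are the two invariants named on the slide.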
Background & Motivation
Graph Randomization/Perturbation:
Data privacy: How does graph randomization prevent privacy disclosure?
Data utility: How does the graph structure change under randomization? How can graph structural features be better preserved?
Background & Motivation
Numerous topological measures of networks:
Harmonic mean of shortest distance
Transitivity (clustering coefficient)
Subgraph centrality
Modularity (community structure)
And many others
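Background & Motivation
Spectral measures: adjacency matrix
Adjacency matrix A (symmetric): $a_{ij} = 1$ if nodes i and j are linked, and 0 otherwise.
Adjacency spectrum: the eigenvalues of A, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.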
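Background & Motivation
Laplacian matrix and spectrum: $L = D - A$, where D is the diagonal degree matrix, with eigenvalues $0 = \mu_1 \le \mu_2 \le \dots \le \mu_n$.
Normal matrix and spectrum: $N = D^{-1/2} A D^{-1/2}$.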
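Many topological features are related to spectral measures:
No. of triangles: $\frac{1}{6}\sum_{i=1}^{n}\lambda_i^3$
Subgraph centrality (average over nodes): $\frac{1}{n}\sum_{i=1}^{n}e^{\lambda_i}$
Graph diameter: bounded in terms of the spectrum
k disconnected parts in the graph ⇔ k 0's in the Laplacian spectrum.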
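Two of these relations are easy to check numerically. The sketch below assumes NumPy is available and uses a hypothetical toy graph (a 4-clique plus one isolated edge); the helper names are illustrative only.

```python
import numpy as np

def adjacency(n, edges):
    """Symmetric 0-1 adjacency matrix of an undirected graph."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def triangles_from_spectrum(A):
    """Number of triangles = (1/6) * sum of cubed adjacency eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    return float(np.sum(lam ** 3) / 6.0)

def components_from_laplacian(A, tol=1e-8):
    """Number of connected parts = number of (numerically) zero Laplacian eigenvalues."""
    L = np.diag(A.sum(axis=1)) - A
    mu = np.linalg.eigvalsh(L)
    return int(np.sum(np.abs(mu) < tol))

# Toy graph: clique on nodes 0..3 plus a separate edge 4-5.
clique = [(i, j) for i in range(4) for j in range(i + 1, 4)]
A = adjacency(6, clique + [(4, 5)])
n_triangles = round(triangles_from_spectrum(A))
n_parts = components_from_laplacian(A)
```

A 4-clique contains four triangles and the toy graph has two disconnected parts, so the two spectral quantities can be compared against direct counting.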
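Background & Motivation
Two important eigenvalues: $\lambda_1$ (the largest eigenvalue of the adjacency matrix) and $\mu_2$ (the second smallest eigenvalue of the Laplacian matrix).
1. The maximum degree, chromatic number, clique number, etc. are related to $\lambda_1$.
2. The epidemic threshold for virus propagation in the network is related to $\lambda_1$ [Wang et al., KDD03].
3. $\mu_2$ indicates the community structure of the graph: clear community structure ⇔ $\mu_2$ ≈ 0.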
Basic Facts of Graph Spectrum
The Laplacian eigenvalues
Graph from: A. Capocci et al., Detecting communities in large networks
Basic Facts of Graph Spectrum
The Laplacian eigenvectors
Privacy in Randomized Graph
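Link Privacy: Prior & Posterior Beliefs
Quantify the attacker's belief (assume that node identities are known).
Prior probability: the attacker's belief that a link exists before seeing the released graph.
Posterior probability for node pair (i, j): the attacker's belief after observing the randomized graph.
Privacy is seriously jeopardized when the posterior probability is much larger than the prior probability.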
Link Privacy: Prior & Posterior Beliefs
Method I [Ying, Wu, SDM08]: posterior beliefs under two randomization strategies
-- Add & Del k links
-- Switch k times
Link Privacy: Prior & Posterior Beliefs
Method II [Ying, Wu, PAKDD09]
A common phenomenon: in real-world graphs, similar nodes tend to connect to each other.
Even after moderate randomization, the phenomenon still exists.
Link Privacy: Prior & Posterior Beliefs
Method II [Ying, Wu, PAKDD09]
Add/Del:
1. Each true link is deleted with a known probability.
2. Each false link is added with a known probability.
The posterior belief then follows from Bayes' theorem.
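As a worked instance of the Bayes step, the sketch below assumes the simplest setting: Add/Del of k links on a graph with n nodes and m edges, and a uniform prior m/N over all N = n(n-1)/2 node pairs. Method II refines the prior with node similarity, which is not modeled here; all names are illustrative.

```python
def add_del_posteriors(n, m, k):
    """Prior and posterior link beliefs after randomly deleting and adding k links."""
    N = n * (n - 1) // 2               # number of node pairs
    prior = m / N                      # P(true link) without any observation
    p_keep = (m - k) / m               # P(observed link | true link)
    p_add = k / (N - m)                # P(observed link | no true link)
    p_obs = p_keep * prior + p_add * (1 - prior)
    post_link = p_keep * prior / p_obs                 # P(true link | observed link)
    post_no_link = (1 - p_keep) * prior / (1 - p_obs)  # P(true link | no observed link)
    return prior, post_link, post_no_link

# Polbooks-sized example (105 nodes, 441 edges), perturbing k = 44 links.
prior, post_link, post_no_link = add_del_posteriors(105, 441, 44)
```

Under these assumptions the posterior for an observed link simplifies to (m - k)/m and for an unobserved pair to k/(N - m), so an observed link remains far more likely to be true than the prior suggests unless k is large.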
Link Privacy: Prior & Posterior Beliefs
Method II [Ying, Wu, PAKDD09]
Evaluation (add/del 50% of the true links)
The total sum of prior and posterior probabilities is the same.
[Figure: prior prob., posterior prob. I, posterior prob. II]
Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
Intuition: the degree sequence specifies a graph space, and the true graph is just one member of the space.
Example: switch graph with degree sequence {3,2,2,2,3}. Are nodes 1 and 5 connected?
Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
Graph space = {G: with a given degree sequence}
It is impractical to enumerate all members of the space.
Sample the graph space through a Markov chain.
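The sampling idea can be sketched on the degree sequence {3,2,2,2,3}. The code below is an illustrative Markov chain over simple graphs with a fixed degree sequence, in which an invalid switch leaves the graph unchanged; the starting graph, function names, and sample sizes are assumptions for the example, not the exact procedure of the cited paper.

```python
import random

def switch_step(edges, rng):
    """One chain step: try a random switch; stay put if it is not allowed."""
    a, b = rng.sample(sorted(edges), 2)
    (t, w), (u, v) = a, b
    if rng.random() < 0.5:          # choose one of the two possible rewirings
        u, v = v, u
    e1, e2 = tuple(sorted((t, v))), tuple(sorted((u, w)))
    # Disallow shared endpoints (self-loops) and duplicate edges.
    if len({t, w, u, v}) < 4 or e1 in edges or e2 in edges:
        return edges
    return (edges - {a, b}) | {e1, e2}

def link_posterior(released, pair, n_samples, steps, rng):
    """Fraction of sampled graphs (same degree sequence) that contain `pair`."""
    hits = 0
    for _ in range(n_samples):
        g = set(released)
        for _ in range(steps):
            g = switch_step(g, rng)
        hits += pair in g
    return hits / n_samples

# A hypothetical released graph with degree sequence {3,2,2,2,3} on nodes 1..5.
released = {(1, 2), (1, 3), (1, 5), (2, 4), (3, 5), (4, 5)}
p15 = link_posterior(released, (1, 5), n_samples=300, steps=40,
                     rng=random.Random(1))

# One sample walk, kept to show that the degree sequence is preserved.
sample = set(released)
walk_rng = random.Random(2)
for _ in range(40):
    sample = switch_step(sample, walk_rng)
```

For this tiny degree sequence the space can be enumerated by hand: there are 7 labeled graphs, 6 of which contain the link (1, 5), so the estimate `p15` should come out close to 6/7. For realistic graphs such enumeration is impossible, which is why sampling is used.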
Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
Evaluation
[Figures: Polbooks (r=8%), Enron (r=8%)]
Node Identity Privacy
Identity privacy: re-identifying nodes in the anonymous graph based on some background information (e.g. degree).
Randomization reduces attackers' beliefs.
[Figure: Polbooks degree distribution, before and after randomization]
Node Identity Privacy
Nodes' prior and posterior risks: given an individual α with degree d_α and a randomized graph, compare
-- Prior risk
-- Posterior risks
Node Identity Privacy
Ongoing work:
Compare randomization with the k-anonymity approach (K-degree generalization [Liu et al.]) -- to achieve the same privacy protection level, which approach achieves better utility?
Combine identity privacy and link privacy.
Node identity privacy under different background information (e.g., subgraph, neighborhood).
Feature Preserving Randomization
Feature Preserving Randomization
Topological and spectral features change a lot along the perturbation (network of US political books, 105 nodes and 441 edges).
Can we better preserve the network structure?
Spectrum Preserving Randomization
Spectrum preserving approach [Ying, Wu, SDM08]
Intuition: since the spectrum is related to many graph topological features, can we preserve more structural features by controlling the movement of eigenvalues?
Spectrum Preserving Randomization
Spectral Switch (applied to the adjacency matrix): separate switch conditions are used to increase and to decrease the eigenvalue.
Spectral Switch (applied to the Laplacian matrix): separate switch conditions are used to decrease and to increase the eigenvalue.
Spectrum Preserving Randomization
Evaluation (network of US political books, 105 nodes and 441 edges)
Markov Chain Based Feature Preserving Randomization
Markov chain generation [Ying, Wu, SDM09]
The data owner puts feature range constraints on switching.
The data owner publishes the feature range constraint.
Markov chain with feature range constraint (uniformity for accessible graphs)
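A feature range constraint can be added to a switching chain by rejecting any switch that moves the feature outside the allowed range. The sketch below is illustrative only: it uses the triangle count as a hypothetical feature S(G) and a hypothetical range R = [1, 2]; it is not the algorithm of the cited paper.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Example feature S(G): number of triangles (edges are sorted node pairs)."""
    nodes = sorted({x for e in edges for x in e})
    return sum(1 for a, b, c in combinations(nodes, 3)
               if (a, b) in edges and (a, c) in edges and (b, c) in edges)

def constrained_switch_chain(edges, feature, lo, hi, steps, rng):
    """Degree-preserving switching that only accepts graphs with lo <= S(G) <= hi."""
    g = set(edges)
    for _ in range(steps):
        a, b = rng.sample(sorted(g), 2)
        (t, w), (u, v) = a, b
        if rng.random() < 0.5:
            u, v = v, u
        e1, e2 = tuple(sorted((t, v))), tuple(sorted((u, w)))
        if len({t, w, u, v}) < 4 or e1 in g or e2 in g:
            continue                      # not a valid simple-graph switch
        candidate = (g - {a, b}) | {e1, e2}
        if lo <= feature(candidate) <= hi:
            g = candidate                 # accept only if the feature stays in range
    return g

# Two triangles joined by a bridge; the feature must stay within [1, 2].
start = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}
out = constrained_switch_chain(start, count_triangles, 1, 2, 200, random.Random(3))
```

Because rejected switches shrink the set of reachable graphs, some graphs satisfying the constraint may not be accessible from the starting graph.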
Markov Chain Based Feature Preserving Randomization
Markov chain generation [Ying, Wu, SDM09]
Problem: accessibility is not guaranteed.
We propose the relaxed algorithm with feature range constraint (accessibility, approximate uniformity).
The relaxed algorithm also has applications in testing the significance of data mining results.
Attacks to Feature Preserving Randomization
The data owner puts feature range constraints on switching.
Can attackers utilize the feature constraints to breach link privacy?
Attacks to Feature Preserving Randomization
Markov chain approach [Ying, Wu, SDM09]: Markov chain with feature range constraint
Graph space = {G: with a given deg. seq. & S(G) in R}
1. Starting with the randomized data, repeat the switch procedure many times and get one sample graph.
2. Generate N graphs.
Attacks to Feature Preserving Randomization
Markov chain approach [Ying, Wu, SDM09]
Evaluation
[Figures: Polbooks (r=8%), Enron (r=8%)]
Future work: what causes the difference? Which features will (not) disclose private information?
Reconstruction from Randomized Graphs (SDM10 paper)
Motivation
Low Rank Approximation on Graph Data
Reconstruction from Randomized Graph
Privacy Issue
Motivation
Our focus: whether we can reconstruct a graph from the randomized graph such that the reconstruction is closer to the original graph than the randomized graph is.
Revisit of LRA in Numerical Data
Spectral filter: derive an estimate of U from the perturbed data.
1. Calculate the covariance matrix of the perturbed data, which is symmetric and positive definite.
2. Apply spectral decomposition to the covariance matrix.
3. Derive the eigenvalue information from the covariance matrix of the noise V and choose a proper number of dimensions, r.
4. Obtain the estimated data set by projecting the perturbed data onto the r leading eigenvectors.
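The filtering steps can be sketched with NumPy. This is a minimal illustration that takes r as given rather than deriving it from the noise covariance; the synthetic data, the noise level, and the function name are assumptions made for the example.

```python
import numpy as np

def spectral_filter(U_tilde, r):
    """Project perturbed data onto the r leading eigenvectors of its covariance."""
    mean = U_tilde.mean(axis=0)
    centered = U_tilde - mean
    C = np.cov(centered, rowvar=False)         # symmetric covariance matrix
    lam, Q = np.linalg.eigh(C)                 # spectral decomposition
    Q_r = Q[:, np.argsort(lam)[::-1][:r]]      # r leading eigenvectors
    return centered @ Q_r @ Q_r.T + mean

rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1)) @ np.array([[1.0, 2.0, -1.0]])  # correlated columns
noise = 0.3 * rng.normal(size=(200, 3))                            # uncorrelated noise
perturbed = signal + noise
estimate_1d = spectral_filter(perturbed, 1)
estimate_full = spectral_filter(perturbed, 3)
```

With r equal to the full dimension the filter returns the perturbed data unchanged; a smaller r keeps only the directions in which the correlated signal dominates.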
Why it works
Original data are correlated; noise is not correlated.
[Figure: original signal + noise = perturbed data, with the 1st and 2nd principal vectors and the resulting 1-d and 2-d estimations]
Determining r
Strategy 1: (Huang and Du, SIGMOD05)
Strategy 2: (Guo, Wu and Li, PKDD 2006)
The data estimated with the chosen r is approximately optimal.
Graph Data
Matrix representation of a network: adjacency matrix A (symmetric) and its adjacency spectrum
Low Rank Approximation
Low rank approximation by eigen-decomposition: this provides the best rank-r approximation to A.
To keep the structure of an adjacency matrix, discretize the approximation.
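A rank-r reconstruction with discretization can be sketched as below. Two choices here are assumptions for illustration, not taken from the slides: the r eigenpairs are picked by largest absolute eigenvalue (which gives the best rank-r approximation in Frobenius norm), and discretization keeps the m largest off-diagonal entries so that the edge count is preserved.

```python
import numpy as np

def low_rank_reconstruct(A_tilde, r, m):
    """Rank-r approximation of a symmetric matrix, discretized to m undirected edges."""
    lam, X = np.linalg.eigh(A_tilde)
    order = np.argsort(np.abs(lam))[::-1][:r]         # r dominant eigenpairs
    A_r = (X[:, order] * lam[order]) @ X[:, order].T  # sum of lam_i * x_i x_i^T
    n = A_tilde.shape[0]
    iu = np.triu_indices(n, k=1)                      # entries above the diagonal
    top = np.argsort(A_r[iu])[::-1][:m]               # m largest entries become edges
    A_hat = np.zeros_like(A_tilde)
    A_hat[iu[0][top], iu[1][top]] = 1.0
    return A_hat + A_hat.T

# Toy graph: two disjoint triangles (6 nodes, 6 edges).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A_hat = low_rank_reconstruct(A, r=2, m=6)
```

On the toy graph the two dominant eigenpairs already encode the two groups, so the rank-2 reconstruction recovers the original adjacency matrix.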
New Challenges
A is a 0-1 adjacency matrix, whereas U is a numerical matrix.
The covariance matrix is positive and has only non-negative eigenvalues, whereas A has both positive and negative eigenvalues.
A covariance matrix cannot be defined for graph data.
The strategy for determining the number of eigen components in numerical data does not work for graph data, since the first eigenvalue of the noise matrix could be very large.
Leading Eigenpairs vs. Graph Topology
Here we examine the role of positive and negative eigenvalues in graph topology.
Without loss of generality, we partition the node set into two groups, so the adjacency matrix can be partitioned into blocks as A = [A11 A12; A21 A22], where A11 and A22 represent the edges within the two groups and A12 (with A21 its transpose) represents the edges between the groups.
Leading Eigenpairs vs. Graph Topology
[Figures: original graphs compared with their rank-r reconstructions for r = 1, r = 2, and r = 4]
Algorithm
Reconstructed Features (Political Blogs, 40% Noise)
Determine Number of Eigenpairs
It is essential to find the best value of r given the randomized graph and the perturbation magnitude.
Choose as the indicator a feature that is closely related to the other features and for which an explicit moment estimator exists.
Data Sets
Political Blogs: based on incoming and outgoing links and posts around the time of the 2004 presidential election; links among 1222 US political blogs.
Political Books: based on the political books sold by Amazon.com, where nodes represent the books and edges represent the co-purchasing of books; 105 nodes and 441 edges.
Enron: based on the email corpus of a real organization covering a 3-year period, where an edge means at least 5 emails were sent between two people; 151 nodes and 869 edges.
Effect of Noise (Political Blogs)
The method works well up to a certain level of noise.
Even with a high level of noise, the reconstructed features are still closer to the original than the randomized ones.
Reconstruction Quality
Reconstructed features on 3 real network data sets
When the reconstruction quality is positive, the reconstructed features are closer to the original ones than the randomized ones.
It is positive for all three data sets.
Privacy Issue
Question 1: Can this reconstruction be used by attackers?
Define the normalized Frobenius distance between A and the reconstructed graph.
[Figure: normalized F norm for Political Books, Enron, Political Blogs]
Privacy Issue
Question 2: Which type of graphs would have privacy breached?
For low rank graphs (those with only a small number of dominant eigenvalues), the distance between the reconstructed graph and the original graph can be very small.
Synthetic Low Rank Graphs
For a set of synthetic low rank graphs generated from Political Blogs, the reconstruction works well in terms of both the distance and the features.
Conclusion
We have shown the close relationship between graph topological structure and the spectral spaces determined by eigen-pairs of the adjacency matrix.
We have presented a low rank approximation based reconstruction algorithm and a novel solution to determine the optimal rank in reconstruction.
We find that for most social networks, the reconstructed networks do not incur further disclosure risks of individual privacy than the released randomized graphs. Only networks with low ranks or a small number of dominant eigenvalues may incur further privacy disclosure due to reconstruction.
Spectrum Based Fraud Detection
A Spectral Framework to Quantify Graph Non-randomness
Adjacency matrix A (symmetric) and its adjacency spectrum
A Spectral Framework to Quantify Graph Non-randomness
Graph non-randomness [Ying, Wu, SDM09]
Spectral coordinates
Link non-randomness
Node non-randomness
Graph non-randomness
Background & Motivation
[Figures: Laplacian spectral space, normal spectral space]
A Spectral Framework to Quantify Graph Non-randomness
Graph non-randomness [Ying, Wu, SDM09], defined at three levels: link non-randomness, node non-randomness, and graph non-randomness.
The graph non-randomness is normalized by the mean and standard deviation for ER graphs.
Properties:
-- Normally distributed, with mean equal to that of the ER graph
-- The complete graph and the regular graph reach the positive and negative extreme values
-- Randomization reduces the non-randomness value
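The three levels can be sketched numerically. The definitions used below are an assumption based on my reading of the cited SDM09 paper, since the slide formulas are not reproduced here: the spectral coordinate of node u collects its entries in the k leading eigenvectors, link non-randomness is the dot product of two linked nodes' coordinates, node non-randomness sums link values over a node's neighbours, and graph non-randomness sums node values. The ER-based normalization is omitted.

```python
import numpy as np

def nonrandomness(A, k):
    """Link, node and graph non-randomness from the k leading adjacency eigenpairs."""
    lam, X = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1][:k]
    lam_k, X_k = lam[order], X[:, order]   # row u of X_k = spectral coordinate of u
    link = (X_k @ X_k.T) * A               # R(u, v), kept only for existing links
    node = link.sum(axis=1)                # R(u) = sum over neighbours of u
    graph = float(node.sum())              # R_G = sum over all nodes
    return link, node, graph, lam_k, X_k

# Toy graph: two triangles joined by one bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
link, node, R_G, lam_k, X_k = nonrandomness(A, k=2)
```

Under these definitions the node value reduces to a weighted sum of squared eigenvector entries and the graph value to the sum of the k leading eigenvalues, which is the link to the spectral switch: preserving those eigenvalues preserves the graph-level measure.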
A Spectral Framework to Quantify Graph Non-randomness
Application: spectral switch (applied to the adjacency matrix). To preserve the non-randomness of the whole graph (its eigenvalues), the deleted edges and the added fake edges should have comparable link non-randomness values.
Collaborative Attacks
Some attackers join the social network.
Attackers create links to regular users (victims).
Attackers form some inner structure among themselves.
Graph Perturbation
Collaborative Attacks
Collaborative Attacks
Approximate the entries in the eigenvector with first-order and second-order terms:
Regular nodes are approximately unchanged.
An attacker's entry is approximately expressed by the entries of its victims.
The inner structure among attackers affects the eigenvector only in the second-order term.
Problem
We do not know the attackers/victims in advance, hence their specific spectral coordinates are unknown.
For Random Link Attacks, we can derive the distribution of attacking nodes' spectral coordinates.
Random Link Attacks
The attacker creates some fake nodes and controls them to connect to randomly selected regular nodes.
Fake nodes can mimic the real graph structure among themselves to evade detection.
Topology approach: Shrivastava et al., ICDE08
Idea: count triangles around nodes -- regular connections produce many triangles, random connections do not.
Algorithm
-- Detecting suspects: clustering test and neighborhood independence test
-- Detecting RLAs: GREEDY and TRWALK
Limitations
-- Difficult to detect when attackers create a dense subgraph among themselves
-- Too many parameters
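The triangle-counting intuition can be illustrated with a local clustering coefficient. This is a simplified stand-in, not the clustering test or the GREEDY/TRWALK algorithms of Shrivastava et al.; the toy graph, the threshold, and the function names are assumptions for the example.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency sets of an undirected graph given as a list of node pairs."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return adj

def local_clustering(adj, u):
    """Fraction of neighbour pairs of u that are themselves linked (0 if degree < 2)."""
    nbrs = sorted(adj[u])
    if len(nbrs) < 2:
        return 0.0
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / (len(nbrs) * (len(nbrs) - 1) / 2)

def flag_suspects(adj, threshold):
    """Nodes with at least two neighbours whose neighbourhoods contain few triangles."""
    return sorted(u for u in adj
                  if len(adj[u]) >= 2 and local_clustering(adj, u) < threshold)

# Two 4-cliques of regular users; node 9 links to one random user in each.
clique_a = [(i, j) for i in range(4) for j in range(i + 1, 4)]
clique_b = [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
adj = build_adj(clique_a + clique_b + [(9, 0), (9, 4)])
suspects = flag_suspects(adj, threshold=0.3)
```

Node 9 closes no triangles and is flagged, while the regular users keep high coefficients. If several attackers formed a dense subgraph among themselves, their coefficients would rise, which is the limitation noted on the slide.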
Spectrum based RLA detection
For Random Link Attacks (RLA), the spectral coordinate of an attacking node has a normal distribution with bounded mean and variance.
We can get the region in the spectral space where RLA attackers appear with high probability.
The inner structure of attackers does not affect the region.
Example: 20 attackers, each attacking 30 victims on average.
Using node non-randomness
Combine the k dimensions together: we can get upper bounds on the mean and variance of the node non-randomness R, and from them the decision line.
Nodes below the decision line are suspects.
Example I
Spectral properties of normal nodes and attackers
20 attackers join the Polblogs network. Each attacker connects to 50 randomly selected victims. Attackers form a random graph among themselves.
Example II
Spectral properties of normal nodes and attackers
40 attackers join the Polblogs network. In total they attack 1000 randomly selected victims. Attackers mimic the real network structure among themselves.
Comparison
Topology based RLA detection approach (Shrivastava et al., ICDE08): clustering test and neighborhood independence test; GREEDY and TRWALK.
Experimental setting: Web Spam Challenge data (114K nodes and 1.8M links); add 8 RLAs with varied sizes and connection patterns.
Accuracy
Execution time
Distributed Denial of Service Attacks
Spectral properties of victim nodes
The attacker controls 200 normal nodes to attack one victim node.
Fraud Detection: Bipartite Core Attacks
The attacker creates two types of nodes, which together form a bipartite core:
Accomplices: connect to normal nodes and pretend to be normal. Accomplices also connect to fraudsters (and enhance the fraudsters' rating).
Fraudsters: nodes that actually commit fraud, mostly connected to accomplices.
Figure from: Duen Horng Chau et al., Detecting Fraudulent Personalities in Networks of Online Auctioneers
Future work
Compare randomization and k-anonymity
Combine link privacy and node privacy
Link and node privacy issues for feature preserving randomization
Spectrum based fraud detection for various random attacks
Thank you! Questions?
X. Wu, X. Ying, K. Liu and L. Chen. "A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks". Invited book chapter. Managing and Mining Graph Data. August.
X. Ying, X. Wu, K. Pan, and L. Guo. "On the Quantification of Identity and Link Disclosures in Randomizing Social Networks". Invited book chapter. Advances in Information & Intelligent Systems. Springer.
X. Wu, X. Ying and L. Wu. "Analyzing Socio-technical Networks: a Spectrum Perspective". Invited book chapter. Socio-technical Networks: Science and Engineering Design.
X. Ying, K. Pan, X. Wu and L. Guo. "Comparisons of Randomization and K-degree Anonymization Schemes for Privacy Preserving Social Network Publishing". (SNA-KDD09).
X. Ying and X. Wu. "Graph Generation with Prescribed Feature Constraints". (SDM09).
X. Ying and X. Wu. "On Randomness Measures for Social Networks". (SDM09).
X. Ying and X. Wu. "On Link Privacy in Randomizing Social Networks". (PAKDD09, Best Student Paper Runner-up Award).
X. Ying and X. Wu. "Randomizing Social Networks: a Spectrum Preserving Approach". (SDM08).
Evaluation
Future Work: Random Attack Detection
Node randomness
Fraud Detection: Bipartite Attacks
Algorithm outline:
1. Find the suspects according to the node non-randomness measure.
2. Compute the common neighbor (CN) matrix of the suspects: Susp_CN(i,j) = # CN of i and j. Susp_CN is a weighted undirected graph.
3. Find dense subgraphs in the Susp_CN graph.
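The common neighbor step of the outline can be sketched as follows. The suspect list is taken as given (in the outline it would come from the node non-randomness measure), and the weight threshold used in place of a dense subgraph search is a simplifying assumption; node labels and function names are illustrative.

```python
def common_neighbor_matrix(adj, suspects):
    """Susp_CN as a dict: (i, j) -> number of common neighbours of suspects i and j."""
    cn = {}
    for pos, i in enumerate(suspects):
        for j in suspects[pos + 1:]:
            cn[(i, j)] = len(adj[i] & adj[j])
    return cn

def heavy_pairs(cn, min_weight):
    """Suspect pairs whose common-neighbour weight reaches min_weight."""
    return sorted(pair for pair, weight in cn.items() if weight >= min_weight)

# Toy network: fraudsters f1, f2 share accomplices a1..a3; n1, n2 are normal users.
edges = [("f1", "a1"), ("f1", "a2"), ("f1", "a3"),
         ("f2", "a1"), ("f2", "a2"), ("f2", "a3"),
         ("a1", "n1"), ("a2", "n1"), ("a3", "n2"), ("n1", "n2")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

susp_cn = common_neighbor_matrix(adj, ["f1", "f2", "n1"])
core = heavy_pairs(susp_cn, min_weight=3)
```

Fraudsters in a bipartite core rarely link to each other directly, but they share many accomplices, so they show up as heavily weighted pairs in Susp_CN.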
Fraud Detection: Bipartite Attacks
Spectral space of the Susp_CN graph (Polblogs network, 20 accomplices, and 15 fraudsters)
Future Work: Node Identity Privacy
Re-identification risk decreases as k increases.
The Add/Del strategy can efficiently reduce the risk.
Link Privacy: Prior & Posterior Beliefs
Method III [Ying, Wu, SDM09]
1. Uniform switch procedure [Taylor, 1981]
2. Starting with the randomized data, repeat the uniform switch procedure many times and get one sample graph
3. Generate N graphs