Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International Conference on Database and Expert Systems Applications Sep. 1-4, 2015 Valencia, Spain Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos
Slide 2 of 35: The Real World: Information Networks
[Figure: example information networks with edges labeled Friendship and Coauthor]
Slide 4 of 35: Challenges
- Identify the importance of each edge-type/attribute property
  - For instance, when clustering a bibliography network:
    - The attribute ‘area of interest’ is important
    - The attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy
- Combine the attribute and structural vertex properties
  - Edges and attributes are of different types
Slide 5 of 35: Related Work
- Limited attention to the different importance of attributes/edge-types
- Weights are mainly updated at each iteration
- The existence of multiple edge-types is ignored
- Increases computational cost and complexity
- Spectral clustering is not used for clustering attributed graphs
  - Used to identify dense clusters in attribute subspaces
- Model-Based: BAGC [SIGMOD ‘12, TKDD ‘14], CESNA [ICDM ‘13]
- Distance-Based: SACluster [VLDB ‘09, TKDD ‘11], PICS [SDM ‘12], HASCOP [WI ‘13]
Slide 6 of 35: Proposed Approach: CAMIR
Clustering Attributed Multi-graphs with Information Ranking (CAMIR):
1. Rank edge-type and attribute properties
2. Construct a unified similarity matrix
3. Adopt a spectral clustering technique to generate the final clusters
Slide 7 of 35: Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
- Evaluation
- Summary
Slide 9 of 35: Background: Graph Partitioning
- An edge represents the similarity of the two connected vertices
- Find the minimum cut of a graph
  - Minimizes inter-cluster similarities
  - Identifies an optimal partitioning of the graph
- Identifying a minimum cut is computationally difficult
  - Efficient approximations using linear algebra
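To make the notion of a cut concrete, the following minimal sketch computes the total similarity crossing a given partition. The function name `cut_weight` and the toy graph are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def cut_weight(similarity, labels):
    """Sum the edge weights whose endpoints fall in different clusters."""
    similarity = np.asarray(similarity, dtype=float)
    labels = np.asarray(labels)
    # different[i, j] is True when vertices i and j are in different clusters.
    different = labels[:, None] != labels[None, :]
    # Each undirected edge appears twice in a symmetric matrix, hence the / 2.
    return similarity[different].sum() / 2.0

# Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge.
S = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    S[i, j] = S[j, i] = w

good_cut = cut_weight(S, [0, 0, 0, 1, 1, 1])  # crosses only the weak edge
bad_cut = cut_weight(S, [0, 0, 1, 1, 1, 1])   # crosses two strong edges
```

A partition that separates the two triangles has a much smaller cut than one that splits a triangle, which is the property a minimum-cut partitioning looks for.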
Slide 10 of 35: Background: Spectral Clustering
- Based on the graph Laplacian, or Laplacian matrix
- Given a similarity matrix, the normalized symmetric Laplacian L is defined as (formula missing from the source)
- The eigenvectors corresponding to the top k eigenvalues are the projection of the graph into R^(|V| x k)
- The projected data is easily separable into clusters, e.g. using k-means
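The defining formula did not survive in this transcript. One common definition that is consistent with the slide's use of the top k eigenvalues is the normalized form used by Ng, Jordan and Weiss, shown here as an assumption rather than as the authors' exact notation. The symbols S (similarity matrix), D (diagonal degree matrix) and U (eigenvector matrix) are introduced for illustration only.

```latex
D_{ii} = \sum_{j} S_{ij}, \qquad
L = D^{-1/2} \, S \, D^{-1/2}, \qquad
U = [\, u_1 \;\; u_2 \;\; \cdots \;\; u_k \,] \in \mathbb{R}^{|V| \times k}
```

Under this convention the columns of U are the eigenvectors of L with the k largest eigenvalues. The alternative convention I - D^{-1/2} S D^{-1/2} has the same eigenvectors, with each eigenvalue λ mapped to 1 - λ, so there the k smallest eigenvalues are used instead.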
Slide 11 of 35: Background: Spectral Clustering
[Figure: an example adjacency matrix, its Laplacian matrix, and the top 3 eigenvectors U1, U2, U3]
Slide 12 of 35: How do we define the similarity matrix for an attributed multi-graph?
Slide 13 of 35: Background: Similarity Matrices
- Attribute values (IR, DM, AI, IR in the figure): similarity matrix in [0,1]^(N x N)
- Gaussian kernel: similarity matrix in [0,1]^(N x N)
- Edges: similarity matrix in [0,1]^(N x N)
- In total: #Edge types + #Attributes symmetric, non-negative similarity matrices
- How do we efficiently combine the similarity matrices?
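A minimal sketch of how one N x N similarity matrix per property could be built. The exact-match rule for categorical values, the bandwidth `sigma`, the 0/1 edge weights, and all names and sample data below are assumptions for illustration; the slides only state that the matrices are symmetric, non-negative and take values in [0,1].

```python
import numpy as np

def categorical_similarity(values):
    """1.0 where two vertices share the same attribute value, else 0.0 (assumed rule)."""
    values = np.asarray(values)
    return (values[:, None] == values[None, :]).astype(float)

def gaussian_similarity(values, sigma=1.0):
    """Gaussian kernel exp(-(x_i - x_j)^2 / (2 * sigma^2)) on a numeric attribute."""
    values = np.asarray(values, dtype=float)
    diff = values[:, None] - values[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def edge_similarity(n, edges):
    """Symmetric 0/1 matrix for one edge type, built from (i, j) vertex pairs."""
    sim = np.zeros((n, n))
    for i, j in edges:
        sim[i, j] = sim[j, i] = 1.0
    return sim

area = ["IR", "DM", "AI", "IR"]      # categorical attribute, as in the figure
papers = [3.0, 10.0, 4.0, 3.5]       # hypothetical numeric attribute
coauthor = [(0, 3), (1, 2)]          # hypothetical edge list for one edge type

# One matrix per vertex property: #attributes + #edge types matrices in total.
matrices = [
    categorical_similarity(area),
    gaussian_similarity(papers, sigma=2.0),
    edge_similarity(4, coauthor),
]
```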
Slide 15 of 35: CAMIR Overview
1. Rank vertex properties and calculate their weights accordingly
   - By considering the agreement among vertex properties
2. Compute a unified similarity matrix
   - By combining all vertex properties based on their ranking
3. Generate the final clusters
   - By adopting a spectral clustering approach
Slide 16 of 35: Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
- Evaluation
- Summary
Slide 17 of 35: Information Ranking
- Rank attribute and edge-type properties
  - Iteratively select the most informative property from the set of unranked properties
- Most informative property [NIPS ’11]:
  - Has the highest ‘agreement’ with the other properties
  - Properties ‘agree’ when they assign vertices the same cluster labels when used individually
Slide 18 of 35: Information Ranking
- Rank attribute and edge-type properties
  - Iteratively select the most informative property from the set of unranked properties
- From the set of properties, the most informative property p is selected following [NIPS ‘11] (the selection formula is missing from the source)
- The highest rank (the size of the property set) is assigned to the most informative property
  - i.e. the one that best separates the vertices
- The lowest rank (1.0) is assigned to the property that is selected last
  - i.e. the one that does not ‘agree’ with the rest of the properties
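The ranking loop can be sketched as follows. This is an illustration under stated assumptions, not the criterion from [NIPS ‘11], whose formula is not preserved here: agreement is measured with the Rand index between the cluster labels each property produces on its own, and a property is scored by its mean agreement with the other still-unranked properties. The names `rand_agreement` and `rank_properties` and the example labelings are hypothetical.

```python
from itertools import combinations

def rand_agreement(labels_a, labels_b):
    """Fraction of vertex pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    same = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return same / len(pairs)

def rank_properties(labelings):
    """Map each property name to a rank from len(labelings) (best) down to 1.0."""
    unranked = list(labelings)
    ranks = {}
    next_rank = float(len(unranked))
    while unranked:
        def mean_agreement(name):
            # Average agreement with the other properties not yet ranked.
            others = [o for o in unranked if o != name]
            if not others:
                return 0.0
            return sum(rand_agreement(labelings[name], labelings[o])
                       for o in others) / len(others)
        best = max(unranked, key=mean_agreement)  # ties: first in input order
        ranks[best] = next_rank
        next_rank -= 1.0
        unranked.remove(best)
    return ranks

# Hypothetical per-property cluster labels for six vertices.
labelings = {
    "coauthor": [0, 0, 0, 1, 1, 1],
    "area_of_interest": [0, 0, 0, 1, 1, 1],
    "gender": [0, 1, 0, 1, 0, 1],
}
ranks = rank_properties(labelings)
```

In this example the two properties that agree with each other receive the higher ranks, and the property that disagrees with both is selected last and receives rank 1.0.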
Slide 20 of 35: Unified Similarity Matrix
- Combines the multiple edge-type and attribute properties with respect to the identified ranking
- Defined as the weighted sum of the individual similarity matrices
  - Weights are defined by normalizing the rankings
- Contains all the similarity information about the network under study
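A minimal sketch of the weighted sum. The slide does not say how the rankings are normalized; dividing each rank by the sum of all ranks is an assumption made here so that the weights add up to 1. The function name `unified_similarity` and the sample matrices are hypothetical.

```python
import numpy as np

def unified_similarity(matrices, ranks):
    """Weighted sum of similarity matrices, with weights = ranks / sum(ranks)."""
    ranks = np.asarray(ranks, dtype=float)
    weights = ranks / ranks.sum()  # assumed normalization
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    # Contract the property axis: sum over p of weights[p] * stacked[p].
    return np.tensordot(weights, stacked, axes=1), weights

# Hypothetical 3-vertex similarity matrices, one per vertex property.
coauthor = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0]])
area = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
gender = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])

unified, weights = unified_similarity([coauthor, area, gender],
                                      ranks=[3.0, 2.0, 1.0])
```

Because the weights sum to 1 and every input matrix lies in [0,1], the unified matrix also stays in [0,1] and remains symmetric.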
Slide 22 of 35: Generating the Final Clusters
- Calculate the normalized Laplacian of the unified similarity matrix
- Perform eigendecomposition
- Apply k-means to the eigenspace of the top k eigenvectors
- Generate the final clusters
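The steps above can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: the normalized matrix is taken to be D^{-1/2} S D^{-1/2}, the embedding rows are scaled to unit length before clustering, and k-means is a plain Lloyd's algorithm with a deterministic farthest-point initialization. All function names and the toy graph are hypothetical.

```python
import numpy as np

def normalized_affinity(similarity):
    """D^{-1/2} S D^{-1/2}, an assumed form of the normalized matrix."""
    degree = similarity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    return similarity * inv_sqrt[:, None] * inv_sqrt[None, :]

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(similarity, k):
    """Cluster vertices from a symmetric, non-negative similarity matrix."""
    similarity = np.asarray(similarity, dtype=float)
    matrix = normalized_affinity(similarity)
    # eigh returns eigenvalues in ascending order, so the last k columns are the top k.
    _, eigenvectors = np.linalg.eigh(matrix)
    top = eigenvectors[:, -k:]
    # Scale each row of the embedding to unit length before k-means.
    norms = np.linalg.norm(top, axis=1, keepdims=True)
    embedding = top / np.maximum(norms, 1e-12)
    return kmeans(embedding, k)

# Hypothetical unified similarity matrix: two tight groups joined by a weak link.
S = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    S[i, j] = S[j, i] = w

labels = spectral_clustering(S, k=2)
```

The cluster ids themselves are arbitrary; only the grouping of vertices is meaningful.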
Slide 23 of 35: CAMIR Clustering Process Diagram
- Step 1. Identify importance of vertex properties: properties ranking (iteratively select the most informative property)
- Step 2. Efficiently combine vertex properties: unified similarity matrix (normalize rankings and compute the unified similarity matrix)
- Step 3. Cluster the attributed multi-graph: generate the final clusters (apply spectral clustering), yielding Cluster 1, Cluster 2, …, Cluster k
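Slide 25 of 35: Evaluation - Datasets
- Real-world datasets (node and edge counts are missing from the source)
  - DBLP: bibliography networks
    - DBLP-1K: 2 attributes, 1 edge type, 3 vertex properties in total
    - DBLP-10K: 2 attributes, 1 edge type, 3 vertex properties in total
  - GoogleSP-23: Google software packages
    - 5 attributes, 2 edge types, 7 vertex properties in total
- Synthetic datasets (some values are missing from the source)
  - Varying size: nodes {100, 500, 1 000, 5 000, …}, edges {1 000 – …}, 1 edge type, 5 vertex properties in total
  - Varying attributes: 1 000 nodes, ~… edges, attributes {2, 4, 8, 16, 32}, 1 edge type, {3, 5, 9, 17, 33} vertex properties in total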
Slide 26 of 35: Evaluation Measures
- Entropy
  - Low entropy corresponds to high attribute homogeneity
- Normalized Mutual Information (NMI)
  - High NMI corresponds to high similarity between the resulting clustering and the ground truth
  - An NMI of 1 indicates a perfect match
- Runtime
  - Quad-core i7 2.8 GHz, 8 GB RAM
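A minimal sketch of the two quality measures. The slides do not give the exact formulas, so both definitions here are assumptions: entropy is computed per cluster for a single attribute and averaged with cluster-size weights, and NMI divides the mutual information by the geometric mean of the two label entropies, which is one of several common normalizations. Function names and sample data are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())

def cluster_attribute_entropy(labels, attribute):
    """Cluster-size-weighted average entropy of one attribute (assumed definition)."""
    n = len(labels)
    total = 0.0
    for cluster in set(labels):
        members = [a for l, a in zip(labels, attribute) if l == cluster]
        total += len(members) / n * shannon_entropy(members)
    return total

def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    h_a, h_b = shannon_entropy(labels_a), shannon_entropy(labels_b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put everything in one cluster
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster labeling carries no information
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mutual = sum(
        (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return mutual / math.sqrt(h_a * h_b)

truth = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0]                 # same grouping, different ids
area = ["IR", "IR", "IR", "DM", "DM", "AI"]    # hypothetical attribute values

score = nmi(truth, predicted)
homogeneity = cluster_attribute_entropy(predicted, area)
```

NMI ignores how the cluster ids are named, so a relabeled copy of the ground truth still scores as a perfect match.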
Slide 27 of 35: State-of-the-Art Competitors
- SACluster [VLDB 2009]
  - Similarity is defined as the random walk distance in the augmented graph
- BAGC [SIGMOD 2012]
  - Uses Bayesian inference to update the parameters of the cluster distributions
- PICS [SDM 2012]
  - Compresses adjacency and attribute matrices
- HASCOP [WI 2013]
  - Heuristic, distance-based
  - Applies to attributed multi-graphs
Slide 28 of 35: Evaluation - Synthetic Datasets
- CAMIR entropy is always less than 0.5
  - High attribute homogeneity
- CAMIR NMI is at least 0.8 in all experiments
  - High-quality results
- Similar behavior as we increase the number of attributes
Slide 29 of 35: Evaluation - Synthetic Datasets
- CAMIR is the 2nd fastest algorithm
  - Less than 10 secs for up to 5 000 vertices
- On average, CAMIR outperforms almost all its competitors
Slide 30 of 35: Evaluation - Real-world Datasets (DBLP-1K, DBLP-10K)
- CAMIR achieves the lowest entropy among its competitors
  - Efficiently ranks and combines vertex properties
  - Identifies clusters of arbitrary shapes and sizes (spectral clustering)
Slide 31 of 35: Evaluation - Real-world Datasets (GoogleSP-23)
- CAMIR achieves low entropy
- CAMIR achieves high NMI
  - Identifies a high percentage of software packages
Slide 32 of 35: Evaluation - Runtime and Entropy
[Table: runtime (secs) and entropy of CAMIR, BAGC, SACluster, PICS and HASCOP on DBLP-1K, DBLP-10K and GoogleSP-23; the values are missing from the source]
- CAMIR requires:
  - Less than 6 secs for ~1 000 vertices
  - About 8 minutes for […] vertices
- On average, CAMIR achieves a 55% time and 60% entropy improvement
- BAGC is the fastest method, but achieved limited clustering quality
- HASCOP achieved slightly better results than CAMIR, but it is the slowest method
Slide 34 of 35: Summary
- A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR
- A new mechanism to rank and weigh vertex properties
  - Identifies the importance of each attribute and edge-type property
- A unified similarity matrix for attributed multi-graphs
  - Efficiently combines vertex properties
- Identifies clusters of arbitrary sizes and shapes
- Effective in terms of clustering accuracy and computational time
Slide 35 of 35: Clustering Attributed Multi-graphs with Information Ranking
Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos
Department of Computer Science, University of Cyprus
Thank You!