A General and Scalable Approach to Mixed Membership Clustering Frank Lin ∙ William W. Cohen School of Computer Science ∙ Carnegie Mellon University December 11, 2012 ∙ International Conference on Data Mining 1

Mixed Membership Clustering 2

Motivation Spectral clustering is nice But two drawbacks: ◦ Computationally expensive ◦ No mixed-membership clustering 3

Our Solution Convert a node-centric representation of the graph to an edge-centric one Adapt this representation to work with a scalable clustering method - Power Iteration Clustering 4

Mixed Membership Clustering 5

Perspective Since ◦ an edge represents a relationship between two entities, and ◦ an entity can belong to as many groups as it has relationships, why don't we group the relationships instead of the entities? 6

Edge Clustering 7

Assumptions: ◦ An edge represents a relationship between two nodes ◦ A node can belong to multiple clusters, but an edge can only belong to one 8 Quite general – we can allow parallel edges if needed

Edge Clustering How to cluster edges? Need an edge-centric view of the graph G ◦ Traditionally: a line graph L(G) Problem: potential (and likely) size blow-up! size(L(G)) = O(size(G)^2) ◦ Our solution: a bipartite feature graph B(G) Space-efficient: size(B(G)) = O(size(G)) Transform edges into nodes! Side note: can also be used to represent tensors efficiently! 9

Edge Clustering [Figure: a five-node example graph G (nodes a-e, edges ab, ac, bc, cd, ce), its line graph L(G), which is costly for star-shaped structures, and the bipartite feature graph B(G), which only uses twice the space of G by turning each edge into a node] 10

Edge Clustering A general recipe: 1. Transform the affinity matrix A into B(A) 2. Run a clustering method to get an edge clustering 3. For each node, determine mixed membership based on the memberships of its incident edges The dimensions of B(A) are very big – we can only use sparse methods on large datasets Perfect for PIC and implicit manifolds! ☺ 11
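The first step of the recipe can be made concrete with a small sketch. This is not the authors' code; it assumes scipy is available and that A is a symmetric sparse affinity matrix, and it builds B(A) by giving every edge of the original graph its own node connected to the edge's two endpoints.

```python
# A sketch (not the authors' code): build the bipartite feature graph B(A) from a
# sparse symmetric affinity matrix A. Each undirected edge (u, v) of the original
# graph becomes a new "edge node" connected to u and v, so B(A) has |V|+|E| nodes.
from scipy.sparse import coo_matrix, triu

def bipartite_feature_graph(A):
    A = coo_matrix(A)
    upper = triu(A, k=1)              # visit each undirected edge once
    n, m = A.shape[0], upper.nnz      # |V| original nodes, |E| edges
    rows, cols, vals = [], [], []
    for e, (u, v, w) in enumerate(zip(upper.row, upper.col, upper.data)):
        edge_node = n + e             # index of the new node created for this edge
        for node in (u, v):
            rows += [node, edge_node]
            cols += [edge_node, node]
            vals += [w, w]
    size = n + m                      # (|V|+|E|) x (|V|+|E|), still O(size(G)) nonzeros
    return coo_matrix((vals, (rows, cols)), shape=(size, size)).tocsr()
```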

Edge Clustering 12 What are the dimensions of the matrix that represents B(A)? If A is a |V| x |V| matrix… Then B(A) is a (|V|+|E|) x (|V|+|E|) matrix! Need a clustering method that takes full advantage of the sparsity of B(A)!

Power Iteration Clustering: Quick Overview Spectral clustering methods are nice, a natural choice for graph data But they are expensive (slow) Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)! 13

The Power Iteration 14 Begins with a random vector Ends with a piece-wise constant vector! Overall absolute distance between points decreases, here we show relative distance

Implication We know: the 2nd to kth eigenvectors of W = D^-1 A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001) Then: a linear combination of piece-wise constant vectors is also piece-wise constant! 15

Spectral Clustering [Figure: example datasets, their 2nd and 3rd smallest eigenvectors plotted as value vs. index and colored by cluster (1, 2, 3), and the resulting clustering space] 16

Linear Combination… [Figure: a weighted combination a·(one piece-wise constant vector) + b·(another piece-wise constant vector) is still piece-wise constant] 17

Power Iteration Clustering [Figure: PIC results, the points plotted on the embedding vector v_t] 18

Power Iteration Clustering The algorithm: 19

Power Iteration Clustering Key idea: To do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace) We just need the clusters to be separated in some space 20

Mixed Membership Clustering with PIC Now we have ◦ a sparse matrix representation, and ◦ a fast clustering method that works on sparse matrices We're good to go! 21 Not so fast!! Iterative methods like PageRank and power iteration don't work on bipartite graphs, and B(A) is a bipartite graph! Solution: convert it to a unipartite (aperiodic) graph!

Mixed Membership Clustering with PIC Define a similarity function: 22 Similarity between edges i and j… increases with the incident nodes they have in common… and is inversely proportional to the number of edges each shared node is incident to Then we simply use a matrix S, where S(i,j) = s(i,j), in place of B(A)!
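One way to write down the similarity the slide describes (an interpretation, not necessarily the paper's exact formula): edges i and j become more similar for every node they share, discounted by how many edges that shared node touches, which makes S a product of sparse factors.

```latex
% An interpretation of the slide's description, not necessarily the paper's exact
% formula: edges i and j are similar for every node v they share, discounted by
% the number of edges v is incident to.
s(i, j) \;=\; \sum_{v \,\in\, i \cap j} \frac{1}{\deg(v)}
\qquad\Longleftrightarrow\qquad
S \;=\; F N F^{\top}
% where F is the |E| x |V| edge-node incidence matrix and N = diag(1/deg(v)).
```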

Mixed Membership Clustering with PIC Now we have ◦ a sparse matrix representation, and ◦ a fast clustering method that works on sparse matrices, and ◦ a unipartite graph We're good to go? 23 Similar to line graphs, the matrix S may no longer be sparse (e.g., for star shapes)! Back to where we started?

Mixed Membership Clustering with PIC 24 Observations: S factors into a product of sparse matrices, S = F N F^T, where F is the |E| x |V| edge-node incidence matrix and N is diagonal with N(v,v) = 1/deg(v)

Mixed Membership Clustering with PIC Simply replace one line: 25

Mixed Membership Clustering with PIC Simply replace one line: 26 We get the exact same result, but with all sparse matrix operations
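To illustrate the replaced line, here is a hedged sketch assuming the S = F N F^T reading above (F the sparse edge-node incidence matrix); rather than forming the possibly dense S and iterating v ← D^-1 S v, it multiplies by the sparse factors right to left so every operation stays sparse.

```python
# A sketch of the replaced line (assumes the factorization S = F N F^T above; not the
# authors' code). Instead of materializing the possibly dense S and computing
# v <- D^-1 S v, multiply by the sparse factors right to left.
import numpy as np
from scipy.sparse import diags

def pic_step_factored(F, inv_deg, v):
    """F: |E| x |V| sparse edge-node incidence matrix, inv_deg: 1/deg(v), v: |E| vector."""
    N = diags(inv_deg)
    Sv = F @ (N @ (F.T @ v))                      # S v via three sparse mat-vec products
    d = F @ (N @ (F.T @ np.ones(F.shape[0])))     # row sums of S, i.e. the diagonal of D
    v_next = Sv / d                               # D^-1 S v
    return v_next / np.abs(v_next).sum()          # keep the iterate from growing or shrinking
```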

That’s pretty cool. But how well does it work? 27

Experiments Compare: ◦ NCut ◦ Node-PIC (single membership) ◦ MM-PIC using different cluster label schemes: Max - pick the most frequent edge cluster (single membership) - pick edge clusters with at least 40% frequency - pick edge clusters with at least 20% frequency - pick edge clusters with at least 10% frequency All - use all incident edge clusters 28 [a scale from 1 cluster label to many labels]
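For concreteness, a small sketch of how the label schemes above could turn edge-cluster labels into node memberships; the helper name and data layout are illustrative assumptions, while the thresholds are the ones on the slide.

```python
# A sketch of the label schemes above (helper name and data layout are illustrative):
# turn the cluster labels of a node's incident edges into that node's memberships.
from collections import Counter

def node_memberships(incident_edge_clusters, scheme="max", threshold=0.4):
    """incident_edge_clusters: cluster ids of the edges incident to one node."""
    counts = Counter(incident_edge_clusters)
    total = sum(counts.values())
    if scheme == "max":          # single membership: the most frequent edge cluster
        return [counts.most_common(1)[0][0]]
    if scheme == "all":          # mixed membership: every incident edge cluster
        return sorted(counts)
    # threshold scheme: keep clusters covering at least `threshold` of the incident edges
    return sorted(c for c, n in counts.items() if n / total >= threshold)
```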

Experiments Data source: ◦ BlogCat1 10,312 blogs and links 39 overlapping category labels ◦ BlogCat2 88,784 blogs and links 60 overlapping category labels Datasets: ◦ Pick pairs of categories with enough overlap (at least 1%) ◦ BlogCat1: 86 category pair datasets ◦ BlogCat2: 158 category pair datasets 29

Result F1 scores for clustering category pairs from the BlogCat1 dataset: Max is better than Node! Generally a lower threshold is better, but not All 30

Result 31 Important - MM-PIC wins where it matters: y-axis: difference in F1 score when the method “wins” x-axis: ratio of mixed membership instances Each point is a two-cluster dataset When MM-PIC does better, it does much better MM-PIC does better on datasets with more mixed membership instances # of datasets where the method “wins”

MM-PIC Result F1 scores for clustering category pairs from the (bigger) BlogCat2 dataset: More differences between thresholds Did not use NCut because the datasets are too big Threshold matters!

Result 33 Again, MM-PIC wins where it matters:

Questions [Thesis outline: ch2+3 PIC (ICML 2010) and ch4+5 MRW (ASONAM 2010), spanning clustering and classification; ch6 Implicit Manifolds, with ch6.1 IM-PIC (ECAI 2010) and ch6.2 IM-MRW (MLG 2011); ch7 MM-PIC (in submission); ch8 GK SSL (in submission); ch9 Future Work] 34

Additional Slides + 35

Power Iteration Clustering Spectral clustering methods are nice, a natural choice for graph data But they are expensive (slow) Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)! 36

Background: Spectral Clustering Normalized Cut algorithm (Shi & Malik 2000): 1. Choose k and a similarity function s 2. Derive A from s, and let W = I - D^-1 A, where D is a diagonal matrix with D(i,i) = Σ_j A(i,j) 3. Find the eigenvectors and corresponding eigenvalues of W 4. Pick the eigenvectors of W with the 2nd through kth smallest corresponding eigenvalues 5. Project the data points onto the space spanned by these eigenvectors 6. Run k-means on the projected data points 37
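A minimal sketch of the listed recipe, assuming scipy and scikit-learn and a small dense affinity matrix; it is meant to mirror the steps above rather than be an optimized implementation.

```python
# A minimal sketch of the Normalized Cut recipe above (assumes scipy and scikit-learn,
# and a small dense affinity matrix A; not an optimized implementation).
import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def ncut(A, k):
    d = A.sum(axis=1)                              # D(i,i) = sum_j A(i,j)
    W = np.eye(A.shape[0]) - A / d[:, None]        # W = I - D^-1 A
    vals, vecs = eig(W)                            # full eigendecomposition (W is not symmetric)
    order = np.argsort(vals.real)                  # ascending eigenvalues
    embedding = vecs[:, order[1:k]].real           # 2nd through kth smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```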

Background: Spectral Clustering [Figure: example datasets, their 2nd and 3rd smallest eigenvectors plotted as value vs. index and colored by cluster (1, 2, 3), and the resulting clustering space] 38

Background: Spectral Clustering Normalized Cut algorithm (Shi & Malik 2000): 1. Choose k and a similarity function s 2. Derive A from s, and let W = I - D^-1 A, where D is a diagonal matrix with D(i,i) = Σ_j A(i,j) 3. Find the eigenvectors and corresponding eigenvalues of W 4. Pick the eigenvectors of W with the 2nd through kth smallest corresponding eigenvalues 5. Project the data points onto the space spanned by these eigenvectors 6. Run k-means on the projected data points Finding the eigenvectors and eigenvalues of a matrix is slow in general Can we find a similar low-dimensional embedding for clustering without eigenvectors? 39 There are more efficient approximation methods* Note: the eigenvectors of I - D^-1 A corresponding to the smallest eigenvalues are the eigenvectors of D^-1 A corresponding to the largest

The Power Iteration The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix: v^(t+1) = c W v^t, where W is a square matrix, v^t is the vector at iteration t (v^0 is typically a random vector), and c is a normalizing constant to keep v^t from getting too large or too small Typically converges quickly; fairly efficient if W is a sparse matrix 40

The Power Iteration The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix: What if we let W = D^-1 A (like Normalized Cut)? 41 i.e., a row-normalized affinity matrix
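A small sketch of this row normalization with scipy (an illustration, not the authors' code):

```python
# A small sketch (assumes scipy): row-normalize a sparse affinity matrix, W = D^-1 A.
import numpy as np
from scipy.sparse import csr_matrix, diags

def row_normalize(A):
    A = csr_matrix(A)
    inv_row_sums = 1.0 / np.asarray(A.sum(axis=1)).ravel()   # 1 / degree of each node
    return diags(inv_row_sums) @ A                            # W = D^-1 A
```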

The Power Iteration 42 Begins with a random vector Ends with a piece-wise constant vector! Overall absolute distance between points decreases, here we show relative distance

Implication We know: the 2nd to kth eigenvectors of W = D^-1 A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001) Then: a linear combination of piece-wise constant vectors is also piece-wise constant! 43

Spectral Clustering [Figure: example datasets, their 2nd and 3rd smallest eigenvectors plotted as value vs. index and colored by cluster (1, 2, 3), and the resulting clustering space] 44

Linear Combination… [Figure: a weighted combination a·(one piece-wise constant vector) + b·(another piece-wise constant vector) is still piece-wise constant] 45

Power Iteration Clustering [Figure: PIC results, the points plotted on the embedding vector v_t] 46

Power Iteration Clustering Key idea: To do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace) We just need the clusters to be separated in some space 47

When to Stop The power iteration with its components: If we normalize: At the beginning, v changes fast, “accelerating” to converge locally due to “noise terms” with small λ When the “noise terms” have gone to zero, v changes slowly (“constant speed”) because only the larger-λ terms (2…k) are left, where the eigenvalue ratios are close to 1 48 Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e_1
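The argument rests on the standard eigendecomposition of the power iteration; writing it out with assumed notation, since the slide's own equations are not reproduced in this transcript:

```latex
% The standard expansion behind the argument (notation assumed; the slide's own
% equations are not reproduced here). Write v^0 in the eigenbasis {e_i} of W with
% eigenvalues lambda_1 >= lambda_2 >= ...:
v^{t} = W^{t} v^{0}
      = c_1 \lambda_1^{t} e_1 + c_2 \lambda_2^{t} e_2 + \cdots + c_n \lambda_n^{t} e_n
      = c_1 \lambda_1^{t} \left( e_1 + \sum_{i=2}^{n} \frac{c_i}{c_1}
        \left( \frac{\lambda_i}{\lambda_1} \right)^{t} e_i \right)
% After normalization, the terms with small |lambda_i / lambda_1| (the "noise terms")
% vanish first, and the surviving eigenvalue ratios, raised to the power t, govern
% how fast v converges to e_1.
```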

Power Iteration Clustering A basic power iteration clustering (PIC) algorithm: Input: A row-normalized affinity matrix W and the number of clusters k Output: Clusters C_1, C_2, …, C_k 1. Pick an initial vector v^0 2. Repeat: Set v^(t+1) ← W v^t Set δ^(t+1) ← |v^(t+1) – v^t| Increment t Stop when |δ^t – δ^(t-1)| ≈ 0 3. Use k-means to cluster the points on v^t and return clusters C_1, C_2, …, C_k 49
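A runnable sketch of this algorithm, assuming scipy and scikit-learn; the per-step normalization and the tolerance are illustrative choices rather than the authors' exact settings.

```python
# A runnable sketch of the PIC algorithm listed above (assumes scipy-compatible sparse
# W and scikit-learn; normalization and tolerance are illustrative choices).
import numpy as np
from sklearn.cluster import KMeans

def power_iteration_clustering(W, k, max_iter=1000, tol=1e-5):
    """W: row-normalized (sparse) affinity matrix, k: number of clusters."""
    n = W.shape[0]
    v = np.random.rand(n)
    v /= np.abs(v).sum()                       # initial vector v^0
    delta_prev = None
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()           # keep v from getting too large or too small
        delta = np.abs(v_new - v)              # delta_{t+1} = |v_{t+1} - v_t|
        if delta_prev is not None and np.abs(delta - delta_prev).max() < tol / n:
            v = v_new                          # stop when |delta_t - delta_{t-1}| ~ 0
            break
        v, delta_prev = v_new, delta
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
    return labels, v
```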

Evaluating Clustering for Network Datasets 50 Each dataset is an undirected, weighted, connected graph Every node is labeled by a human to belong to one of k classes Clustering methods are only given k and the input graph Clusters are matched to classes using the Hungarian algorithm We use classification metrics such as accuracy, precision, recall, and F1 score; we also use clustering metrics such as purity and normalized mutual information (NMI)
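A small sketch of the cluster-to-class matching step, assuming scipy and integer labels in 0..k-1; the function name is hypothetical.

```python
# A sketch of the evaluation step above (assumes scipy and integer labels 0..k-1):
# match predicted clusters to classes with the Hungarian algorithm, then score.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_accuracy(true_labels, cluster_labels, k):
    contingency = np.zeros((k, k), dtype=int)          # [cluster, class] co-occurrence counts
    for c, t in zip(cluster_labels, true_labels):
        contingency[c, t] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximize the total matched count
    mapping = dict(zip(rows, cols))
    remapped = np.array([mapping[c] for c in cluster_labels])
    return (remapped == np.asarray(true_labels)).mean()
```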

PIC Runtime [Figure: runtime comparison of PIC against Normalized Cut and Normalized Cut with faster eigencomputation; annotation: ran out of memory (24GB)] 51

PIC Accuracy on Network Datasets Upper triangle: PIC does better Lower triangle: NCut or NJW does better 52

Multi-Dimensional PIC One robustness question for vanilla PIC as data size and complexity grow: How many (noisy) clusters can you fit in one dimension without them “colliding”? 53 Cluster signals cleanly separated A little too close for comfort?

Multi-Dimensional PIC Solution: ◦ Run PIC d times with different random starts and construct a d-dimensional embedding ◦ It is unlikely that any pair of clusters collides on all d dimensions 54
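A sketch of this d-dimensional variant, reusing the hypothetical power_iteration_clustering() sketch from the earlier slide; each call starts from a fresh random vector, so the d embedding dimensions differ.

```python
# A sketch of the d-dimensional variant (reuses the hypothetical
# power_iteration_clustering() sketch from the earlier slide): run the iteration d
# times from different random starts and stack the results as a d-dimensional embedding.
import numpy as np
from sklearn.cluster import KMeans

def multi_dim_pic(W, k, d=4):
    columns = [power_iteration_clustering(W, k)[1] for _ in range(d)]  # keep only each v_t
    embedding = np.column_stack(columns)               # n x d embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```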

Multi-Dimensional PIC Results on network classification datasets: 55 RED: PIC using 1 random start vector GREEN: PIC using 1 degree start vector BLUE: PIC using 4 random start vectors 1-D PIC embeddings lose on accuracy at higher k's (# of clusters) compared to NCut and NJW But using 4 random vectors instead helps! Note # of vectors << k

PIC Related Work Related clustering methods: PIC is the only one using a reduced dimensionality – a critical feature for graph data! 56

Multi-Dimensional PIC Results on name disambiguation datasets: 57 Again, using 4 random vectors seems to work! Again, note # of vectors << k

PIC: Versus Popular Fast Sparse Eigencomputation Methods
◦ Successive Power Method: basic; numerically unstable, can be slow
◦ Lanczos Method (symmetric matrices) / Arnoldi Method (general matrices): more stable, but requires lots of memory
◦ Implicitly Restarted Lanczos Method (IRLM, symmetric) / Implicitly Restarted Arnoldi Method (IRAM, general): more memory-efficient
58
Method, time, space:
◦ IRAM: time (O(m^3) + (O(nm) + O(e)) × O(m-k)) × (# restarts), space O(e) + O(nm)
◦ PIC: time O(e) × (# iterations), space O(e)
Randomized sampling methods are also popular

PIC: Another View PIC’s low-dimensional embedding, which we will call a power iteration embedding (PIE), is related to diffusion maps: (Coifman & Lafon 2006) 59

PIC: Another View PIC’s low-dimensional embedding, which we will call a power iteration embedding (PIE), is related to diffusion maps: 60 (Coifman & Lafon 2006)

PIC: Another View PIC’s low-dimensional embedding, which we will call a power iteration embedding (PIE), is related to diffusion maps: 61 (Coifman & Lafon 2006)

PIC: Another View Result: PIE is a random projection of the data in the diffusion space W with scale parameter t We can use results from diffusion maps for applying PIC! We can also use results from random projection for applying PIC! 62

PIC Extension: Hierarchical Clustering Real, large-scale data may not have a “flat” clustering structure A hierarchical view may be more useful 63 Good News: The dynamics of a PIC embedding display a hierarchically convergent behavior!

PIC Extension: Hierarchical Clustering Why? Recall the PIC embedding at time t: 64 Less significant eigenvectors / structures go away first, one by one More salient structures stick around (the e's are eigenvectors, i.e., structures, from small to big) There may not be a clear eigengap, but rather a gradient of cluster saliency

PIC Extension: Hierarchical Clustering 65 PIC already converged to 8 clusters… But let’s keep on iterating… “N” still a part of the “2009” cluster… Similar behavior also noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering) Same dataset you’ve seen Yes (it might take a while)

Distributed / Parallel Implementations Distributed / parallel implementations of learning methods are necessary to support large-scale data given the direction of hardware development PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework We propose to use: Alternatives: 66 Existing graph analysis tool:

Adjacency Matrix vs. Similarity Matrix Adjacency matrix: Similarity matrix: Eigenanalysis: 67 Same eigenvectors and same ordering of eigenvalues! What about the normalized versions?

Adjacency Matrix vs. Similarity Matrix Normalized adjacency matrix: Normalized similarity matrix: Eigenanalysis: 68 The eigenvectors are the same if the degrees are the same Recent work on the degree-corrected Laplacian (Chaudhuri 2012) suggests that it is advantageous to tune α for clustering graphs with a skewed degree distribution, and does further analysis