Fast Effective Clustering for Graphs and Document Collections
William W. Cohen
Machine Learning Dept. and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Joint work with: Frank Lin

Introduction: trends in machine learning (1)
Supervised learning: given data (x_1,y_1),…,(x_n,y_n), learn to predict y from x
– y is a real number or a member of a small set
– x is a (sparse) vector
Semi-supervised learning: given data (x_1,y_1),…,(x_k,y_k), x_{k+1},…,x_n, learn to predict y from x
Unsupervised learning: given data x_1,…,x_n, find a “natural” clustering

Introduction: trends in machine learning (2)
Supervised learning: given data (x_1,y_1),…,(x_n,y_n), learn to predict y from x
– y is a real number or a member of a small set
– x is a (sparse) vector
– the x’s are all i.i.d., independent of each other
– y depends only on the corresponding x
Structured learning: the x’s and/or y’s are related to each other

Introduction: trends in machine learning (2, continued)
Structured learning: the x’s and/or y’s are related to each other
– General: x and y are in two parallel 1-d arrays
  – x’s are words in a document, y is a POS tag
  – x’s are words, y=1 if x is part of a company name
  – x’s are DNA codons, y=1 if x is part of a gene
  – …
– More general: x’s are nodes in a graph, y’s are labels for these nodes

Examples of classification in graphs
– x is a web page, edge is a hyperlink, y is the topic
– x is a word, edge is co-occurrence in similar contexts, y is semantics (distributional clustering)
– x is a protein, edge is an interaction, y is subcellular location
– x is a person, edge is a message, y is the organization
– x is a person, edge is friendship, y=1 if x smokes
– …
– x, y are anything; an edge from x_1 to x_2 indicates similarity between x_1 and x_2 (manifold learning)

Examples: Zachary’s karate club, political books

Political blog network
Adamic & Glance, “Divided They Blog: …”, 2004

This talk: Unsupervised learning in graphs
– aka: community detection, network modeling, …
– ~= low-dimensional matrix approximation, …
– Spectral clustering
Experiments:
– For networks with known “true” labels …
– can unsupervised learning recover these labels?
– (Usually) node identifiers and the topology of the graph are all that’s used by these methods…

Outline
– Introduction
– Background on spectral clustering
– “Power Iteration Clustering” [Lin & Cohen, ICML 2010]
  – Motivation
  – Experimental results
– Analysis: PIC vs spectral methods
– PIC for sparse bipartite graphs [Lin & Cohen, ECAI 2010]
  – Motivation & Method
  – Experimental Results

Spectral Clustering: Graph = Matrix
[Figure: a small example graph on nodes A–J and its adjacency matrix]

Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”
[Figure: the adjacency matrix of the transitively closed graph; each component appears as a diagonal “block”]
Of course we can’t see the “blocks” unless the nodes are sorted by cluster…

Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: the example graph, its matrix M, and a vector v assigning a weight to each node (e.g. A→3, B→2, C→3, …)]

Spectral Clustering: Graph = Matrix
M*v_1 = v_2 “propagates weights from neighbors”
[Figure: multiplying the adjacency matrix M by the weight vector v_1 sums each node’s neighbors’ weights, e.g. the new entry for A is 2*1 + 3*1 + 0*1]

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors”
W: normalized so columns sum to 1
[Figure: the same multiplication with the normalized matrix W, e.g. the new entry for A is 2*.5 + 3*.5 + 0*.3]
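
To make the propagation step concrete, here is a minimal numpy sketch (the three-node graph and the variable names are illustrative, not from the talk). The slide above normalizes the columns of A; later slides write the same update as v_t = D^-1 A v_{t-1}, i.e. rows normalized so each new entry is a weighted average of the old entries at the neighbors, and that row-normalized form is what the sketch uses.

```python
import numpy as np

# Toy undirected graph on three nodes, and a starting weight vector v1.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
v1 = np.array([3., 2., 3.])

# Degree matrix D = diag(A @ 1); W = D^-1 A normalizes each row of A to sum to 1.
d = A.sum(axis=1)
W = A / d[:, None]

# One propagation step: v2[i] is the weighted average of v1 over the neighbors of i.
v2 = W @ v1
print(v2)   # [2.5, 3.0, 2.5]
```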

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors”
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors”
How do I pick v to be an eigenvector for a block-stochastic matrix?

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors” [Shi & Meila, 2002]
[Figure: eigenvalue spectrum λ_1, λ_2, λ_3, λ_4, λ_5,6,7,… with the corresponding eigenvectors e_1, e_2, e_3; the drop after the first few eigenvalues is the “eigengap”]

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors” [Shi & Meila, 2002]
[Figure: nodes plotted in eigenvector coordinates (e_2, e_3); the points from different blocks (x’s, y’s, z’s) fall into separate groups]

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors”
If W is connected but roughly block diagonal with k blocks, then
– the top eigenvector is a constant vector
– the next k eigenvectors are roughly piecewise constant, with “pieces” corresponding to blocks

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 “propagates weights from neighbors”
If W is connected but roughly block diagonal with k blocks, then
– the “top” eigenvector is a constant vector
– the next k eigenvectors are roughly piecewise constant, with “pieces” corresponding to blocks
Spectral clustering:
– Find the top k+1 eigenvectors v_1,…,v_{k+1}
– Discard the “top” one
– Replace every node a with the k-dimensional vector x_a = <v_2(a),…,v_{k+1}(a)>
– Cluster with k-means
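
A compact sketch of that recipe, assuming numpy, scipy and scikit-learn are available (the function name and the dense eigendecomposition are my choices for illustration; a real implementation would use a sparse eigensolver):

```python
import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def spectral_cluster(A, k):
    """Cluster the nodes of a graph with adjacency/affinity matrix A into k groups."""
    d = A.sum(axis=1)
    W = A / d[:, None]                 # random-walk normalization: W = D^-1 A
    vals, vecs = eig(W)                # dense eigendecomposition (fine for small graphs)
    order = np.argsort(-vals.real)     # eigenvalues from largest to smallest
    X = vecs[:, order[1:k + 1]].real   # discard the top (constant) eigenvector, keep the next k
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```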

Spectral Clustering: Pros and Cons
– Elegant, and well-founded mathematically
– Tends to avoid local minima
  – Optimal solution to a relaxed version of the mincut problem (Normalized cut, aka NCut)
– Works quite well when relations are approximately transitive (like similarity, social connections)
– Expensive for very large datasets
  – Computing eigenvectors is the bottleneck
  – Approximate eigenvector computation is not always useful
– Noisy datasets sometimes cause problems
  – Picking the number of eigenvectors and k is tricky
  – “Informative” eigenvectors need not be in the top few
  – Performance can drop suddenly from good to terrible

Experimental results: best-case assignment of class labels to clusters

Spectral Clustering: Graph = Matrix
M*v_1 = v_2 “propagates weights from neighbors”
[Figure: repeating the multiplication with the unnormalized matrix M; node weights grow (e.g. A: 3 → 5, B: 2 → 6, C: 3 → 5)]

Repeated averaging with neighbors as a clustering method
– Pick a vector v_0 (maybe at random)
– Compute v_1 = W v_0
  – i.e., replace v_0[x] with the weighted average of v_0[y] for the neighbors y of x
– Plot v_1[x] for each x
– Repeat for v_2, v_3, …
– Variants are widely used for semi-supervised learning
  – with clamping of labels for nodes with known labels
– Without clamping, this will converge to a constant v_t
– What are the dynamics of this process?
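
In code, the process above is just a loop of matrix-vector products; a minimal sketch (W is the row-normalized matrix from the earlier numpy example):

```python
import numpy as np

def repeated_averaging(W, v0, steps=10):
    """Return [v0, v1, ..., v_steps], where v_t = W @ v_{t-1}."""
    vs = [np.asarray(v0, dtype=float)]
    for _ in range(steps):
        vs.append(W @ vs[-1])   # each entry becomes a weighted average over its neighbors
    return vs                   # plot vs[t] against node id to watch the clusters emerge
```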

Repeated averaging with neighbors on a sample problem…
– Create a graph, connecting all points in the 2-D initial space to all other points
– Weighted by distance
– Run power iteration for 10 steps
– Plot node id x vs v_10(x); nodes are ordered by actual cluster number
[Plot: v_10(x) against node id, with nodes grouped by true cluster (blue, green, red)]

Repeated averaging with neighbors on a sample problem…
[Plots: v_t(x) for the blue, green and red clusters at successive iterations; annotations on the slides: “smaller”, “larger”, and eventually “very small”]

PIC: Power Iteration Clustering
Run power iteration (repeated averaging with neighbors) with early stopping
– v_0: random start, or the “degree matrix” D, or …
– Easy to implement and efficient
– Very easily parallelized
– Experimentally, often better than traditional spectral methods
– Surprising, since the embedded space is 1-dimensional!
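
A sketch of PIC in this form; the acceleration-based stopping rule and the degree-vector start follow my reading of the ICML 2010 paper, and the constants (eps, max_iter) are placeholders rather than the authors’ exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def power_iteration_clustering(A, k, max_iter=1000, eps=1e-5):
    """Cluster the nodes of a graph with affinity matrix A into k groups using PIC."""
    n = A.shape[0]
    d = A.sum(axis=1)
    W = A / d[:, None]                       # W = D^-1 A
    v = d / d.sum()                          # start from the normalized degree vector
    delta_old = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum()  # keep the scale of v fixed across iterations
        delta = np.abs(v_new - v)
        # Early stopping: the change in the per-node change (the "acceleration") is tiny.
        if np.abs(delta - delta_old).max() < eps / n:
            v = v_new
            break
        delta_old, v = delta, v_new
    # The embedding is 1-dimensional: cluster the entries of v with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```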

Experiments
“Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by co-purchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
“Manifold” problems: cosine distance between classification instances
– Iris: 150 flowers, 3 classes
– PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– 20ngC: 20ngB plus docs from talk.politics.guns
– 20ngD: 20ngC plus docs from rec.sport.baseball

Experimental results: best-case assignment of class labels to clusters

Experiments: run time and scalability (times in milliseconds)

Outline
– Introduction
– Background on spectral clustering
– “Power Iteration Clustering”
  – Motivation
  – Experimental results
– Analysis: PIC vs spectral methods
– PIC for sparse bipartite graphs
  – Motivation & Method
  – Experimental Results

Analysis: why is this working?

[Slide: derivation figure (formulas not captured in the transcript); annotations: the “noise” terms, the L2 distance, “differences might cancel?”, “scaling?”]

Analysis: why is this working?
If
– the eigenvectors e_2,…,e_k are approximately piecewise constant on the blocks;
– λ_2,…,λ_k are “large” and λ_{k+1},… are “small” (e.g., if the matrix is block-stochastic);
– the c_i’s for v_0 are bounded;
– for any a, b from distinct blocks there is at least one e_i with e_i(a) − e_i(b) “large”;
then there exists an R so that
– spec(a,b) small ⟹ R·pic(a,b) small
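
In symbols (my reconstruction of the argument, assuming W is diagonalizable with eigenvalues 1 = λ_1 > λ_2 ≥ … and eigenvectors e_1, e_2, …): expand the start vector in the eigenbasis and compare two entries after t steps.

```latex
v_0 = c_1 e_1 + c_2 e_2 + \dots + c_n e_n
\;\Longrightarrow\;
v_t = W^t v_0 = c_1 e_1 + \sum_{i \ge 2} c_i \lambda_i^t \, e_i .

% The constant eigenvector e_1 cancels when two entries are compared, so
\mathrm{pic}(a,b) = \bigl| v_t(a) - v_t(b) \bigr|
                  = \Bigl| \sum_{i \ge 2} c_i \lambda_i^t \bigl( e_i(a) - e_i(b) \bigr) \Bigr| ,
\qquad
\mathrm{spec}(a,b) = \Bigl( \sum_{i=2}^{k+1} \bigl( e_i(a) - e_i(b) \bigr)^2 \Bigr)^{1/2} .
```

The λ_i^t factors are exactly the “soft” eigenvector weights mentioned on the next slide: terms with large eigenvalues survive, terms with small eigenvalues are damped away.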

Analysis: why is this working?
– Sum of differences vs. sum of squared differences
– “soft” eigenvector selection

Ncut with top k eigenvectors
Ncut with top 10 eigenvectors: weighted

PIC

Summary of results so far
– Both PIC and Ncut embed each graph node in a space where distance is meaningful
– Distances in “PIC space” and eigenspace are closely related
  – At least for many graphs suited to spectral clustering
– PIC does “soft” selection of eigenvectors
  – Strong eigenvalues give high weights
– PIC gives comparable-quality clusters
  – But is much faster

Outline
– Background on spectral clustering
– “Power Iteration Clustering”
  – Motivation
  – Experimental results
– Analysis: PIC vs spectral methods
– PIC for sparse bipartite graphs
  – “Lazy” Distance Computation
  – “Lazy” Normalization
  – Experimental Results

Motivation: Experimental Datasets are…
“Network” problems: natural graph structure
– PolBooks: 105 political books, 3 classes, linked by co-purchaser
– UMBCBlog: 404 political blogs, 2 classes, blogroll links
– AGBlog: 1222 political blogs, 2 classes, blogroll links
– Also: Zachary’s karate club, citation networks, ...
“Manifold” problems: cosine distance between all pairs of classification instances
– Iris: 150 flowers, 3 classes
– PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
– 20ngA: 200 docs, misc.forsale vs soc.religion.christian
– 20ngB: 400 docs, misc.forsale vs soc.religion.christian
– …
Gets expensive fast

Lazy computation of distances and normalizers
Recall PIC’s update is
– v_t = W · v_{t-1} = D^{-1} A · v_{t-1}
– …where D is the [diagonal] degree matrix: D = A·1 (1 is a column vector of 1’s)
My favorite distance metric for text is length-normalized TFIDF:
– Def’n: A(i,j) = <v_i, v_j> / (||v_i|| · ||v_j||), where <u,v> is the inner product and ||u|| is the L2-norm
– Let N(i,i) = ||v_i|| … and N(i,j) = 0 for i != j
– Let F(i,k) = TFIDF weight of word w_k in document v_i
– Then: A = N^{-1} F^T F N^{-1}

Lazy computation of distances and normalizers
Recall PIC’s update is
– v_t = W · v_{t-1} = D^{-1} A · v_{t-1}
– …where D is the [diagonal] degree matrix: D = A·1
– Let F(i,k) = TFIDF weight of word w_k in document v_i
– Compute N(i,i) = ||v_i|| … and N(i,j) = 0 for i != j
– Don’t compute A = N^{-1} F^T F N^{-1}
– Let D(i,i) = [N^{-1} F^T F N^{-1} · 1]_i, where 1 is an all-1’s vector
  – Computed as D = N^{-1}(F^T(F(N^{-1} · 1))) for efficiency
– New update: v_t = D^{-1} A · v_{t-1} = D^{-1} N^{-1} F^T F N^{-1} · v_{t-1}
This is equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices
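
A sparse-matrix sketch of that trick (scipy, with my own variable names; F here is stored document-by-term, so the product the slide writes as F^T F appears as F F^T in this orientation):

```python
import numpy as np
import scipy.sparse as sp

def lazy_pic_step(F, v):
    """One PIC update on the implicit cosine-similarity graph of the rows of F,
    without ever materializing the dense n-by-n similarity matrix A."""
    # N^-1: divide each document (row of F) by its L2 length.
    lengths = np.sqrt(F.multiply(F).sum(axis=1)).A.ravel()
    Fn = sp.diags(1.0 / lengths) @ F          # length-normalized documents

    # Degree vector D = A @ 1, evaluated right-to-left so everything stays sparse or thin.
    d = Fn @ (Fn.T @ np.ones(F.shape[0]))

    # Update v_t = D^-1 A v_{t-1}, again as a chain of sparse matrix-vector products.
    v_new = (Fn @ (Fn.T @ v)) / d
    return v_new / np.abs(v_new).sum()        # rescale, as in the dense PIC sketch
```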

Experimental results
RCV1 text classification dataset
– 800k+ newswire stories
– Category labels from industry vocabulary
– Took single-label documents and categories with at least 500 instances
– Result: 193,844 documents, 103 categories
Generated 100 random category pairs
– Each is all documents from two categories
– Range in size and difficulty
– Pick category 1, with m_1 examples
– Pick category 2 such that 0.5·m_1 < m_2 < 2·m_1

Results
– NCUTevd: Ncut with exact eigenvectors
– NCUTiram: Ncut with the implicitly restarted Arnoldi method
– No statistically significant differences between NCUTevd and PIC

Results

Linear run-time implies a constant number of iterations
The number of iterations to “acceleration-convergence” is hard to analyze:
– Faster than a single complete run of power iteration to convergence
– On our datasets a small number of iterations is typical; many more is exceptional

Related work
– Fowlkes et al., PAMI 2004; Yan et al., KDD 2009; …
  – Faster spectral clustering via sampling (vs. using all the data, as here)
– Zelnik-Manor & Perona, NIPS 2005; Xiang & Gong, Pattern Recognition 2008
  – Eigenvalue selection heuristics (“hard” selection, vs. the “soft” selection here)
– Dhillon, PAMI 2007
  – Fast multilevel kernel k-means to solve the NCut optimization problem
  – But: a much more complex method, with local-minima issues
– Tishby & Slonim, NIPS 2000; Zhou & Woodruff, PODS 2004; …
  – Also use the dynamics of the power iteration process to define distance metrics for clustering
  – But: matrix-matrix vs. matrix-vector multiplication, which is more expensive

Summary/conclusion
– Unsupervised network-based clustering is a very general setting for learning
  – Especially if you consider manifold-based learning settings
– Robustness and scalability are important
  – e.g., 144.5M pairs of NPs that co-occur nearby in English ClueWeb (2B sentences)
– PIC is
  – robust and effective
  – ~linear-time (up to 1000x faster than Ncut)

Thanks to… NIH/NIGMS, NSF, Google, Microsoft LiveLabs