A Very Fast Method for Clustering Big Text Datasets. Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University. ECAI 2010, 2010-08-18.

Presentation transcript:

A Very Fast Method for Clustering Big Text Datasets. Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University. ECAI 2010, Lisbon, Portugal.

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Preview: spectral clustering methods are nice, and we want to use them for clustering text data – a lot of it.

Preview: however, there are two obstacles: 1. spectral clustering computes eigenvectors – in general very slow; 2. spectral clustering uses a similarity matrix – too big to store and operate on. Our solution: power iteration clustering with path folding!

Preview, an accuracy result: [Scatter plot where each point is the accuracy on a 2-cluster text dataset. Upper triangle: we win; lower triangle: spectral clustering wins; diagonal: tied (most datasets).] There is no statistically significant difference – the same accuracy.

Preview, a scalability result: [Plot of algorithm runtime (log scale) vs. data size (log scale), with linear and quadratic reference curves. Spectral clustering (red and blue) tracks the quadratic curve; our method (green) tracks the linear curve.]

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

The Problem with Text Data: documents are often represented as feature vectors of words (e.g., columns for cool, web, search, make, over, you): “The importance of a Web page is an inherently subjective matter, which depends on the readers…”; “In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…”; “You're not cool just because you have a lot of followers on twitter, get over yourself…”

The Problem with Text Data: feature vectors are often sparse, but the similarity matrix is not! The feature matrix is mostly zeros – any document contains only a small fraction of the vocabulary. The similarity matrix is mostly non-zero – any two documents are likely to have a word in common. (A small sketch below illustrates the contrast.)
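
To make the contrast concrete, here is a minimal sketch (not from the paper) that builds a sparse bag-of-words representation of a tiny made-up corpus and compares its density to that of the pairwise similarity matrix; the corpus and variable names are illustrative only.

```python
# Illustrative sketch: sparse feature matrix F vs. dense similarity matrix F F^T.
# Assumes scikit-learn / scipy are installed; the toy corpus below is made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the importance of a web page is an inherently subjective matter",
    "google is a prototype of a large scale search engine",
    "you are not cool just because you have a lot of followers",
]
F = TfidfVectorizer().fit_transform(docs)   # sparse n x m document-term matrix
S = F @ F.T                                  # n x n similarity matrix (inner products)

print("feature matrix density:    %.2f" % (F.nnz / float(F.shape[0] * F.shape[1])))
print("similarity matrix density: %.2f" % (S.nnz / float(S.shape[0] * S.shape[1])))
```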

The Problem with Text Data: a similarity matrix is the input to many clustering methods, including spectral clustering. It takes O(n²) time to construct, O(n²) space to store, and more than O(n²) time to operate on. Spectral clustering additionally requires computing the eigenvectors of the similarity matrix – in general O(n³), and approximation methods are still not very fast. Too expensive! It does not scale up to big datasets!

The Problem with Text Data: we want to use the similarity matrix for clustering (like spectral clustering), but without calculating eigenvectors, and without constructing or storing the similarity matrix. Our answer: Power Iteration Clustering + Path Folding.

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Spectral Clustering: idea – instead of clustering data points in their original (Euclidean) space, cluster them in the space spanned by the “significant” eigenvectors of a (Laplacian) similarity matrix. A popular spectral clustering method is normalized cuts (NCut). In the unnormalized similarity matrix A, entry A_ij is the similarity between data points i and j; NCut uses the normalized matrix W = I - D⁻¹A, where I is the identity matrix and D is the diagonal matrix with D_ii = Σ_j A_ij.

Spectral Clustering: [Figure: a dataset and its normalized-cut results, with plots of the 2nd and 3rd smallest eigenvectors (value vs. index) for clusters 1, 2, 3, and the resulting clustering space.]

Spectral Clustering: a typical spectral clustering algorithm:
1. Choose k and a similarity function s.
2. Derive the affinity matrix A from s and transform A into a (normalized) Laplacian matrix W.
3. Find the eigenvectors and corresponding eigenvalues of W.
4. Pick the k eigenvectors of W with the smallest corresponding eigenvalues as the “significant” eigenvectors.
5. Project the data points onto the space spanned by these vectors.
6. Run k-means on the projected data points.

Spectral Clustering: the Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s.
2. Derive A from s and let W = I - D⁻¹A, where I is the identity matrix and D is a diagonal matrix with D_ii = Σ_j A_ij.
3. Find the eigenvectors and corresponding eigenvalues of W.
4. Pick the k eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues as the “significant” eigenvectors.
5. Project the data points onto the space spanned by these vectors.
6. Run k-means on the projected data points.
Finding the eigenvectors and eigenvalues of a matrix is very slow in general: O(n³). Can we find a similar low-dimensional embedding for clustering without eigenvectors?
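
For reference, a rough sketch of this NCut pipeline using a dense eigendecomposition, workable only for small n; this is not the authors' code, and the affinity matrix A is assumed to be given as a NumPy array.

```python
# Sketch of NCut-style spectral clustering; the O(n^3) eig call is the bottleneck.
import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def ncut_cluster(A, k):
    d = A.sum(axis=1)
    W = np.eye(len(A)) - A / d[:, None]   # W = I - D^{-1} A
    vals, vecs = eig(W)                   # full eigendecomposition, O(n^3)
    vals, vecs = vals.real, vecs.real
    order = np.argsort(vals)
    embed = vecs[:, order[1:k]]           # eigenvectors with 2nd..kth smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(embed)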

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

The Power Iteration, or the power method, is a simple iterative method for finding the dominant eigenvector of a matrix: v_{t+1} = c W v_t, where W is a square matrix, v_t is the vector at iteration t (v_0 is typically a random vector), and c is a normalizing constant that keeps v_t from getting too large or too small. It typically converges quickly, and it is fairly efficient if W is a sparse matrix.

The Power Iteration, or the power method, is a simple iterative method for finding the dominant eigenvector of a matrix. What if we let W = D⁻¹A (similar to Normalized Cut)?
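
A minimal sketch of the power iteration itself, assuming W is a row-normalized matrix such as W = D⁻¹A, given as a NumPy array or SciPy sparse matrix; the L1 normalization used here is one reasonable choice for the constant c, not necessarily the authors'.

```python
# Power iteration sketch: repeatedly multiply by W and renormalize.
import numpy as np

def power_iterate(W, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])       # v_0: a random vector
    for _ in range(iters):
        v = W @ v                    # v_{t+1} = W v_t
        v = v / np.abs(v).sum()      # c keeps v from growing or shrinking
    return v
```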

The Power Iteration: it turns out that, if there is some underlying cluster structure in the data, PI will quickly converge locally within clusters and then slowly converge globally to a constant vector. The locally converged vector, which is a linear combination of the top eigenvectors, will be nearly piecewise constant, with each piece corresponding to a cluster.

The Power Iteration: [Figure: the one-dimensional embedding, with values ranging from smaller to larger; colors correspond to what k-means would “think” to be clusters in this one-dimensional embedding.]

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Power Iteration Clustering: the 2nd to kth eigenvectors of W = D⁻¹A are roughly piecewise constant with respect to the underlying clusters, each separating one cluster from the rest of the data (Meila & Shi 2001). A linear combination of piecewise-constant vectors is also piecewise constant!

Spectral Clustering (recap): [Figure: the same dataset and normalized-cut results, with the 2nd and 3rd smallest eigenvectors (value vs. index) for clusters 1, 2, 3, and the clustering space.]

a · (2nd smallest eigenvector) + b · (3rd smallest eigenvector) = a linear combination that is still piecewise constant across the clusters.

Power Iteration Clustering: [Figure: the dataset and PIC results, with the one-dimensional embedding v_t.] Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., the distances between clusters in a k-dimensional eigenspace); we just need the clusters to be separated in some space.

Power Iteration Clustering: a basic power iteration clustering (PIC) algorithm.
Input: a row-normalized affinity matrix W and the number of clusters k. Output: clusters C_1, C_2, …, C_k.
1. Pick an initial vector v_0.
2. Repeat: set v_{t+1} ← W v_t; set δ_{t+1} ← |v_{t+1} - v_t|; increment t; stop when |δ_t - δ_{t-1}| ≈ 0.
3. Use k-means to cluster the points on v_t and return clusters C_1, C_2, …, C_k.
For more details on power iteration clustering, like how to stop, please refer to Lin & Cohen 2010 (ICML). Okay, we have a fast clustering method – but there’s the W that requires O(n²) storage space and construction and operation time!
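
A minimal sketch of this PIC loop, assuming W is available as a (sparse) row-normalized matrix; the stopping threshold here is illustrative only, see Lin & Cohen (ICML 2010) for the actual criterion.

```python
# Sketch of the basic PIC algorithm: power iteration with an early stop,
# followed by k-means on the one-dimensional embedding v_t.
import numpy as np
from sklearn.cluster import KMeans

def pic(W, k, tol=1e-5, max_iter=1000, seed=0):
    n = W.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.random(n)
    v = v / np.abs(v).sum()
    delta_old = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum()
        delta = np.abs(v_new - v)                        # delta_{t+1} = |v_{t+1} - v_t|
        v = v_new
        if np.abs(delta - delta_old).max() < tol / n:    # stop when |delta_t - delta_{t-1}| ~ 0
            break
        delta_old = delta
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```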

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Path Folding: recall the basic PIC algorithm above. Its key operation is the update v_{t+1} ← W v_t – note that this is a matrix-vector multiplication! So we have a fast clustering method – but there is still the matrix W, which requires O(n²) storage space and O(n²) construction and operation time!

Path Folding: what’s so good about matrix-vector multiplication? If we can decompose the matrix W into a product of matrices, then we arrive at the same solution by doing a series of matrix-vector multiplications! Wait – isn’t this more time-consuming than before? What’s the point? Well, not if W is dense… and the factors are sparse!

Path Folding: as long as we can decompose the matrix into a product of sparse matrices, we can turn a dense matrix-vector multiplication into a series of sparse matrix-vector multiplications. This means we can turn an operation that requires O(n²) storage and runtime into one that requires O(n) storage and runtime! This is exactly the case for text data – and for most other kinds of data as well!

Path Folding: but how does this help us? Don’t we still need to construct the matrix and then decompose it? No – for many similarity functions, the decomposition can be obtained at almost no cost.

Path Folding, example – inner product similarity: W = D⁻¹ F Fᵀ, where F is the original feature matrix (construction: given; storage: O(n)), Fᵀ is the feature matrix transposed (construction: given; storage: just use F), and D is the diagonal matrix that normalizes W so rows sum to 1 (construction: let d = F Fᵀ 1 and D(i,i) = d(i), which is O(n); storage: n).

Path Folding, example – inner product similarity. Iteration update: v_{t+1} ← D⁻¹(F(Fᵀ v_t)). Construction: O(n); storage: O(n); operation: O(n). Okay… how about a similarity function we actually use for text data?
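
A sketch of the folded update for inner-product similarity, assuming F is the sparse n×m document-term matrix (for example from TfidfVectorizer); only sparse matrix-vector products and a diagonal scaling are used, never the dense n×n matrix. In PIC the vector d only needs to be computed once, so each iteration costs on the order of the number of nonzeros in F.

```python
# Folded iteration update: v_{t+1} = D^{-1} (F (F^T v_t)), computed right to left.
import numpy as np

def folded_update(F, v, d=None):
    if d is None:
        # Row sums of F F^T, i.e., the diagonal of D, also obtained via folded products.
        d = np.asarray(F @ (F.T @ np.ones(F.shape[0]))).ravel()
    return np.asarray(F @ (F.T @ v)).ravel() / d
```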

Path Folding, example – cosine similarity. Iteration update: v_{t+1} ← D⁻¹(N(F(Fᵀ(N v_t)))), where N is the diagonal cosine-normalizing matrix (its diagonal holds the inverse norms of the feature vectors). Construction: O(n); storage: O(n); operation: O(n). Compact storage: we don’t need a cosine-normalized version of the feature vectors.
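
And a sketch of the cosine-similarity variant: the extra diagonal factor holds the inverse row norms of F, so the feature vectors themselves never need to be re-normalized or copied. The function names and structure are assumptions for illustration, not the paper's code.

```python
# Folded update for cosine similarity: v_{t+1} = D^{-1} N F F^T N v_t,
# where N is diagonal with N_ii = 1 / ||row i of F||.
import numpy as np
from scipy.sparse.linalg import norm as sparse_norm

def folded_update_cosine(F, v):
    # Assumes no all-zero document rows (otherwise the norms and d would be zero).
    n_inv = 1.0 / sparse_norm(F, axis=1)                       # diagonal of N
    num = n_inv * np.asarray(F @ (F.T @ (n_inv * v))).ravel()  # N F F^T N v
    d = n_inv * np.asarray(F @ (F.T @ n_inv)).ravel()          # row sums of N F F^T N, diagonal of D
    return num / d
```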

Path Folding: we refer to this technique as path folding due to its connection to “folding” a bipartite graph into a unipartite graph. More details on the bipartite-graph view are in the paper.

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Results: 100 datasets, each a combination of documents from 2 categories of the Reuters (RCV1) text collection, chosen at random; this way we have clustering datasets of varying size and difficulty. We compared the proposed method (PIC) to: k-means; Normalized Cut using eigendecomposition (NCUTevd); Normalized Cut using a fast approximation (NCUTiram).

Results: [Accuracy scatter plot; each point represents the accuracy result from one dataset. Lower triangle: k-means wins; upper triangle: PIC wins.]

Results: the two methods have almost the same behavior; overall, neither is statistically significantly better than the other.

Results: not sure why NCUTiram did not work as well as NCUTevd. Lesson: approximate eigencomputation methods may require expertise to work well.

Results: how fast do they run? [Plot of algorithm runtime (log scale) vs. data size (log scale), with linear and quadratic reference curves; PIC (green), NCUTiram (blue), NCUTevd (red).]

Results: PIC is O(n) per iteration and the runtime curve looks linear… but I don’t like eyeballing curves, and perhaps the number of iterations increases with the size or difficulty of the dataset? [Correlation plot; correlation statistic ranges from 0 (none) to 1 (correlated).]

Overview: Preview; The Problem with Text Data; Power Iteration Clustering (PIC) – Spectral Clustering, The Power Iteration, Power Iteration Clustering; Path Folding; Results; Related Work

Faster spectral clustering: approximate eigendecomposition (Lanczos, IRAM) and sampled eigendecomposition (Nyström) – these are not O(n)-time methods. Sparser matrices: sparse construction (k-nearest-neighbor graph, k-matching) and graph sampling / reduction – these still require O(n²) construction in general and are not O(n)-space methods.

Conclusion: PIC + path folding is a fast, space-efficient, scalable method for clustering text data, using the power of a full pairwise similarity matrix. It does not “engineer” the input; no sampling or sparsification needed! It scales well for most other kinds of data too – as long as the number of features is much less than the number of data points!

Perguntas? (Questions?)

Results

Linear run-time implies a constant number of iterations. The number of iterations to “acceleration-convergence” is hard to analyze: it is faster than a single complete run of power iteration to convergence, and on our datasets a small number of iterations is typical while larger counts are exceptional.

Related Clustering Work
Spectral Clustering (Roxborough & Sen 1997; Shi & Malik 2000; Meila & Shi 2001; Ng et al. 2002)
Kernel k-Means (Dhillon et al. 2007)
Modularity Clustering (Newman 2006)
Matrix Powering – Markovian relaxation & the information bottleneck method (Tishby & Slonim 2000); matrix powering (Zhou & Woodruff 2004); diffusion maps (Lafon & Lee 2006); Gaussian blurring mean-shift (Carreira-Perpinan 2006)
Mean-Shift Clustering – mean-shift (Fukunaga & Hostetler 1975; Cheng 1995; Comaniciu & Meer 2002); Gaussian blurring mean-shift (Carreira-Perpinan 2006)

Some “Powering” Methods at a Glance How far can we go with a one- or low-dimensional embedding?