1 Efficient Clustering of High-Dimensional Data Sets Andrew McCallum WhizBang! Labs & CMU Kamal Nigam WhizBang! Labs Lyle Ungar UPenn
2 Large Clustering Problems Many examples Many clusters Many dimensions Example Domains Text Images Protein Structure
3 The Citation Clustering Data Over 1,000,000 citations About 100,000 unique papers About 100,000 unique vocabulary words Over 1 trillion distance calculations
4 Reduce number of distance calculations [Bradley, Fayyad, Reina KDD-98] –Sample to find initial starting points for k-means or EM [Moore 98] –Use multi-resolution kd-trees to group similar data points [Omohundro 89] –Balltrees
5 The Canopies Approach Two distance metrics: cheap & expensive First Pass –very inexpensive distance metric –create overlapping canopies Second Pass –expensive, accurate distance metric –canopies determine which distances calculated
6 Illustrating Canopies
7 Overlapping Canopies
8 Creating canopies with two thresholds Put all points in D Loop: –Pick a point X from D –Put points within K_loose of X in canopy –Remove points within K_tight of X from D [Figure: loose and tight thresholds around X]
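The loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the names `make_canopies`, `cheap_dist`, `k_loose` and `k_tight` are hypothetical, the center is picked at random, and canopy membership is measured against all points (not only those still in D), which is one possible reading of the slide.

```python
import random

def make_canopies(points, cheap_dist, k_loose, k_tight, seed=0):
    """Return canopies as sorted lists of indices into `points`.

    Assumes cheap_dist(x, x) == 0 and 0 <= k_tight <= k_loose, so every
    picked center removes at least itself and the loop terminates.
    """
    if k_tight > k_loose:
        raise ValueError("k_tight must not exceed k_loose")
    rng = random.Random(seed)
    remaining = set(range(len(points)))  # the set D from the slide
    canopies = []
    while remaining:
        # Pick a point X from D (random choice is an assumption).
        center = rng.choice(sorted(remaining))
        # Cheap distance from X to every point (assumption: all points,
        # including ones already removed from D).
        dists = {i: cheap_dist(points[center], points[i])
                 for i in range(len(points))}
        # Points within the loose threshold join this canopy.
        canopies.append(sorted(i for i, d in dists.items() if d <= k_loose))
        # Points within the tight threshold can no longer become centers.
        remaining -= {i for i, d in dists.items() if d <= k_tight}
    return canopies
```

For example, `make_canopies([0.0, 0.5, 1.0, 5.0, 5.4, 10.0], lambda a, b: abs(a - b), k_loose=2.0, k_tight=1.0)` groups the points near 0, near 5, and the lone point at 10 into separate canopies. A point lying between the two thresholds of several centers would appear in more than one canopy.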
9 Canopies Two distance metrics –cheap and approximate –expensive and accurate Two-pass clustering –create overlapping canopies –full clustering with limited distances Canopy property –points in same cluster will be in same canopy
10 Using canopies with GAC Calculate expensive distances between points in the same canopy All other distances default to infinity Sort finite distances and iteratively merge closest
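A rough sketch of this second pass, under stated assumptions: the slide does not name a linkage rule or a stopping criterion, so the code below uses single-link merging and stops at a hypothetical `max_merge_dist` threshold. The function name `canopy_gac` and its parameters are illustrative only.

```python
from itertools import combinations

def canopy_gac(points, canopies, expensive_dist, max_merge_dist):
    """Greedy agglomerative clustering restricted to within-canopy pairs."""
    # Only pairs that share a canopy get an expensive distance; every
    # other pair is treated as infinitely far apart and never computed.
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    edges = sorted((expensive_dist(points[i], points[j]), i, j)
                   for i, j in pairs)

    parent = list(range(len(points)))  # union-find forest over points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the closest pair first (single link is an assumption here).
    for dist, i, j in edges:
        if dist > max_merge_dist:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With canopies `[[0, 1, 2], [3, 4], [5]]` over six points, only four expensive distances are computed instead of the fifteen needed for all pairs.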
11 Computational Savings inexpensive metric << expensive metric number of canopies: c (large) canopies overlap: each point in f canopies roughly f*n/c points per canopy O(f^2 * n^2 / c) expensive distance calculations complexity reduction: O(f^2 / c) n = 10^6; k = 10^4; c = 1000; f small: computation reduced by factor of 1000
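The arithmetic behind the factor of 1000 can be checked directly. The sketch below counts c canopies of f*n/c points each, giving c * (f*n/c)^2 = f^2 * n^2 / c expensive distances. The function name is hypothetical, and f = 1 is an assumed value for "f small".

```python
def expensive_distance_counts(n, c, f):
    """Return (all-pairs count, canopy count, reduction factor)."""
    all_pairs = n * n                    # O(n^2) without canopies
    per_canopy = f * n / c               # roughly f*n/c points per canopy
    with_canopies = c * per_canopy ** 2  # equals f^2 * n^2 / c
    return all_pairs, with_canopies, all_pairs / with_canopies

# Slide values: n = 10^6, c = 1000; f = 1 is an assumption.
all_pairs, with_canopies, factor = expensive_distance_counts(10**6, 1000, 1)
```

The reduction factor is c / f^2, so heavier overlap erodes the savings quadratically: doubling f to 2 cuts the factor to a quarter.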
12 Experimental Results [Table comparing Complete GAC and Canopies GAC; columns: Minutes, F1]
13 Preserving Good Clustering Small, disjoint canopies → big time savings Large, overlapping canopies → original accurate clustering Goal: fast and accurate –requires good, cheap distance metric
14 Reduced Dimension Representations
15 Clustering finds groups of similar objects Understanding clusters can be difficult Important to understand/interpret results Patterns waiting to be discovered
16 A picture is worth 1000 clusters
17 Feature Subset Selection Find n features that work best for prediction Find n features such that distance on them best correlates with distance on all features Minimize:
18 Feature Subset Selection Suppose all features are relevant Does that mean dimensionality can't be reduced? No! Manifold in feature space is what counts, not relevance of individual features Manifold can have lower dimension than the number of features
19 PCA: Principal Component Analysis Given data in d dimensions Compute: –d-dim mean vector M –d x d covariance matrix C –eigenvectors and eigenvalues –Sort by eigenvalues –Select top k < d eigenvalues –Project data onto k eigenvectors
20 PCA Mean vector M:
21 PCA Covariance C:
22 PCA Eigenvectors –Unit vectors in directions of maximum variance Eigenvalues –Magnitude of the variance in the direction of each eigenvector
23 PCA Find largest eigenvalues and corresponding eigenvectors Project each point onto the k principal components using A, a d x k matrix whose columns are the k principal components
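The PCA steps on the preceding slides can be sketched with NumPy. The symbols M, C and A follow the slides. The function name `pca_project`, the 1/(n-1) covariance normalization, and centering the data before projection are assumptions, since the slide formulas did not survive extraction.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n x d) onto the top-k principal components."""
    M = X.mean(axis=0)                    # d-dim mean vector M
    Xc = X - M                            # center the data
    # d x d covariance matrix C (1/(n-1) normalization is an assumption).
    C = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh handles symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:k] # indices of the k largest
    A = eigvecs[:, order]                 # d x k, columns are unit vectors
    Y = Xc @ A                            # n x k projected coordinates
    return Y, A, eigvals[order]
```

Each returned eigenvalue equals the sample variance of the data along the corresponding column of A, which is why sorting by eigenvalue keeps the directions of greatest variance.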
24 PCA via Autoencoder ANN
25 Non-Linear PCA by Autoencoder
26 PCA needs vector representation 0-d: sample mean 1-d: y = mx + b 2-d: y_1 = mx + b; y_2 = m'x + b'
27 MDS: Multidimensional Scaling PCA requires vector representation Given pairwise distances between n points? Find coordinates for points in d dimensional space s.t. distances are preserved “best”
30 MDS Assign points to coords x_i in d-dim space –random coordinate values –principal components –dimensions with greatest variance Do gradient descent on coordinates x_i of each point i until distortion is minimized
31 Distortion
32 Distortion
33 Distortion
34 Gradient Descent on Coordinates
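A minimal sketch of MDS by gradient descent, using random initial coordinates (the first of the initialization options listed). The distortion formula on the slides was not recovered, so the code assumes raw stress, the sum over pairs i < j of (||x_i - x_j|| - d_ij)^2. The function names, learning rate and step count are illustrative choices.

```python
import math
import random

def stress(coords, D):
    """Assumed distortion: sum over i < j of (||x_i - x_j|| - D[i][j])^2."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds(D, dim=2, steps=2000, lr=0.01, seed=0):
    """Gradient descent on point coordinates given a distance matrix D."""
    rng = random.Random(seed)
    n = len(D)
    # Random initial coordinates.
    coords = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Guard against division by zero for coincident points.
                delta = math.dist(coords[i], coords[j]) or 1e-12
                # d/dx_i of (delta - D_ij)^2 = 2 (delta - D_ij) (x_i - x_j) / delta
                coef = 2.0 * (delta - D[i][j]) / delta
                for a in range(dim):
                    grads[i][a] += coef * (coords[i][a] - coords[j][a])
        # Update all points together after the full gradient is known.
        for i in range(n):
            for a in range(dim):
                coords[i][a] -= lr * grads[i][a]
    return coords
```

Calling `mds(D, steps=0)` returns the random starting layout, so `stress` before and after descent can be compared. The result is determined only up to rotation, reflection and translation, and gradient descent may settle in a local minimum.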
35 Subjective Distances Brazil USA Egypt Congo Russia France Cuba Yugoslavia Israel China
38 How Many Dimensions? D too large –perfect fit, no distortion –not easy to understand/visualize D too small –poor fit, much distortion –easy to visualize, but pattern may be misleading D just right?
42 Agglomerative Clustering of Proteins