Slide 1: Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU), Kamal Nigam (WhizBang! Labs), Lyle Ungar (UPenn)
Slide 2: Large Clustering Problems
Many examples, many clusters, many dimensions
Example domains:
– Text
– Images
– Protein structure
Slide 3: The Citation Clustering Data
– Over 1,000,000 citations
– About 100,000 unique papers
– About 100,000 unique vocabulary words
– Over 1 trillion distance calculations
Slide 4: Reduce the Number of Distance Calculations
– [Bradley, Fayyad, Reina KDD-98]: sample to find initial starting points for k-means or EM
– [Moore 98]: use multi-resolution kd-trees to group similar data points
– [Omohundro 89]: balltrees
Slide 5: The Canopies Approach
Two distance metrics: cheap & expensive
First pass:
– very inexpensive distance metric
– create overlapping canopies
Second pass:
– expensive, accurate distance metric
– canopies determine which distances are calculated
Slide 6: Illustrating Canopies
Slide 7: Overlapping Canopies
Slide 8: Creating Canopies with Two Thresholds
Put all points in D.
Loop until D is empty:
– Pick a point X from D
– Put all points within the loose threshold K_loose of X into a new canopy
– Remove all points within the tight threshold K_tight of X from D
(K_tight < K_loose, so nearby canopies overlap; a runnable sketch follows)
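A minimal sketch of the two-threshold construction. Assumptions: `points` is an (n, d) NumPy array and the cheap metric is Euclidean distance (the paper's cheap metric for text is an inverted-index measure, so this is a stand-in).

```python
# Canopy creation with two thresholds, per the loop on slide 8.
import numpy as np

def make_canopies(points, k_loose, k_tight, rng=None):
    """Return a list of canopies, each a list of point indices."""
    rng = rng or np.random.default_rng()
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        center = int(rng.choice(list(remaining)))      # pick a point X from D
        d = np.linalg.norm(points - points[center], axis=1)
        # Everything within the loose threshold joins this canopy.
        canopies.append([i for i in range(len(points)) if d[i] < k_loose])
        # Points within the tight threshold may not start new canopies.
        remaining -= {i for i in remaining if d[i] < k_tight}
    return canopies

# Example: 1,000 random 2-D points with loose/tight thresholds 0.4 / 0.2.
pts = np.random.default_rng(0).random((1000, 2))
canopies = make_canopies(pts, k_loose=0.4, k_tight=0.2)
```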
Slide 9: Canopies
Two distance metrics:
– cheap and approximate
– expensive and accurate
Two-pass clustering:
– create overlapping canopies
– full clustering with a limited set of distances
Canopy property:
– points in the same cluster will be in the same canopy
Slide 10: Using Canopies with GAC (Greedy Agglomerative Clustering)
– Calculate expensive distances only between points in the same canopy
– All other distances default to infinity
– Sort the finite distances and iteratively merge the closest pair
(a sketch follows)
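A minimal sketch of canopy-restricted GAC, treating "iteratively merge the closest pair" as single-link merging over the finite distances (equivalent to Kruskal's algorithm); `expensive_dist` is a stand-in for the accurate metric.

```python
# Greedy agglomerative clustering restricted to canopies. Only pairs
# sharing a canopy get an expensive distance; all other pairs sit at
# infinity and are never considered.
import itertools
import numpy as np

def canopy_gac(points, canopies, n_clusters, expensive_dist):
    pairs = set()
    for canopy in canopies:
        pairs.update(itertools.combinations(sorted(canopy), 2))
    edges = sorted((expensive_dist(points[i], points[j]), i, j)
                   for i, j in pairs)

    # Union-find; repeatedly merging the closest finite pair is
    # single-link agglomeration (Kruskal's algorithm).
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    merges_left = len(points) - n_clusters
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges_left -= 1
            if merges_left == 0:
                break
    return [find(i) for i in range(len(points))]

# Example, reusing pts and canopies from the previous sketch:
# labels = canopy_gac(pts, canopies, 10, lambda a, b: np.linalg.norm(a - b))
```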
Slide 11: Computational Savings
– inexpensive metric << expensive metric
– number of canopies: c (large)
– canopies overlap: each point is in f canopies
– roughly f·n/c points per canopy
– O(f^2·n^2/c) expensive distance calculations
– complexity reduction: O(f^2/c)
– n = 10^6; k = 10^4; c = 1000; f small: computation reduced by a factor of 1000
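As a worked check of the claimed factor, using the slide's numbers and treating f as a small constant (f ≈ 1):

```latex
\underbrace{n^2}_{\text{full GAC}} = (10^6)^2 = 10^{12},
\qquad
\underbrace{c\,(fn/c)^2}_{\text{canopy GAC}} = \frac{f^2 n^2}{c}
  \approx \frac{10^{12}}{10^{3}} = 10^{9},
\qquad
\text{reduction} = \frac{f^2}{c} \approx \frac{1}{1000}
```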
Slide 12: Experimental Results

                F1      Minutes
Complete GAC    0.835   134.09
Canopies GAC    0.838   7.65
Slide 13: Preserving Good Clustering
– Small, disjoint canopies give big time savings
– Large, overlapping canopies reproduce the original accurate clustering
– Goal: fast and accurate
  – requires a good, cheap distance metric
Slide 14: Reduced Dimension Representations
Slide 15:
– Clustering finds groups of similar objects
– Understanding clusters can be difficult
– It is important to understand/interpret clustering results
– Patterns are waiting to be discovered
Slide 16: A picture is worth 1000 clusters
Slide 17: Feature Subset Selection
– Find the n features that work best for prediction
– Find the n features such that distance on them best correlates with distance on all features
– Minimize the mismatch between the two distances (the objective is an image on the slide; a reconstruction follows)
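A standard formulation consistent with the slide's text (an assumption, not the slide's own formula), with F the full feature set, S the selected size-n subset, and d_S, d_F distances computed on those features:

```latex
\min_{S \subseteq F,\ |S| = n}\;
\sum_{i<j} \bigl( d_S(x_i, x_j) - d_F(x_i, x_j) \bigr)^2
```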
Slide 18: Feature Subset Selection
– Suppose all features are relevant. Does that mean dimensionality can't be reduced?
– No! The manifold the data occupies in feature space is what counts, not the relevance of individual features
– The manifold can have lower dimension than the feature space
Slide 19: PCA: Principal Component Analysis
Given data in d dimensions, compute:
– the d-dimensional mean vector M
– the d×d covariance matrix C
– the eigenvectors and eigenvalues of C
– sort by eigenvalue
– select the top k < d eigenvalues
– project the data onto the corresponding k eigenvectors
(a sketch follows)
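A minimal NumPy sketch of this recipe, assuming `X` is an (n, d) array of points:

```python
# PCA per the steps above: mean, covariance, eigendecomposition,
# sort, select top k, project.
import numpy as np

def pca(X, k):
    M = X.mean(axis=0)                    # d-dimensional mean vector M
    C = np.cov(X - M, rowvar=False)       # d x d covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by eigenvalue, descending
    A = eigvecs[:, order[:k]]             # d x k: top-k eigenvectors as columns
    return (X - M) @ A                    # project onto the k components

# Example: reduce 500 ten-dimensional points to 2 dimensions.
Y = pca(np.random.default_rng(0).normal(size=(500, 10)), k=2)
```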
Slide 20: PCA
Mean vector M (the slide's formula, reconstructed from the standard definition):
M = (1/n) Σ_{i=1..n} x_i
Slide 21: PCA
Covariance matrix C (the slide's formula, reconstructed from the standard definition):
C = (1/n) Σ_{i=1..n} (x_i − M)(x_i − M)^T
Slide 22: PCA
– Eigenvectors: unit vectors in the directions of maximum variance
– Eigenvalues: the magnitude of the variance in the direction of each eigenvector
Slide 23: PCA
– Find the k largest eigenvalues and their corresponding eigenvectors
– Project each point x onto the k principal components: y = A^T (x − M), where A is a d×k matrix whose columns are the k principal components
Slide 24: PCA via Autoencoder ANN
Slide 25: Non-Linear PCA by Autoencoder
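A minimal sketch of the idea, assuming PyTorch: a bottleneck of k units forced to reconstruct the input learns a k-dimensional encoding, and the non-linear hidden layers let it fit curved manifolds. With purely linear layers, the bottleneck spans the same subspace as the top-k principal components.

```python
# Non-linear PCA as an autoencoder: compress d -> k -> d and train to
# reproduce the input. Dimensions and data below are stand-ins.
import torch
from torch import nn

d, k = 10, 2
autoencoder = nn.Sequential(
    nn.Linear(d, 8), nn.Tanh(),   # non-linear encoder
    nn.Linear(8, k),              # bottleneck: the reduced representation
    nn.Linear(k, 8), nn.Tanh(),   # non-linear decoder
    nn.Linear(8, d),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(500, d)           # stand-in data
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(autoencoder(X), X)  # reconstruct the input from itself
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = autoencoder[:3](X)    # encoder half only: the k-dim codes
```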
Slide 26: PCA Needs a Vector Representation
PCA fits linear models of increasing dimension:
– 0-d: the sample mean
– 1-d: a line, y = mx + b
– 2-d: a plane, y1 = mx + b; y2 = m'x + b'
Slide 27: MDS: Multidimensional Scaling
– PCA requires a vector representation
– What if we are given only the pairwise distances between n points?
– Find coordinates for the points in d-dimensional space such that the distances are preserved as well as possible
Slide 30: MDS
Assign initial coordinates x_i in d-dimensional space using:
– random coordinate values,
– the principal components, or
– the dimensions with the greatest variance
Then do gradient descent on the coordinates x_i of each point until the distortion is minimized
Slides 31–33: Distortion
(the three distortion formulas are images on the original slides; a reconstruction follows)
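One common distortion (raw stress) consistent with the MDS setup, offered as an assumption since the slides' own formulas are images; δ_ij are the given pairwise distances and d_ij the distances between the fitted coordinates:

```latex
\text{Distortion} = \sum_{i<j} \left( d_{ij} - \delta_{ij} \right)^2,
\qquad d_{ij} = \lVert x_i - x_j \rVert
```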
Slide 34: Gradient Descent on Coordinates
(the update rule is an image on the slide; a sketch follows)
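A minimal NumPy sketch of MDS by gradient descent on the coordinates, assuming the raw-stress distortion above; `delta` is the given n×n matrix of target pairwise distances:

```python
# Gradient descent on the coordinates x_i to minimize
# sum_{i<j} (d_ij - delta_ij)^2.
import numpy as np

def mds(delta, d=2, lr=0.01, steps=2000, rng=None):
    rng = rng or np.random.default_rng()
    n = delta.shape[0]
    X = rng.normal(size=(n, d))                 # random initial coordinates
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]    # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=-1)    # current distances d_ij
        np.fill_diagonal(dist, 1.0)             # avoid divide-by-zero (diff is 0 there)
        # dStress/dx_i = 2 * sum_j (d_ij - delta_ij)/d_ij * (x_i - x_j)
        grad = 2 * (((dist - delta) / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# Example: recover 2-D structure from the distances of 30 random points.
pts = np.random.default_rng(1).random((30, 2))
delta = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = mds(delta, d=2)
```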
Slide 35: Subjective Distances
Brazil, USA, Egypt, Congo, Russia, France, Cuba, Yugoslavia, Israel, China
Slide 38: How Many Dimensions?
D too large:
– perfect fit, no distortion
– not easy to understand/visualize
D too small:
– poor fit, much distortion
– easy to visualize, but the pattern may be misleading
D just right?
(one empirical way to choose D is sketched below)
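A small sketch of picking D empirically, assuming scikit-learn's MDS: fit at several dimensions and look for the elbow where the stress stops dropping sharply (the data below is a stand-in):

```python
# Fit MDS at d = 1..5 on a precomputed distance matrix and print the
# residual stress for each; the elbow suggests "D just right".
import numpy as np
from sklearn.manifold import MDS

pts = np.random.default_rng(0).random((50, 3))   # stand-in data
delta = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

for d in range(1, 6):
    model = MDS(n_components=d, dissimilarity="precomputed", random_state=0)
    model.fit(delta)
    print(d, model.stress_)
```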
Slide 42: Agglomerative Clustering of Proteins