Slide 1: Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU), Kamal Nigam (WhizBang! Labs), Lyle Ungar (UPenn)
Slide 2: Large Clustering Problems
Many examples, many clusters, many dimensions
Example domains:
– Text
– Images
– Protein structure
Slide 3: The Citation Clustering Data
– Over 1,000,000 citations
– About 100,000 unique papers
– About 100,000 unique vocabulary words
– Over 1 trillion distance calculations
Slide 4: Reduce the Number of Distance Calculations
– [Bradley, Fayyad, Reina KDD-98]: sample to find initial starting points for k-means or EM
– [Moore 98]: use multi-resolution kd-trees to group similar data points
– [Omohundro 89]: balltrees
Slide 5: The Canopies Approach
Two distance metrics: cheap & expensive
First pass:
– very inexpensive distance metric
– create overlapping canopies
Second pass:
– expensive, accurate distance metric
– canopies determine which distances are calculated
Slide 6: Illustrating Canopies
Slide 7: Overlapping Canopies
Slide 8: Creating Canopies with Two Thresholds
Put all points in D.
Loop until D is empty:
– Pick a point X from D
– Put all points within the loose threshold K_loose of X into a new canopy
– Remove all points within the tight threshold K_tight of X from D
(K_tight < K_loose, so nearby canopies overlap; a runnable sketch follows)
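A minimal sketch of the two-threshold construction. Assumptions: `points` is an (n, d) NumPy array and the cheap metric is Euclidean distance (the paper's cheap metric for text is an inverted-index measure, so this is a stand-in).

```python
# Canopy creation with two thresholds, per the loop on slide 8.
import numpy as np

def make_canopies(points, k_loose, k_tight, rng=None):
    """Return a list of canopies, each a list of point indices."""
    rng = rng or np.random.default_rng()
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        center = int(rng.choice(list(remaining)))      # pick a point X from D
        d = np.linalg.norm(points - points[center], axis=1)
        # Everything within the loose threshold joins this canopy.
        canopies.append([i for i in range(len(points)) if d[i] < k_loose])
        # Points within the tight threshold may not start new canopies.
        remaining -= {i for i in remaining if d[i] < k_tight}
    return canopies

# Example: 1,000 random 2-D points with loose/tight thresholds 0.4 / 0.2.
pts = np.random.default_rng(0).random((1000, 2))
canopies = make_canopies(pts, k_loose=0.4, k_tight=0.2)
```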
Slide 9: Canopies
Two distance metrics:
– cheap and approximate
– expensive and accurate
Two-pass clustering:
– create overlapping canopies
– full clustering with a limited set of distances
Canopy property:
– points in the same cluster will be in the same canopy
Slide 10: Using Canopies with GAC (Greedy Agglomerative Clustering)
– Calculate expensive distances only between points in the same canopy
– All other distances default to infinity
– Sort the finite distances and iteratively merge the closest pair
(a sketch follows)
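A minimal sketch of canopy-restricted GAC, treating "iteratively merge the closest pair" as single-link merging over the finite distances (equivalent to Kruskal's algorithm); `expensive_dist` is a stand-in for the accurate metric.

```python
# Greedy agglomerative clustering restricted to canopies. Only pairs
# sharing a canopy get an expensive distance; all other pairs sit at
# infinity and are never considered.
import itertools
import numpy as np

def canopy_gac(points, canopies, n_clusters, expensive_dist):
    pairs = set()
    for canopy in canopies:
        pairs.update(itertools.combinations(sorted(canopy), 2))
    edges = sorted((expensive_dist(points[i], points[j]), i, j)
                   for i, j in pairs)

    # Union-find; repeatedly merging the closest finite pair is
    # single-link agglomeration (Kruskal's algorithm).
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    merges_left = len(points) - n_clusters
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges_left -= 1
            if merges_left == 0:
                break
    return [find(i) for i in range(len(points))]

# Example, reusing pts and canopies from the previous sketch:
# labels = canopy_gac(pts, canopies, 10, lambda a, b: np.linalg.norm(a - b))
```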
Slide 11: Computational Savings
– inexpensive metric << expensive metric
– number of canopies: c (large)
– canopies overlap: each point is in f canopies
– roughly f·n/c points per canopy
– O(f^2·n^2/c) expensive distance calculations
– complexity reduction: O(f^2/c)
– n = 10^6; k = 10^4; c = 1000; f small: computation reduced by a factor of 1000
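As a worked check of the claimed factor, using the slide's numbers and treating f as a small constant (f ≈ 1):

```latex
\underbrace{n^2}_{\text{full GAC}} = (10^6)^2 = 10^{12},
\qquad
\underbrace{c\,(fn/c)^2}_{\text{canopy GAC}} = \frac{f^2 n^2}{c}
  \approx \frac{10^{12}}{10^{3}} = 10^{9},
\qquad
\text{reduction} = \frac{f^2}{c} \approx \frac{1}{1000}
```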
Slide 12: Experimental Results

                F1      Minutes
Complete GAC    0.835   134.09
Canopies GAC    0.838   7.65
Slide 13: Preserving Good Clustering
– Small, disjoint canopies give big time savings
– Large, overlapping canopies reproduce the original accurate clustering
– Goal: fast and accurate
  – requires a good, cheap distance metric
Slide 14: Reduced Dimension Representations
Slide 15:
– Clustering finds groups of similar objects
– Understanding clusters can be difficult
– It is important to understand/interpret clustering results
– Patterns are waiting to be discovered
Slide 16: A picture is worth 1000 clusters
Slide 17: Feature Subset Selection
– Find the n features that work best for prediction
– Find the n features such that distance on them best correlates with distance on all features
– Minimize the mismatch between the two distances (the objective is an image on the slide; a reconstruction follows)
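A standard formulation consistent with the slide's text (an assumption, not the slide's own formula), with F the full feature set, S the selected size-n subset, and d_S, d_F distances computed on those features:

```latex
\min_{S \subseteq F,\ |S| = n}\;
\sum_{i<j} \bigl( d_S(x_i, x_j) - d_F(x_i, x_j) \bigr)^2
```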
Slide 18: Feature Subset Selection
– Suppose all features are relevant. Does that mean dimensionality can't be reduced?
– No! The manifold the data occupies in feature space is what counts, not the relevance of individual features
– The manifold can have lower dimension than the feature space
Slide 19: PCA: Principal Component Analysis
Given data in d dimensions, compute:
– the d-dimensional mean vector M
– the d×d covariance matrix C
– the eigenvectors and eigenvalues of C
– sort by eigenvalue
– select the top k < d eigenvalues
– project the data onto the corresponding k eigenvectors
(a sketch follows)
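A minimal NumPy sketch of this recipe, assuming `X` is an (n, d) array of points:

```python
# PCA per the steps above: mean, covariance, eigendecomposition,
# sort, select top k, project.
import numpy as np

def pca(X, k):
    M = X.mean(axis=0)                    # d-dimensional mean vector M
    C = np.cov(X - M, rowvar=False)       # d x d covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by eigenvalue, descending
    A = eigvecs[:, order[:k]]             # d x k: top-k eigenvectors as columns
    return (X - M) @ A                    # project onto the k components

# Example: reduce 500 ten-dimensional points to 2 dimensions.
Y = pca(np.random.default_rng(0).normal(size=(500, 10)), k=2)
```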
Slide 20: PCA
Mean vector M (the slide's formula, reconstructed from the standard definition):
M = (1/n) Σ_{i=1..n} x_i
Slide 21: PCA
Covariance matrix C (the slide's formula, reconstructed from the standard definition):
C = (1/n) Σ_{i=1..n} (x_i − M)(x_i − M)^T
Slide 22: PCA
– Eigenvectors: unit vectors in the directions of maximum variance
– Eigenvalues: the magnitude of the variance in the direction of each eigenvector
Slide 23: PCA
– Find the k largest eigenvalues and their corresponding eigenvectors
– Project each point x onto the k principal components: y = A^T (x − M), where A is a d×k matrix whose columns are the k principal components
Slide 24: PCA via Autoencoder ANN
Slide 25: Non-Linear PCA by Autoencoder
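A minimal sketch of the idea, assuming PyTorch: a bottleneck of k units forced to reconstruct the input learns a k-dimensional encoding, and the non-linear hidden layers let it fit curved manifolds. With purely linear layers, the bottleneck spans the same subspace as the top-k principal components.

```python
# Non-linear PCA as an autoencoder: compress d -> k -> d and train to
# reproduce the input. Dimensions and data below are stand-ins.
import torch
from torch import nn

d, k = 10, 2
autoencoder = nn.Sequential(
    nn.Linear(d, 8), nn.Tanh(),   # non-linear encoder
    nn.Linear(8, k),              # bottleneck: the reduced representation
    nn.Linear(k, 8), nn.Tanh(),   # non-linear decoder
    nn.Linear(8, d),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(500, d)           # stand-in data
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(autoencoder(X), X)  # reconstruct the input from itself
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = autoencoder[:3](X)    # encoder half only: the k-dim codes
```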
Slide 26: PCA Needs a Vector Representation
PCA fits linear models of increasing dimension:
– 0-d: the sample mean
– 1-d: a line, y = mx + b
– 2-d: a plane, y1 = mx + b; y2 = m'x + b'
Slide 27: MDS: Multidimensional Scaling
– PCA requires a vector representation
– What if we are given only the pairwise distances between n points?
– Find coordinates for the points in d-dimensional space such that the distances are preserved as well as possible
Slide 30: MDS
Assign initial coordinates x_i in d-dimensional space using:
– random coordinate values,
– the principal components, or
– the dimensions with the greatest variance
Then do gradient descent on the coordinates x_i of each point until the distortion is minimized
Slides 31–33: Distortion
(the three distortion formulas are images on the original slides; a reconstruction follows)
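One common distortion (raw stress) consistent with the MDS setup, offered as an assumption since the slides' own formulas are images; δ_ij are the given pairwise distances and d_ij the distances between the fitted coordinates:

```latex
\text{Distortion} = \sum_{i<j} \left( d_{ij} - \delta_{ij} \right)^2,
\qquad d_{ij} = \lVert x_i - x_j \rVert
```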
Slide 34: Gradient Descent on Coordinates
(the update rule is an image on the slide; a sketch follows)
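A minimal NumPy sketch of MDS by gradient descent on the coordinates, assuming the raw-stress distortion above; `delta` is the given n×n matrix of target pairwise distances:

```python
# Gradient descent on the coordinates x_i to minimize
# sum_{i<j} (d_ij - delta_ij)^2.
import numpy as np

def mds(delta, d=2, lr=0.01, steps=2000, rng=None):
    rng = rng or np.random.default_rng()
    n = delta.shape[0]
    X = rng.normal(size=(n, d))                 # random initial coordinates
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]    # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=-1)    # current distances d_ij
        np.fill_diagonal(dist, 1.0)             # avoid divide-by-zero (diff is 0 there)
        # dStress/dx_i = 2 * sum_j (d_ij - delta_ij)/d_ij * (x_i - x_j)
        grad = 2 * (((dist - delta) / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X

# Example: recover 2-D structure from the distances of 30 random points.
pts = np.random.default_rng(1).random((30, 2))
delta = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = mds(delta, d=2)
```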
Slide 35: Subjective Distances
Brazil, USA, Egypt, Congo, Russia, France, Cuba, Yugoslavia, Israel, China
Slide 38: How Many Dimensions?
D too large:
– perfect fit, no distortion
– not easy to understand/visualize
D too small:
– poor fit, much distortion
– easy to visualize, but the pattern may be misleading
D just right?
(one empirical way to choose D is sketched below)
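A small sketch of picking D empirically, assuming scikit-learn's MDS: fit at several dimensions and look for the elbow where the stress stops dropping sharply (the data below is a stand-in):

```python
# Fit MDS at d = 1..5 on a precomputed distance matrix and print the
# residual stress for each; the elbow suggests "D just right".
import numpy as np
from sklearn.manifold import MDS

pts = np.random.default_rng(0).random((50, 3))   # stand-in data
delta = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

for d in range(1, 6):
    model = MDS(n_components=d, dissimilarity="precomputed", random_state=0)
    model.fit(delta)
    print(d, model.stress_)
```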
Slide 42: Agglomerative Clustering of Proteins