1 Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU), Kamal Nigam (WhizBang! Labs), Lyle Ungar (UPenn)

2 Large Clustering Problems
–Many examples
–Many clusters
–Many dimensions
Example domains: text, images, protein structure

3 The Citation Clustering Data
–Over 1,000,000 citations
–About 100,000 unique papers
–About 100,000 unique vocabulary words
–Over 1 trillion distance calculations

4 Reduce the Number of Distance Calculations
[Bradley, Fayyad, Reina KDD-98] –Sample to find initial starting points for k-means or EM
[Moore 98] –Use multi-resolution kd-trees to group similar data points
[Omohundro 89] –Ball trees

5 The Canopies Approach
Two distance metrics: cheap & expensive
First pass:
–very inexpensive distance metric
–create overlapping canopies
Second pass:
–expensive, accurate distance metric
–canopies determine which distances are calculated

6 Illustrating Canopies

7 Overlapping Canopies

8 Creating Canopies with Two Thresholds
Put all points in D
Loop:
–Pick a point X from D
–Put the points within K_loose of X into a canopy
–Remove the points within K_tight of X from D
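A minimal Python sketch of this loop, assuming a user-supplied cheap distance function and thresholds K_loose > K_tight (all names here are illustrative, not the authors' code):

```python
import random

def make_canopies(points, cheap_dist, k_loose, k_tight):
    """Create overlapping canopies using a cheap, approximate distance metric."""
    remaining = set(range(len(points)))   # the set D of candidate canopy centers
    canopies = []
    while remaining:
        center = random.choice(tuple(remaining))   # pick a point X from D
        # Every remaining point within the loose threshold joins this canopy.
        canopy = {i for i in remaining
                  if cheap_dist(points[center], points[i]) < k_loose}
        canopies.append(canopy)
        # Points within the tight threshold (including X itself) leave D,
        # so they can never become the center of another canopy.
        remaining -= {i for i in canopy
                      if cheap_dist(points[center], points[i]) < k_tight}
    return canopies
```

Points that fall between the tight and loose thresholds stay in D, which is how canopies come to overlap.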

9 Canopies
Two distance metrics:
–cheap and approximate
–expensive and accurate
Two-pass clustering:
–create overlapping canopies
–full clustering with limited distances
Canopy property:
–points in the same cluster will be in the same canopy

10 Using Canopies with GAC (Greedy Agglomerative Clustering)
–Calculate expensive distances only between points in the same canopy
–All other distances default to infinity
–Sort the finite distances and iteratively merge the closest pairs
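Continuing the sketch above, the second pass computes the expensive metric only for pairs that share a canopy; any pair missing from the resulting dictionary is treated as infinitely far apart (again, names are illustrative):

```python
from itertools import combinations

def canopy_distances(points, canopies, expensive_dist):
    """Compute the expensive metric only for pairs of points in a common canopy."""
    dists = {}
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in dists:            # a pair may share several canopies
                dists[(i, j)] = expensive_dist(points[i], points[j])
    return dists

# GAC then sorts dists.items() by distance and repeatedly merges the closest
# pair of clusters; pairs absent from dists are never candidates for merging.
```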

11 Computational Savings
–inexpensive metric << expensive metric
–number of canopies: c (large)
–canopies overlap: each point is in f canopies
–roughly f*n/c points per canopy
–O(f^2 * n^2 / c) expensive distance calculations
–complexity reduction: O(f^2 / c)
–n = 10^6; k = 10^4; c = 1,000; f small: computation reduced by a factor of 1,000
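Making the factor-of-1,000 claim explicit: each canopy holds roughly f*n/c points, so the number of expensive distance computations, compared with the all-pairs count, is

```latex
c \cdot \binom{fn/c}{2} \;\approx\; \frac{f^{2} n^{2}}{2c}
\quad\text{vs.}\quad
\binom{n}{2} \;\approx\; \frac{n^{2}}{2},
\qquad
\text{ratio} \;=\; \frac{f^{2}}{c} \;=\; \frac{1}{1000}
\quad\text{for } c = 10^{3},\ f \approx 1 .
```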

12 Experimental Results
[Chart comparing Complete GAC vs. Canopies GAC on running time in minutes and F1]

13 Preserving Good Clustering
–Small, disjoint canopies: big time savings
–Large, overlapping canopies: recover the original, accurate clustering
–Goal: fast and accurate, which requires a good, cheap distance metric

14 Reduced Dimension Representations

15 Clustering finds groups of similar objects
–Understanding clusters can be difficult
–Important to understand/interpret results
–Patterns waiting to be discovered

16 A picture is worth 1000 clusters

17 Feature Subset Selection
–Find the n features that work best for prediction
–Find n features such that distance on them best correlates with distance on all features
–Minimize:
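The objective on this slide was an image and is not in the transcript; one natural form consistent with the wording (make distances on the chosen subset S of n features match distances on all features) would be the following, which is an assumption rather than the slide's exact formula:

```latex
\min_{S,\;|S| = n} \;\sum_{i < j} \bigl( d_S(x_i, x_j) - d(x_i, x_j) \bigr)^{2}
```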

18 Feature Subset Selection
Suppose all features are relevant. Does that mean dimensionality can't be reduced? No!
The manifold the data lies on in feature space is what counts, not the relevance of individual features.
The manifold can have a lower dimension than the number of features.

19 PCA: Principal Component Analysis
Given data in d dimensions, compute:
–the d-dimensional mean vector M
–the d×d covariance matrix C
–its eigenvectors and eigenvalues
Then sort by eigenvalue, select the top k < d eigenvalues, and project the data onto the corresponding k eigenvectors.

20 PCA Mean vector M:
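The formula itself is not in the transcript; the standard definition for n data points x_1, ..., x_n is:

```latex
M = \frac{1}{n} \sum_{i=1}^{n} x_i
```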

21 PCA Covariance C:
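Again the formula is not in the transcript; the usual definition (some treatments divide by n − 1 instead of n) is:

```latex
C = \frac{1}{n} \sum_{i=1}^{n} (x_i - M)(x_i - M)^{\mathsf{T}}
```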

22 PCA
Eigenvectors –unit vectors in the directions of maximum variance
Eigenvalues –magnitude of the variance in the direction of each eigenvector

23 PCA
–Find the largest eigenvalues and their corresponding eigenvectors
–Project each point onto the k principal components via the d × k matrix A whose columns are those eigenvectors
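A compact numpy sketch of the whole procedure from slides 19–23 (illustrative only; function and variable names are mine, and np.cov uses the n − 1 normalization):

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (an n x d data matrix) onto the top-k principal components."""
    M = X.mean(axis=0)                      # d-dimensional mean vector
    Xc = X - M                              # centered data
    C = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # real eigenpairs, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    A = eigvecs[:, order]                   # d x k matrix of principal components
    return Xc @ A                           # n x k projected data

# Example: reduce 100 points in 50 dimensions down to 3 dimensions.
# Y = pca_project(np.random.randn(100, 50), k=3)
```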

24 PCA via Autoencoder ANN

25 Non-Linear PCA by Autoencoder

26 PCA needs a vector representation
–0-d fit: the sample mean
–1-d fit: y = mx + b
–2-d fit: y1 = mx + b; y2 = m′x + b′

27 MDS: Multidimensional Scaling
–PCA requires a vector representation
–What if we are only given pairwise distances between n points?
–Find coordinates for the points in d-dimensional space such that the distances are preserved "best"

30 MDS
Assign each point to coordinates x_i in d-dimensional space, initialized from:
–random coordinate values
–principal components
–the dimensions with greatest variance
Then do gradient descent on the coordinates x_i of each point until the distortion is minimized.

31 Distortion

32 Distortion

33 Distortion
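The three Distortion slides show formulas that are not in the transcript; a standard squared-error stress of the kind MDS minimizes (an assumption about what the slides contain, with d_ij the given pairwise distance and x_i the assigned coordinates) is:

```latex
\text{Distortion}(x_1, \ldots, x_n) \;=\; \sum_{i < j} \bigl( \lVert x_i - x_j \rVert - d_{ij} \bigr)^{2}
```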

34 Gradient Descent on Coordinates
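A small Python sketch of gradient descent on the coordinates (slides 30–34), using random initialization and the squared-error distortion above; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def mds_gradient_descent(D, dim=2, lr=0.001, iters=2000, seed=0):
    """Find dim-dimensional coordinates whose pairwise distances approximate D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, dim))           # random initial coordinates
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]    # x_i - x_j for every pair
        dist = np.linalg.norm(diff, axis=-1)    # current pairwise distances
        np.fill_diagonal(dist, 1.0)             # dummy value; diff is zero on the diagonal
        # d(distortion)/d(x_i) = sum_j 2 (dist_ij - D_ij) (x_i - x_j) / dist_ij
        grad = (2.0 * (dist - D) / dist)[:, :, None] * diff
        X -= lr * grad.sum(axis=1)              # move every point downhill
    return X
```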

35 Subjective Distances
Nations in the example: Brazil, USA, Egypt, Congo, Russia, France, Cuba, Yugoslavia, Israel, China

38 How Many Dimensions?
D too large:
–perfect fit, no distortion
–not easy to understand/visualize
D too small:
–poor fit, much distortion
–easy to visualize, but the pattern may be misleading
D just right?

42 Agglomerative Clustering of Proteins
