Lecture 4: Unsupervised Learning, Clustering & Dimensionality Reduction (273A Intro Machine Learning)



What is Unsupervised Learning? In supervised learning we were given attributes and targets (e.g. class labels). In unsupervised learning we are given only attributes; our task is to discover structure in the data. Example I: the data may be organized in clusters. Example II: the data may live on a lower-dimensional manifold. (The slide's figures illustrate both cases and ask: is this a good clustering?)

Why Discover Structure? Data compression: if you have a good model you can encode the data more cheaply. Example: to encode a 2-D dataset I have to encode the x and y position of each data-case. However, if the data lie approximately on a line, I could instead encode the offset and angle of the line plus each data-case's deviation from the line. Small numbers can be encoded more cheaply than large numbers at the same precision. This idea is the basis for model selection: the complexity of your model (e.g. the number of parameters) should be such that you can encode the data-set with the fewest bits (up to a certain precision). Homework: argue why a larger dataset will require a more complex model to achieve maximal compression.

Why Discover Structure? Often, the result of an unsupervised learning algorithm is a new representation of the same data. This new representation should be more meaningful and can be used for further processing (e.g. classification). Example I: clustering. The new representation is the label of the cluster to which each data-point belongs; this tells us how similar data-cases are. Example II: dimensionality reduction. Instead of a 100-dimensional vector of real numbers, each data-case is now represented by a 2-dimensional vector which can be drawn in the plane. The new representation is smaller and hence computationally more convenient. Example III: a text corpus has about 1M documents, each represented as a 20,000-dimensional count vector, one count per word in the vocabulary (the slide's figure shows counts such as "the: 5, a: 4, on: 7, ..."). Dimensionality reduction turns this into, say, a 50-dimensional vector per document. However: in the new representation, documents that are on the same topic but do not necessarily share keywords have moved closer together!

Clustering: K-means. We iterate two operations: 1. update the assignment of data-cases to clusters; 2. update the location of each cluster. Let $z_i = c$ denote the assignment of data-case $i$ to cluster $c$, let $\mu_c$ denote the position of cluster $c$ in $d$-dimensional space, and let $x_i$ denote the location of data-case $i$. Then iterate until convergence: 1. For each data-case, compute the distance to each cluster and pick the closest one: $z_i = \arg\min_c \|x_i - \mu_c\|^2$. 2. For each cluster, set its location to the mean of all data-cases assigned to it: $\mu_c = \frac{1}{N_c}\sum_{i \in S_c} x_i$, where $N_c$ is the number of data-cases in cluster $c$ and $S_c$ is the set of data-cases assigned to cluster $c$. (A minimal code sketch of these two steps follows below.)
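A minimal NumPy sketch of the two K-means steps, assuming the definitions above (the function and variable names x, mu, z, K are illustrative, not from the original slides):

```python
import numpy as np

def kmeans(x, K, n_iters=100, seed=0):
    """Basic K-means: x is an (N, d) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Relatively good initialization: place cluster locations on K randomly chosen data-cases.
    mu = x[rng.choice(len(x), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each data-case to its closest cluster location.
        dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = dists.argmin(axis=1)
        # Step 2: move each cluster location to the mean of its assigned data-cases.
        new_mu = np.array([x[z == c].mean(axis=0) if np.any(z == c) else mu[c]
                           for c in range(K)])
        if np.allclose(new_mu, mu):   # stop once the locations no longer move
            break
        mu = new_mu
    return z, mu
```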

K-means cost function: $C = \sum_{i=1}^N \|x_i - \mu_{z_i}\|^2$. Each step of k-means decreases this cost function. Initialization is often very important, since C has very many local minima. A relatively good initialization: place the cluster locations on K randomly chosen data-cases. How to choose K? Add a complexity term that grows with K (e.g. $C + \lambda K$) and minimize over K as well, or use cross-validation, or Bayesian methods. Homework: derive the k-means algorithm by showing that step 1 minimizes C over z, keeping the cluster locations fixed, and step 2 minimizes C over the cluster locations, keeping z fixed.
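A small sketch of how one might evaluate this cost and pick K with a complexity penalty, reusing the kmeans() sketch above (the penalty weight lam and its linear form are illustrative assumptions, not values from the slides):

```python
import numpy as np

def kmeans_cost(x, mu, z):
    """C = sum_i ||x_i - mu_{z_i}||^2."""
    return float(((x - mu[z]) ** 2).sum())

def choose_K(x, K_max=10, lam=1.0):
    """Pick K by minimizing cost + lam * K (one simple, assumed penalty form)."""
    best_K, best_score = None, np.inf
    for K in range(1, K_max + 1):
        z, mu = kmeans(x, K)                        # kmeans() as sketched above
        score = kmeans_cost(x, mu, z) + lam * K     # data fit plus complexity term
        if score < best_score:
            best_K, best_score = K, score
    return best_K
```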

Vector Quantization. K-means divides the space into a Voronoi tessellation. Every point on a tile is summarized by that tile's code-book vector (shown as "+" in the figure). This clearly allows for data compression!
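A brief sketch of using a learned code-book for compression: store only each data-case's cluster index and reconstruct it as its code-book vector (mu here is assumed to be the cluster locations returned by the kmeans() sketch above):

```python
import numpy as np

def vq_encode(x, mu):
    """Map each data-case to the index of its nearest code-book vector."""
    dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)        # (N,) integer codes, cheap to store

def vq_decode(codes, mu):
    """Reconstruct each data-case as its code-book vector (lossy reconstruction)."""
    return mu[codes]
```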

Mixtures of Gaussians. K-means assigns each data-case to exactly one cluster. But what if clusters overlap? We may be uncertain which cluster a data-case really belongs to. The mixture-of-Gaussians algorithm assigns data-cases to clusters with a certain probability.

MoG Clustering. Each cluster is modelled by a Gaussian density; its covariance determines the shape of the density's contours (the ellipses in the figure). Idea: fit these Gaussian densities to the data, one per cluster.

EM Algorithm: E-step. The responsibility $r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i;\,\mu_c,\Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x_i;\,\mu_{c'},\Sigma_{c'})}$ is the probability that data-case $i$ belongs to cluster $c$; $\pi_c$ is the a priori probability of being assigned to cluster $c$. Note that if the Gaussian has high probability on data-case $i$ (i.e. the bell shape sits on top of the data-case) then it claims high responsibility for this data-case. The denominator just normalizes the responsibilities so that $\sum_c r_{ic} = 1$. Homework: imagine there are only two identical Gaussians and both have their means equal to $x_i$ (the location of data-case $i$). Compute the responsibilities for data-case $i$. What happens if one Gaussian has a much larger variance than the other?
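A minimal NumPy sketch of this E-step under the definitions above (scipy.stats.multivariate_normal is used for the Gaussian density; pi, mu, Sigma denote the current mixture parameters and are assumed names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, pi, mu, Sigma):
    """Compute responsibilities r[i, c] = p(cluster c | x_i) for an (N, d) data matrix x."""
    K = len(pi)
    # Unnormalized responsibility: prior weight times Gaussian density at each data-case.
    r = np.column_stack([pi[c] * multivariate_normal.pdf(x, mean=mu[c], cov=Sigma[c])
                         for c in range(K)])
    # Normalize each row so the responsibilities of a data-case sum to 1.
    return r / r.sum(axis=1, keepdims=True)
```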

EM Algorithm: M-step. Let $N_c = \sum_i r_{ic}$ be the total responsibility claimed by cluster $c$. Then $\pi_c = N_c / N$ is the expected fraction of data-cases assigned to this cluster; $\mu_c = \frac{1}{N_c}\sum_i r_{ic}\, x_i$ is the weighted sample mean, where every data-case is weighted according to the probability that it belongs to that cluster; and $\Sigma_c = \frac{1}{N_c}\sum_i r_{ic}\, (x_i - \mu_c)(x_i - \mu_c)^T$ is the weighted sample covariance. Homework: show that k-means is a special case of the E and M steps.
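A matching M-step sketch, using the same (assumed) conventions as e_step above:

```python
import numpy as np

def m_step(x, r):
    """Update (pi, mu, Sigma) from responsibilities r of shape (N, K)."""
    N, d = x.shape
    Nc = r.sum(axis=0)                     # total responsibility claimed by each cluster
    pi = Nc / N                            # expected fraction of data-cases per cluster
    mu = (r.T @ x) / Nc[:, None]           # weighted sample means, shape (K, d)
    Sigma = []
    for c in range(r.shape[1]):
        diff = x - mu[c]
        Sigma.append((r[:, c, None] * diff).T @ diff / Nc[c])   # weighted sample covariance
    return pi, mu, np.array(Sigma)
```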

EM-MoG. EM stands for "expectation maximization"; we won't go through the derivation. If we are forced to decide, we assign a data-case to the cluster that claims the highest responsibility. For a new data-case, we compute responsibilities as in the E-step and pick the cluster with the largest responsibility. The E and M steps are iterated until convergence (which is guaranteed). Every step increases the following objective function, the total log-probability of the data under the model we are learning: $L = \sum_{i=1}^N \log \sum_{c=1}^K \pi_c \, \mathcal{N}(x_i;\,\mu_c,\Sigma_c)$.
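Putting the pieces together, a hedged sketch of the full EM loop that monitors this log-likelihood (it reuses the e_step and m_step sketches above; the initialization and the small covariance ridge are added assumptions, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, pi, mu, Sigma):
    """L = sum_i log sum_c pi_c N(x_i; mu_c, Sigma_c)."""
    dens = np.column_stack([pi[c] * multivariate_normal.pdf(x, mean=mu[c], cov=Sigma[c])
                            for c in range(len(pi))])
    return float(np.log(dens.sum(axis=1)).sum())

def em_mog(x, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a mixture of K Gaussians with EM; returns parameters and responsibilities."""
    rng = np.random.default_rng(seed)
    N, d = x.shape
    mu = x[rng.choice(N, size=K, replace=False)].copy()    # means on random data-cases (assumed init)
    pi = np.full(K, 1.0 / K)
    Sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev = -np.inf
    for _ in range(n_iters):
        r = e_step(x, pi, mu, Sigma)            # E-step (sketched earlier)
        pi, mu, Sigma = m_step(x, r)            # M-step (sketched earlier)
        Sigma = Sigma + 1e-6 * np.eye(d)        # small ridge to keep covariances well-conditioned
        L = log_likelihood(x, pi, mu, Sigma)
        if L - prev < tol:                      # the log-likelihood never decreases
            break
        prev = L
    return pi, mu, Sigma, r
```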

Dimensionality Reduction. Instead of being organized in clusters, the data may lie approximately on a (perhaps curved) manifold. Most of the information in the data is retained if we project the data onto this low-dimensional manifold. Advantages: visualization, extracting meaningful attributes, computational efficiency.

Principal Components Analysis. We search for the directions in space that have the highest variance, and then project the data onto the subspace spanned by those directions. This structure is encoded in the sample covariance of the data: $C = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T$, where $\bar{x} = \frac{1}{N}\sum_i x_i$ is the sample mean.

PCA. We want to find the eigenvectors and eigenvalues of this covariance: $C = U \Lambda U^T$ (in matlab: [U,L] = eig(C)). Each eigenvalue is the variance of the data in the direction of its eigenvector, and the eigenvectors are orthogonal and of unit length.
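A minimal NumPy sketch of PCA via this eigendecomposition (the function and variable names are illustrative):

```python
import numpy as np

def pca(x, k):
    """Project an (N, d) data matrix onto its top-k principal components."""
    xbar = x.mean(axis=0)
    xc = x - xbar                          # center the data
    C = xc.T @ xc / len(x)                 # sample covariance, shape (d, d)
    lam, U = np.linalg.eigh(C)             # eigenvalues (ascending) and eigenvectors (columns)
    order = np.argsort(lam)[::-1][:k]      # keep the k largest-variance directions
    U_k, lam_k = U[:, order], lam[order]
    z = xc @ U_k                           # (N, k) projections onto the principal subspace
    return z, U_k, lam_k, xbar
```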

PCA properties. Check the following: $C = U \Lambda U^T$ (U contains the eigenvectors); $U^T U = I$ (the eigenvectors are orthonormal, so U is a rotation); $C_k = U_k \Lambda_k U_k^T$ (the rank-k approximation, keeping the k largest eigenvalues and their eigenvectors); $z = U_k^T (x - \bar{x})$ (the projection onto the first k principal components). Homework: what projection z has covariance C = I in k dimensions?

PCA properties. $C_k = U_k \Lambda_k U_k^T$ is the optimal rank-k approximation of C in Frobenius norm, i.e. it minimizes the cost function $\|C - C_k\|_F^2$. Note that if we write the approximation as a factorization $C_k = A A^T$, there are infinitely many solutions: if A is a solution, then A R is also a solution for any rotation R. The solution provided by PCA is unique because U is orthogonal and ordered by largest eigenvalue. The solution is also nested: if I solve for a rank-(k+1) approximation, the first k eigenvectors are exactly those found by the rank-k approximation (and so on).
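A short illustration of the rank-k projection and reconstruction, reusing the pca() sketch above (all names are carried over from that sketch and are assumptions, not the slides' notation):

```python
import numpy as np

def pca_reconstruct(z, U_k, xbar):
    """Map k-dimensional projections back into the original d-dimensional space."""
    return z @ U_k.T + xbar

# Usage sketch: project 100-dimensional data to 2 dimensions and back.
x = np.random.default_rng(0).normal(size=(500, 100))
z, U_k, lam_k, xbar = pca(x, k=2)
x_hat = pca_reconstruct(z, U_k, xbar)    # reconstruction from the 2 highest-variance directions
err = np.mean((x - x_hat) ** 2)          # error left in the discarded directions
```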

Homework: Imagine I have 1000 images of faces, each 20x20 pixels. Each pixel is an attribute $X_i$ that takes continuous values in the interval [0, 1]. Say I am interested in finding the four "eigen-faces" that span most of the variance in the data. Provide pseudo-code for finding these four eigen-faces.
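As a hedged starting point (one possible answer, not the official solution), the pca() sketch above can be applied to the flattened images; the placeholder data and names below are assumptions:

```python
import numpy as np

# faces: assumed (1000, 20, 20) array of pixel intensities in [0, 1]; random placeholder here.
faces = np.random.default_rng(0).uniform(size=(1000, 20, 20))
X = faces.reshape(1000, 400)             # each image becomes a 400-dimensional data-case
z, U_k, lam_k, xbar = pca(X, k=4)        # top-4 principal directions of the face data
eigenfaces = U_k.T.reshape(4, 20, 20)    # each eigenvector reshaped back into a 20x20 image
```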