Distance metric learning, with application to clustering with side-information. Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, University of California, Berkeley.

Presentation transcript:

Distance metric learning, with application to clustering with side-information. Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and Stuart Russell, University of California, Berkeley. Neural Information Processing Systems (NIPS) 2002.

2004/9 2 Abstract Many algorithms rely critically on being given a good metric over the input space. – We wish to provide a more systematic way for users to indicate what they consider "similar". – If a clustering algorithm fails to find a clustering that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships.

2004/9 3 Introduction Many learning and data-mining algorithms – k-means, nearest-neighbors, SVMs, and so on – depend on being given a good metric that reflects reasonably well the important relationships in the data; the problem is particularly acute in unsupervised settings such as clustering. One important family of algorithms that learn metrics are the unsupervised ones that take an input dataset and find an embedding of it in some space, such as: – Multidimensional Scaling (MDS) – Locally Linear Embedding (LLE) Like Principal Components Analysis (PCA), these methods suffer from a "no right answer" problem.

2004/9 4 Introduction In clustering with similarity information, certain pairs are specified as "similar" or "dissimilar", and the algorithm searches for a clustering that puts the similar pairs into the same cluster and the dissimilar pairs into different clusters. This gives a way of using similarity side-information to find clusters that reflect a user's notion of meaningful clusters.

2004/9 5 Learning Distance Metrics Suppose we have some set of points $\{x_i\}_{i=1}^m \subseteq \mathbb{R}^n$, and are given information that certain pairs of them are "similar": $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$. How can we learn a distance metric $d(x, y)$ between points $x$ and $y$ that respects this, specifically so that "similar" points end up close to each other? Consider learning a metric of the form $d(x, y) = d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^\top A\,(x - y)}$.

2004/9 6 Learning Distance Metrics For $d_A$ to be a metric – satisfying non-negativity and the triangle inequality – we require that $A$ be positive semi-definite, $A \succeq 0$. – Setting $A = I$ gives the Euclidean distance. – If we restrict $A$ to be diagonal, this corresponds to learning a metric in which the different axes are given different "weights". – More generally, $A$ parameterizes a family of Mahalanobis distances over $\mathbb{R}^n$. – Learning such a distance metric is also equivalent to finding a rescaling of the data that replaces each point $x$ with $A^{1/2}x$ and applying the standard Euclidean metric to the rescaled data.
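As a concrete illustration of this parameterization and of the rescaling equivalence, here is a minimal numpy sketch (the matrix A and the two points are made-up values, not taken from the paper):

```python
import numpy as np

def mahalanobis_dist(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a symmetric PSD matrix A."""
    d = x - y
    return np.sqrt(d @ A @ d)

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # an arbitrary symmetric PSD matrix

# Matrix square root via eigendecomposition (valid because A is symmetric PSD).
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T

print(mahalanobis_dist(x, y, A))                # ||x - y||_A
print(np.linalg.norm(A_half @ x - A_half @ y))  # same value: Euclidean distance after rescaling
```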

2004/9 7 Learning Distance Metrics A simple way of defining a criterion for the desired metric is to demand that pairs of points $(x_i, x_j)$ in $S$ have, say, small squared distance between them: minimize $\sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$. This is trivially solved by $A = 0$, which is not useful, so we add the constraint $\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$. – This ensures that $A$ does not collapse the dataset into a single point. – Here, $D$ can be a set of pairs of points known to be "dissimilar", e.g., all pairs not in $S$.

2004/9 8 Learning Distance Metrics This gives the optimization problem: $\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2$ subject to $\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$ and $A \succeq 0$. The problem is convex, which enables us to derive an efficient, local-minima-free algorithm to solve it.
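The following slides derive the authors' own algorithms; purely to illustrate that this is an ordinary convex program, the sketch below hands the same objective and constraints to a generic solver. It assumes cvxpy is installed; the function name and the representation of S and D as index pairs are illustrative, not from the paper:

```python
import numpy as np
import cvxpy as cp

def learn_full_metric_cvx(X, S, D):
    """Solve  min_A  sum_{(i,j) in S} ||x_i - x_j||_A^2
       s.t.   sum_{(i,j) in D} ||x_i - x_j||_A >= 1,  A PSD.
    X is an (m, n) data matrix; S and D are lists of (i, j) index pairs."""
    n = X.shape[1]
    A = cp.Variable((n, n), PSD=True)

    def quad(i, j):
        # (x_i - x_j)^T A (x_i - x_j), affine in A
        d = X[i] - X[j]
        return cp.trace(A @ np.outer(d, d))

    objective = cp.Minimize(sum(quad(i, j) for i, j in S))
    # sqrt of an affine expression is concave, so ">= 1" keeps the problem convex
    constraints = [sum(cp.sqrt(quad(i, j)) for i, j in D) >= 1]
    cp.Problem(objective, constraints).solve()
    return A.value
```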

2004/9 9–11 Note. Newton-Raphson method (derivation equations shown as slide images).

2004/9 12 Learning Distance Metrics The case of diagonal $A$ – In the case that we want to learn a diagonal $A = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{nn})$, we can derive an efficient algorithm using the Newton-Raphson method to optimize $g(A) = \sum_{(x_i,x_j)\in S} \|x_i - x_j\|_A^2 - \log\big(\sum_{(x_i,x_j)\in D} \|x_i - x_j\|_A\big)$. – Each Newton update is $A := A - \alpha\,(\nabla^2 g)^{-1}\nabla g$, where $\alpha$ is a step-size parameter optimized via a line-search to give the largest downhill step subject to $A_{ii} \ge 0$.
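A rough sketch of the diagonal case is given below. The objective g is as above, but instead of hand-coding the Newton-Raphson update with a line search, the sketch substitutes scipy's bound-constrained L-BFGS-B solver; the helper name and data layout are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def learn_diagonal_metric(X, S, D):
    """Diagonal-A case: minimize
        g(a) = sum_{(i,j) in S} ||x_i - x_j||_a^2 - log( sum_{(i,j) in D} ||x_i - x_j||_a )
    over a = diag(A) with a >= 0 elementwise."""
    dS = np.array([X[i] - X[j] for i, j in S])  # (|S|, n) difference vectors
    dD = np.array([X[i] - X[j] for i, j in D])  # (|D|, n) difference vectors

    def g(a):
        sim = np.sum((dS ** 2) @ a)           # sum of squared distances over S
        dis = np.sum(np.sqrt((dD ** 2) @ a))  # sum of distances over D
        return sim - np.log(dis + 1e-12)

    a0 = np.ones(X.shape[1])
    res = minimize(g, a0, method="L-BFGS-B", bounds=[(0.0, None)] * X.shape[1])
    return np.diag(res.x)
```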

2004/9 13 Learning Distance Metrics The case of full $A$ – Newton's method often becomes prohibitively expensive, requiring $O(n^6)$ time to invert the Hessian over the $n^2$ parameters of $A$. – Instead, the problem is solved by gradient ascent on an equivalent formulation, combined with iterative projections onto the constraint sets.
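A rough sketch of that gradient-ascent / iterative-projection idea for the full-A case follows. It assumes the paper's equivalent formulation (maximize the sum of dissimilar-pair distances subject to the similar-pair constraint and A PSD); the learning rate and iteration counts are placeholders, and this is not a faithful reproduction of the authors' exact algorithm:

```python
import numpy as np

def learn_full_metric_projgrad(X, S, D, lr=0.1, iters=200):
    """Maximize sum_{(i,j) in D} ||x_i - x_j||_A subject to
    sum_{(i,j) in S} ||x_i - x_j||_A^2 <= 1 and A PSD (rough sketch)."""
    n = X.shape[1]
    A = np.eye(n)
    XS = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in S)  # sum over S of d d^T

    for _ in range(iters):
        # Gradient of sum_D sqrt(d^T A d) with respect to A.
        G = np.zeros((n, n))
        for i, j in D:
            d = X[i] - X[j]
            G += np.outer(d, d) / (2.0 * np.sqrt(d @ A @ d) + 1e-12)
        A = A + lr * G

        # Alternate projections onto the two constraint sets.
        for _ in range(20):
            # Half-space {A : <A, XS> <= 1} (the similar-pair constraint).
            v = np.sum(A * XS)
            if v > 1.0:
                A = A - (v - 1.0) / np.sum(XS * XS) * XS
            # PSD cone: clip negative eigenvalues of the symmetrized matrix.
            w, V = np.linalg.eigh((A + A.T) / 2.0)
            A = (V * np.clip(w, 0.0, None)) @ V.T
    return A
```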

2004/9 14 Learning Distance Metrics

2004/9 15 Experiments and Examples In the experiments with synthetic data, $S$ was a randomly sampled 1% of all pairs of similar points. We can use the fact discussed earlier that learning $\|\cdot\|_A$ is equivalent to finding a rescaling of the data, $x \mapsto A^{1/2}x$.
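A brief sketch of that rescaling in code, assuming scikit-learn's k-means is used for the clustering step (the helper name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_with_learned_metric(X, A, k):
    """Rescale the data by A^{1/2}, so that Euclidean distance in the new space
    equals ||.||_A in the original space, then run ordinary k-means."""
    w, V = np.linalg.eigh(A)
    A_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return KMeans(n_clusters=k, n_init=10).fit_predict(X @ A_half.T)
```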

2004/9 16 Experiments and Examples - Examples of learned distance metrics

2004/9 17 Experiments and Examples - Examples of learned distance metrics

2004/9 18 Experiments and Examples - Application to clustering Clustering with side information: – We are given $S$, and told that each pair $(x_i, x_j) \in S$ means $x_i$ and $x_j$ belong to the same cluster. – Let $\hat{c}_i$ be the cluster to which point $x_i$ is assigned by an automatic clustering algorithm, and let $c_i$ be some "correct" or desired clustering of the data. – Clustering quality is measured by the probability that a randomly drawn pair of points is treated consistently by the two clusterings: $\text{Accuracy} = \sum_{i>j} \mathbf{1}\{\mathbf{1}\{\hat{c}_i=\hat{c}_j\} = \mathbf{1}\{c_i=c_j\}\} \big/ \big(0.5\,m(m-1)\big)$.
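A minimal sketch of that pair-counting accuracy measure (the function name is chosen here for illustration):

```python
import numpy as np

def pair_accuracy(c_hat, c):
    """Fraction of point pairs (i, j), i > j, on which the algorithm's clustering
    c_hat and the desired clustering c agree about whether the two points belong
    to the same cluster."""
    c_hat, c = np.asarray(c_hat), np.asarray(c)
    m = len(c)
    agree = 0
    for i in range(m):
        for j in range(i):
            agree += int((c_hat[i] == c_hat[j]) == (c[i] == c[j]))
    return agree / (m * (m - 1) / 2)
```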

2004/9 19 Experiments and Examples - Application to clustering

2004/9 20 Experiments and Examples - Application to clustering

2004/9 21 Experiments and Examples - Application to clustering

2004/9 22 Experiments and Examples - Application to clustering

2004/9 23 Conclusions We have presented an algorithm that, given examples of similar pairs of points in $\mathbb{R}^n$, learns a distance metric over $\mathbb{R}^n$ that respects these relationships. The method is based on posing metric learning as a convex optimization problem, which allowed us to derive efficient, local-optima-free algorithms. We also showed examples of diagonal and full metrics learned from simple artificial examples, and demonstrated on artificial and UCI datasets how our methods can be used to improve clustering performance.