A Survey on Distance Metric Learning (Part 1)
Gerry Tesauro, IBM T.J. Watson Research Center

Acknowledgement
Lecture material shamelessly adapted/stolen from the following sources:
– Kilian Weinberger:
    “Survey on Distance Metric Learning” slides
    IBM summer intern talk slides (Aug. 2006)
– Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”)
– Yann LeCun talk slides (NIPS 2006 workshop on “Learning to Compare Examples”)

Outline
– Motivation and Basic Concepts
– ML tasks where it’s useful to learn dist. metric
– Overview of Dimensionality Reduction
– Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.)
– Pseudo-metric online learning (Shalev-Shwartz et al.)
– Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)
– Metric Learning for Kernel Regression (Weinberger & Tesauro)
– Metric learning for RL basis function construction (Keller et al.)
– Similarity learning for image processing (LeCun et al.)
(The outline is split across Part 1 and Part 2.)

Motivation
Many ML algorithms and tasks require a distance metric (equivalently, “dissimilarity” metric)
– Clustering (e.g. k-means)
– Classification & regression:
    Kernel methods
    Nearest neighbor methods
– Document/text retrieval
    Find most similar fingerprints in DB to given sample
    Find most similar web pages to document/keywords
– Nonlinear dimensionality reduction methods:
    Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, etc.

Motivation (2)
Many problems may lack a well-defined, relevant distance metric
– Incommensurate features ⇒ Euclidean distance not meaningful
– Side information ⇒ Euclidean distance not relevant
– Learning distance metrics may thus be desirable
A sensible similarity/distance metric may be highly task-dependent or semantic-dependent
– What do these data points “mean”?
– What are we using the data for?

Which images are most similar?

It depends on what you are looking for.
[Figure: the same images grouped by different criteria: centered / left / right; male / female; student / professor; nature background / plain background.]

Key DML Concept: Mahalanobis distance metric
The simplest mapping is a linear transformation, $x \mapsto Lx$.
The induced squared distance is $D_{ij}^2 = \|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j)$ with $M = L^T L$, which is PSD.
Algorithms can learn both matrices ($L$ or $M$).
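The relation between the linear transformation and the Mahalanobis matrix can be checked numerically. The following is a minimal sketch, assuming the linear map is a matrix called `L` and the Mahalanobis matrix is `M = L^T L`; the helper name `mahalanobis_sq` and the example values are illustrative and do not come from the slides.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ M @ diff)

# Hypothetical linear map, chosen only for illustration.
L = np.array([[2.0, 0.0],
              [1.0, 1.0]])
M = L.T @ L  # positive semi-definite by construction

x_i = np.array([1.0, 2.0])
x_j = np.array([0.0, 0.0])

# Distance measured with M in the original space ...
d_metric = mahalanobis_sq(x_i, x_j, M)
# ... equals the squared Euclidean distance after mapping both points with L.
d_mapped = float(np.sum((L @ x_i - L @ x_j) ** 2))

print(d_metric, d_mapped)
```

Learning `L` directly keeps `M` PSD automatically, while learning `M` needs an explicit PSD constraint; that trade-off is one way to read the remark that algorithms can learn both matrices.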

>5 Minutes Introduction to Dimensionality Reduction

How can the dimensionality be reduced?
– eliminate redundant features
– eliminate irrelevant features
– extract low dimensional structure

Notation
Input: high-dimensional data points.
Output: low-dimensional points of dimension r.
Embedding principle: nearby points remain nearby, distant points remain distant.
Estimate r.

Two classes of DR algorithms: Linear and Non-Linear

Linear dimensionality reduction

Principal Component Analysis (Jolliffe 1986) Project data into subspace of maximum variance.

Optimization

Covariance matrix Eigenvalue solution:

Facts about PCA
– Eigenvectors of covariance matrix C
– Minimizes sum-of-squares reconstruction error
– Dimensionality r can be estimated from eigenvalues of C
– PCA requires meaningful scaling of input features
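The PCA facts above can be illustrated with a short NumPy sketch. It assumes the data are the rows of a matrix `X` and uses the eigen-decomposition of the covariance matrix; the function name `pca` and the synthetic data are illustrative and not taken from the slides.

```python
import numpy as np

def pca(X, r):
    """Project the rows of X onto the r leading eigenvectors of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    C = X_centered.T @ X_centered / X.shape[0]   # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # re-sort: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Y = X_centered @ eigvecs[:, :r]              # coordinates in the r-dim subspace
    return Y, eigvals, eigvecs

# Synthetic 3-D data that is essentially one-dimensional (assumed example).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2.0 * t + 0.01 * rng.normal(size=200),
                     0.01 * rng.normal(size=200)])

Y, eigvals, eigvecs = pca(X, r=1)
explained = eigvals / eigvals.sum()   # fraction of variance per component
print(explained)
```

The eigenvalue spectrum is what lets r be estimated: here almost all variance falls on the first component. Rescaling one input column would change the spectrum, which is why PCA needs meaningfully scaled features.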

Multidimensional Scaling (MDS)
[Table: pairwise distances in miles between NY, LA, Phoenix, and Chicago]

inner product matrix

Multidimensional Scaling (MDS)
– equivalent to PCA
– use eigenvectors of inner-product matrix
– requires only pairwise distances
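A minimal sketch of classical MDS shows how an embedding can be recovered from pairwise distances alone. It assumes the standard double-centering construction of the inner-product matrix; the function name `classical_mds` and the four synthetic points are illustrative (they are not the city distances from the slide).

```python
import numpy as np

def classical_mds(D, r):
    """Embed n points in r dimensions from an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:r]        # r largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Synthetic planar points; only their distance matrix is handed to MDS.
P = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

Y = classical_mds(D, r=2)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.abs(D - D_rec).max())
```

The recovered coordinates match the originals only up to rotation, reflection, and translation, so the check compares distance matrices rather than coordinates.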

Non-linear dimensionality reduction

From subspace to submanifold
We assume the data is sampled from some manifold with fewer degrees of freedom than the input dimension. How can we find a faithful embedding?

Approximate manifold with neighborhood graph

Isomap (Tenenbaum et al. 2000)
– Compute shortest path between all inputs
– Create geodesic distance matrix
– Perform MDS with geodesic distances
[Figure label: geodesic distance]
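The three Isomap steps can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the neighborhood graph is a symmetric k-nearest-neighbor graph, shortest paths use a dense Floyd-Warshall loop (fine for small n only), and the names `isomap` and `n_neighbors` are chosen here.

```python
import numpy as np

def isomap(X, n_neighbors, r):
    """Isomap sketch: kNN graph -> geodesic distances -> classical MDS."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # 1. Neighborhood graph: keep edges to the n_neighbors nearest points.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, nearest[i]] = D[i, nearest[i]]
    G = np.minimum(G, G.T)                       # make the graph undirected

    # 2. Geodesic distances: all-pairs shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # 3. Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:r]
    Y = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))
    return Y, G

# Points on a half circle: a 1-D manifold embedded in 2-D (assumed example).
theta = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(theta), np.sin(theta)])

Y, geodesic = isomap(X, n_neighbors=2, r=1)
corr = np.corrcoef(Y[:, 0], theta)[0, 1]
print(geodesic[0, -1], corr)
```

The two end points are 2.0 apart in a straight line, but their geodesic distance approaches the arc length, so the 1-D embedding orders the points along the curve.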

Locally Linear Embedding (LLE) (Roweis and Saul 2000)

Maximum Variance Unfolding (MVU) (Weinberger and Saul 2004)
– Maximize pairwise distances
– Preserve local distances and angles
– “Unfolding” by semidefinite programming

Optimization problem
– Unfold data by maximizing pairwise distances
– Preserve local distances
– Center output (translation invariance)
Problem: optimization is non-convex, with multiple local minima
Solution: change of notation gives a single global minimum
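The slide's formulas did not survive extraction. In the MVU literature the change of notation is usually the switch from the output points to their Gram matrix of inner products, and the sketch below assumes that reading. It checks the two identities that make the rewritten problem linear in the Gram matrix; the names `Y`, `K`, and `N` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
Y = rng.normal(size=(N, 2))
Y -= Y.mean(axis=0)                 # centering constraint: sum_i y_i = 0

K = Y @ Y.T                         # Gram matrix K_ij = y_i . y_j (PSD by construction)

# Pairwise squared distances computed directly from the points ...
sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
# ... and from the Gram matrix alone: K_ii - 2 K_ij + K_jj.
diag = np.diag(K)
sq_dists_from_K = diag[:, None] - 2.0 * K + diag[None, :]

# With centered outputs, the total pairwise spread is proportional to trace(K).
total_spread = sq_dists.sum()
trace_form = 2.0 * N * np.trace(K)
print(total_spread, trace_form)
```

Under this assumption, both the objective (a trace) and the local distance constraints are linear in `K`, and requiring `K` to be PSD gives a semidefinite program, which is consistent with the single global minimum the slide mentions.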

Unfolding the swiss-roll

Mahalanobis Metric Learning for Clustering with Side Information (Xing et al. 2003)
Exemplars $\{x_i,\ i = 1, \dots, N\}$ plus two types of side info:
– “Similar” set $S = \{(x_i, x_j)\}$ s.t. $x_i$ and $x_j$ are “similar” (e.g. same class)
– “Dissimilar” set $D = \{(x_i, x_j)\}$ s.t. $x_i$ and $x_j$ are “dissimilar”
Learn optimal Mahalanobis matrix $M$:
$D_{ij}^2 = (x_i - x_j)^T M (x_i - x_j)$ (global dist. fn.)
Goal: keep all pairs of “similar” points close, while separating all “dissimilar” pairs.
Formulate as a constrained convex programming problem
– minimize the distance between the data pairs in S
– subject to the data pairs in D being well separated

MMC-SI (Cont’d)
Objective of learning:
M is positive semi-definite
– Ensures non-negativity and triangle inequality of the metric
The number of parameters is quadratic in the number of features
– Difficult to scale to a large number of features
– Significant danger of overfitting small datasets

Mahalanobis Metric for Clustering (MMC-SI) Xing et al., NIPS 2002

MMC-SI: move similarly labeled inputs together.

MMC-SI: move differently labeled inputs apart.

Convex optimization problem
– target: Mahalanobis matrix
– pushing differently labeled inputs apart
– pulling similar points together
– ensuring positive semi-definiteness
Two convex sets:
– Set of all matrices that satisfy constraint 1
– Cone of PSD matrices
Convex objective, convex constraints: the problem is CONVEX.

Gradient Alternating Projection
– Take step along gradient.
– Project onto constraint-satisfying subspace.
– Project onto PSD cone.
– REPEAT
Algorithm is guaranteed to converge to optimal solution.
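The loop of gradient step and alternating projections can be sketched as follows. The slide's formulas were lost, so this sketch assumes one common form of the problem: maximize the summed distances over dissimilar pairs, subject to the summed squared distances over similar pairs being at most 1 and M being PSD. All function names, step sizes, and the toy data are illustrative, and the convergence guarantee on the slide refers to the authors' algorithm, not to this simplified version.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    M = (M + M.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(M)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def project_similar(M, S_mat):
    """Project onto {M : <M, S_mat> <= 1}, i.e. similar pairs stay close."""
    violation = np.sum(M * S_mat) - 1.0
    if violation > 0.0:
        M = M - (violation / np.sum(S_mat * S_mat)) * S_mat
    return M

def learn_metric(X, similar, dissimilar, step=0.01, n_iter=100, n_proj=50):
    d = X.shape[1]
    # <M, S_mat> equals the sum of squared M-distances over similar pairs.
    S_mat = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in similar)
    M = np.eye(d)
    for _ in range(n_iter):
        # Gradient of sum over dissimilar pairs of sqrt(diff^T M diff).
        grad = np.zeros((d, d))
        for i, j in dissimilar:
            diff = X[i] - X[j]
            dist = np.sqrt(max(float(diff @ M @ diff), 1e-12))
            grad += np.outer(diff, diff) / (2.0 * dist)
        M = M + step * grad                      # take step along gradient
        for _ in range(n_proj):                  # alternate the two projections
            M = project_similar(M, S_mat)
            M = project_psd(M)
    return M, S_mat

# Toy data (assumed): two classes separated along the first axis.
offsets = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.0], [-1.0, 0.0]])
X = np.vstack([offsets, offsets + np.array([4.0, 0.0])])
labels = np.array([0] * 4 + [1] * 4)
pairs = [(i, j) for i in range(8) for j in range(i + 1, 8)]
similar = [(i, j) for i, j in pairs if labels[i] == labels[j]]
dissimilar = [(i, j) for i, j in pairs if labels[i] != labels[j]]

M, S_mat = learn_metric(X, similar, dissimilar)
print(np.round(M, 4))
```

On this toy set the learned matrix puts nearly all of its weight on the first axis, the direction that separates the classes. Alternating projections can converge slowly when the similar-pair scatter is badly conditioned, so the fixed `n_proj` here is a simplification.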

Mahalanobis Metric Learning: Example I
– Keep all the data points within the same classes close
– Separate all the data points from different classes
[Figure: (a) data distribution of the original dataset; (b) data scaled by the global metric]

Mahalanobis Metric Learning: Example II
Diagonal distance metric M can simplify computation, but could lead to disastrous results
[Figure: (a) original data; (b) rescaling by learned full M; (c) rescaling by learned diagonal M]

Summary of Xing et al 2002
– Learns Mahalanobis metric
– Well suited for clustering
– Can be kernelized
– Optimization problem is convex
– Algorithm is guaranteed to converge
– Assumes data to be unimodal

POLA (Pseudo-metric online learning algorithm)
Shalev-Shwartz et al, ICML 2004
– This time the inputs are accessed two at a time.
– Differently labeled inputs are separated.
– Similarly labeled inputs are moved closer.
[Figure label: margin]

Convex optimization
At each time t, we get two inputs, which give rise to two constraints (Constraint 1 and Constraint 2).
Both are convex!

Alternating Projection
– Initialize inside PSD cone
– Project onto constraint-satisfying hyperplane and back
– Repeat with new constraints
If solution exists, algorithm converges inside intersection.
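One online step of this scheme can be sketched as below. The slide's constraint formulas were lost, so the sketch assumes a hinge-style loss with a unit margin around a learned threshold `b`, a closed-form projection onto the half-space where the current pair has zero loss, and a projection back onto PSD matrices with `b >= 1`. These choices follow a common reading of the POLA paper and should be treated as assumptions; the function name `pola_update` and the toy pairs are illustrative.

```python
import numpy as np

def pola_update(A, b, x1, x2, y):
    """One POLA-style update; y = +1 for a similar pair, -1 for a dissimilar pair."""
    v = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    dist = float(v @ A @ v)                       # squared pseudo-distance under A
    loss = max(0.0, y * (dist - b) + 1.0)         # zero when the margin is respected
    if loss > 0.0:
        # Projection onto the half-space where this pair has zero loss.
        alpha = loss / (float(v @ v) ** 2 + 1.0)
        A = A - y * alpha * np.outer(v, v)
        b = b + y * alpha
        # Projection back onto the PSD cone (and the assumed bound b >= 1).
        eigvals, eigvecs = np.linalg.eigh((A + A.T) / 2.0)
        A = (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T
        b = max(b, 1.0)
    return A, b, loss

# Single update on one dissimilar pair, starting from A = 0 and b = 1.
A0, b0 = np.zeros((2, 2)), 1.0
A1, b1, loss_before = pola_update(A0, b0, [0.0, 0.0], [2.0, 0.0], y=-1)
_, _, loss_after = pola_update(A1, b1, [0.0, 0.0], [2.0, 0.0], y=-1)
print(loss_before, loss_after)

# Several passes over a small stream of labeled pairs (assumed toy data).
stream = [([0.0, 0.0], [2.0, 0.0], -1),
          ([0.0, 0.0], [0.0, 1.0], +1),
          ([2.0, 0.0], [2.0, 1.0], +1),
          ([0.0, 1.0], [2.0, 1.0], -1)]
A, b = A0, b0
for _ in range(20):
    for x1, x2, y in stream:
        A, b, loss = pola_update(A, b, x1, x2, y)
```

The first update already moves the dissimilar pair most of the way past the margin; the second projection can reintroduce a small loss, which is why the constraints are revisited as new pairs arrive.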

Theoretical Guarantees
Provided global solution exists:
– Batch version converges after a finite number of passes over the data.
– Online version has an upper bound on accumulated violation of threshold.

Summary of POLA
– Learns Mahalanobis metric
– Online algorithm
– Can also be kernelized
– Introduces a margin
– Algorithm converges if solution exists
– Assumes data to be unimodal

Neighbourhood Components Analysis (Goldberger et al. 2004)
Distance metric for visualization and kNN