Distance Metric Learning: A Comprehensive Survey

Distance Metric Learning: A Comprehensive Survey Liu Yang Advisor: Rong Jin May 8th, 2006

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning Conclusions

Introduction Definition: Distance metric learning is to learn a distance metric for the input space of the data, from a given collection of pairs of similar/dissimilar points, that preserves the distance relations among the training data pairs. Importance: many machine learning algorithms heavily rely on the distance metric for the input data patterns, e.g. kNN. A learned metric can significantly improve performance in classification, clustering and retrieval tasks, e.g. the kNN classifier, spectral clustering, and content-based image retrieval (CBIR).
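Many of the methods below learn a Mahalanobis metric; as a reference point, the distance parameterized by a positive semi-definite matrix A is

```latex
d_A(x_i, x_j) = \sqrt{(x_i - x_j)^\top A \, (x_i - x_j)}, \qquad A \succeq 0 .
```

Setting A to the identity recovers the Euclidean distance; learning A amounts to learning a linear transformation of the input space.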

Contributions of this Survey Review distance metric learning under different learning conditions: supervised vs. unsupervised learning; learning in a global sense vs. in a local sense; distance metrics based on a linear kernel vs. a nonlinear kernel. Discuss the central techniques of distance metric learning: k-nearest neighbors, dimension reduction, semidefinite programming, kernel learning, and large margin classification.

Taxonomy of the methods surveyed:
Supervised Distance Metric Learning: global methods (Global Distance Metric Learning by Convex Programming) and local methods (Local Adaptive Distance Metric Learning, Neighborhood Components Analysis, Relevant Component Analysis).
Unsupervised Distance Metric Learning: linear embedding (PCA, MDS) and nonlinear embedding (LLE, ISOMAP, Laplacian Eigenmaps).
Distance Metric Learning based on SVM: Large Margin Nearest Neighbor Based Distance Metric Learning; Cast Kernel Margin Maximization into a SDP problem.
Kernel Methods for Distance Metrics Learning: Kernel Alignment with SDP; Learning with Idealized Kernel.

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning

Supervised Global Distance Metric Learning (Xing et al. 2003) Goal: keep all the data points within the same classes close, while separating all the data points from different classes. Formulated as a constrained convex programming problem: minimize the distance between the data pairs in the similarity set S, subject to the data pairs in the dissimilarity set D being well separated.
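Written out, the convex program of Xing et al. (2003) is (S denotes the set of similar pairs, D the set of dissimilar pairs; the constraint uses the unsquared distance to avoid a trivial rank-one solution):

```latex
\min_{A \succeq 0} \;\; \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2
\qquad \text{s.t.} \qquad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1,
\qquad \|x_i - x_j\|_A = \sqrt{(x_i - x_j)^\top A (x_i - x_j)} .
```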

Global Distance Metric Learning (Cont'd) A is positive semi-definite, which ensures the non-negativity and the triangle inequality of the metric. The number of parameters is quadratic in the number of features, so the method is difficult to scale to a large number of features; restricting A (e.g. to a diagonal matrix) simplifies the computation.

Global Distance Metric Learning: Example I [Figure: (a) distribution of the original dataset; (b) data rescaled by the learned global metric.] The learned metric keeps all the data points within the same classes close while separating all the data points from different classes.

Global Distance Metric Learning: Example II [Figure: (a) original data; (b) data rescaled by the learned full A; (c) data rescaled by the learned diagonal A.] Restricting the distance metric A to be diagonal can simplify computation, but can lead to disastrous results.

Problems with Global Distance Metric Learning [Figure: (a) distribution of the original dataset; (b) data scaled by the global metric.] Multimodal data distributions prevent global distance metrics from simultaneously satisfying constraints on within-class compactness and between-class separability.

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning Conclusions

Supervised Local Distance Metric Learning: Local Adaptive Distance Metric Learning; Local Feature Relevance; Locally Adaptive Feature Relevance Analysis; Local Linear Discriminant Analysis; Neighborhood Components Analysis; Relevant Component Analysis.

Local Adaptive Distance Metric Learning K Nearest Neighbor Classifier

Local Adaptive Distance Metric Learning Assumption of kNN: Pr(y|x) is constant or smooth within the local neighborhood. However, this is not necessarily true near class boundaries or in the presence of irrelevant dimensions. Remedy: modify the local neighborhood with a distance metric that elongates the distance along the dimensions where the class labels change rapidly and squeezes the distance along the dimensions that are almost independent of the class labels (see the sketch below).
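A minimal sketch of this idea in Python, with hypothetical per-dimension weights w (in practice the weights would come from one of the relevance measures discussed on the following slides):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, w, k=5):
    """kNN with a per-dimension weighted Euclidean distance.

    Large w[i] elongates distances along dimension i (class labels change
    rapidly there); small w[i] squeezes them (dimension is nearly irrelevant).
    """
    d2 = ((X_train - x_query) ** 2 * w).sum(axis=1)   # squared weighted distances
    nn = np.argsort(d2)[:k]                           # k nearest neighbors
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote
```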

Local Feature Relevance [J. Friedman, 1994] The least-squares estimate for predicting f(x) is its expectation E[f(x)]. Conditioned on x_i = z, the least-squares estimate of f(x) becomes E[f(x) | x_i = z]. The improvement in prediction error obtained by knowing x_i = z measures how much the i-th input variable matters at that point; normalizing it across dimensions gives a measure of the relative influence of the i-th input variable on the variation of f(x) there.
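These quantities can be written out as follows (a reconstruction from the standard definitions, since the slide's own equations did not survive extraction):

```latex
\bar{f} = E[f(x)], \qquad \bar{f}_i(z) = E[f(x) \mid x_i = z],
\qquad
I_i^2(z) = \big(\bar{f} - \bar{f}_i(z)\big)^2,
\qquad
r_i(z) = \frac{I_i^2(z)}{\sum_{k=1}^{d} I_k^2(z)},
```

where I_i^2(z) is the improvement in squared prediction error from knowing x_i = z, and r_i(z) is the relative influence of the i-th input variable at that point.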

Locally Adaptive Feature Relevance Analysis [C. Domeniconi, 2002] Uses a Chi-squared distance analysis to compute a metric that produces a neighborhood in which the posterior probabilities are approximately constant, and which is highly adaptive to the query location. The Chi-squared distance between the true and the estimated posterior at the test point is used for feature relevance: it tells to what extent the i-th dimension can be relied on for predicting p(j|z).

Local Relevance Measure in the i-th Dimension The measure quantifies the distance between Pr(j|z) and the conditional expectation of Pr(j|x) given x_i = z_i, calculated for each point z in the neighborhood of the test point. The closer that conditional expectation is to Pr(j|z), the more information the i-th dimension provides for predicting Pr(j|z).

Locally Adaptive Feature Relevance Analysis A local relevance measure in dimension i is averaged over the points in the neighborhood of the test point; normalizing across dimensions yields the relative relevance of each dimension, which in turn defines the weights of the weighted distance used to retrieve neighbors.

Local Linear Discriminant Analysis [T. Hastie et al. 1996] LDA finds the principal eigenvectors of the matrix T = Sw^{-1} Sb (within-class scatter Sw, between-class scatter Sb) to keep patterns from the same class close while separating patterns from different classes. The LDA metric is formed by stacking the principal eigenvectors of T together.

Local Linear Discriminant Analysis (Cont'd) Local adaptation of the nearest-neighbor metric is needed. Initialize the metric as the identity matrix. Given a testing point, iterate the following two steps: (1) estimate Sb and Sw based on the local neighborhood of the testing point, measured by the current metric; (2) form a local metric that behaves like the LDA metric, where a small tuning parameter prevents neighborhoods from extending to infinity.

Local Linear Discriminant Analysis The local Sb captures the inconsistency of the class centroids. The estimated metric shrinks the neighborhood in directions in which the local class centroids differ, producing a neighborhood in which the class centroids coincide; that is, it shrinks neighborhoods in directions orthogonal to the local decision boundaries and elongates them parallel to the boundaries.

Neighborhood Components Analysis [J. Goldberger et al. 2005] NCA learns a Mahalanobis distance metric for the kNN classifier by maximizing the leave-one-out cross-validation accuracy. The probability of classifying a point correctly is a soft (weighted) count involving pairwise distances, and the objective is the expected number of correctly classified points. Drawbacks: overfitting and scalability problems, since the number of parameters is quadratic in the number of features.
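Written out (following Goldberger et al.), the soft neighbor assignments and the objective maximized over the linear transform A are

```latex
p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^2)}, \quad p_{ii} = 0,
\qquad
p_i = \sum_{j \in C_i} p_{ij},
\qquad
f(A) = \sum_i p_i ,
```

where C_i is the set of points sharing the label of x_i and f(A) is the expected number of correctly classified points under leave-one-out soft kNN.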

RCA [N. Shental et al. 2002] Constructs a Mahalanobis distance metric from a sum of in-chunklet covariance matrices. A chunklet is a subset of data points that share the same, but unknown, class label; chunklets sit between unlabeled data and fully labeled data. The sum of in-chunklet covariance matrices is computed for the p points in the k chunklets, and the corresponding whitening linear transformation is applied to the data.
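The in-chunklet covariance and the RCA transformation, written out:

```latex
\hat{C} = \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \hat{m}_j)(x_{ji} - \hat{m}_j)^\top ,
\qquad W = \hat{C}^{-1/2},
```

where x_{ji} is the i-th point of chunklet j and \hat{m}_j is that chunklet's mean; applying y = W x whitens the within-chunklet variability, i.e. uses \hat{C}^{-1} as the Mahalanobis matrix.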

Information Maximization under Chunklet Constraints [A. Bar-Hillel et al., 2003] Maximizes the mutual information I(X, Y) between the original data X and the transformed data Y, subject to within-chunklet compactness constraints.

[Figure] RCA algorithm applied to synthetic Gaussian data: (a) the fully labeled data set with 3 classes; (b) the same data unlabeled, where the class structure is less evident; (c) the set of chunklets; (d) the centered chunklets and their empirical covariance; (e) the RCA transformation applied to the (centered) chunklets; (f) the original data after applying the RCA transformation.

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning Conclusions

Unsupervised Distance Metric Learning Most dimension reduction approaches learn a distance metric without label information, e.g. PCA. I will present five methods for dimensionality reduction. They fit into a unified framework for dimension reduction, organized by locality and linearity: global and linear (PCA, MDS); global and nonlinear (ISOMAP); local and nonlinear (LLE, Laplacian Eigenmap). MDS and Isomap use Solution 1 of the unified framework, while LLE and Laplacian Eigenmap use Solution 2 (see the unified-framework slide below).

Dimensionality Reduction Algorithms PCA finds the subspace that best preserves the variance of the data. MDS finds the subspace that best preserves the interpoint distances. Isomap finds the subspace that best preserves the geodesic interpoint distances [Tenenbaum et al., 2000]. LLE finds the subspace that best preserves the local linear structure of the data [Roweis and Saul, 2000]. Laplacian Eigenmap finds the subspace that best preserves local neighborhood information in the adjacency graph [M. Belkin and P. Niyogi, 2003].

Multidimensional Scaling (MDS) MDS finds the rank-m projection that best preserves the inter-point distances given by the matrix D: it converts the distances to inner products, recovers X from them, and takes the rank-m projection Y closest to X. [Figure: given the distance matrix among cities, MDS produces a map of the cities.]
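A compact numpy sketch of classical MDS (double centering of squared distances followed by a rank-m eigendecomposition); the function and variable names are illustrative:

```python
import numpy as np

def classical_mds(D, m=2):
    """Classical MDS: embed points given only their pairwise distance matrix D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                         # distances -> inner products
    w, V = np.linalg.eigh(B)                            # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:m]                       # top-m eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # rank-m coordinates
```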

PCA (Principal Component Analysis) PCA finds the subspace that best preserves the data variance, i.e. the rank-m PCA projection of X onto its top m principal components. PCA vs. MDS: in the Euclidean case, MDS differs from PCA only in that it starts from the distance matrix D and recovers X from it.
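For comparison with the MDS sketch above, a minimal PCA projection in numpy (again illustrative):

```python
import numpy as np

def pca_project(X, m=2):
    """Project the rows of X onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = Xc.T @ Xc / (len(X) - 1)             # sample covariance
    w, V = np.linalg.eigh(C)                 # eigenvalues in ascending order
    W = V[:, np.argsort(w)[::-1][:m]]        # top-m eigenvectors
    return Xc @ W                            # rank-m projection
```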

Isometric Feature Mapping (ISOMAP) [Tenenbaum et al., 2000] Geodesic: the shortest curve on a manifold that connects two points on the manifold; e.g. on a sphere, geodesics are great circles. Geodesic distance: the length of the geodesic. Points that are far apart as measured by geodesic distance may appear close as measured by Euclidean distance. [Figure: two points A and B that are close in Euclidean distance but far apart along the manifold.]

ISOMAP Take a distance matrix as input; construct a weighted graph G based on neighborhood relations; estimate pairwise geodesic distances by "a sequence of short hops" on G; apply MDS to the geodesic distance matrix (a sketch follows below).
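A rough sketch of these steps using scipy and scikit-learn graph utilities (the choice of a k-nearest-neighbor graph and the values of k and m are illustrative, and the kNN graph is assumed to be connected):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, k=10, m=2):
    """Isomap: geodesic distances on a kNN graph followed by classical MDS."""
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")  # weighted kNN graph
    D = shortest_path(G, directed=False)                     # geodesic distance estimates
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H                              # classical MDS on geodesics
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```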

Locally Linear Embedding (LLE) [Roweis and Saul, 2000] LLE finds the subspace that best preserves the local linear structure of the data. Assumption: the manifold is locally "linear", so each sample in the input space is a linearly weighted average of its neighbors; a good projection should preserve this geometric locality property.

LLE W is a linear representation of every data point in terms of its neighbors; W is chosen by minimizing the reconstruction error. A neighborhood-preserving mapping Y is then calculated by minimizing the embedding reconstruction error; Y is given by the eigenvectors corresponding to the m lowest nonzero eigenvalues of the matrix (I - W)^T (I - W).
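The two objectives, written out:

```latex
\min_{W} \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2
\quad \text{s.t.} \quad \sum_j W_{ij} = 1,\; W_{ij} = 0 \text{ if } x_j \notin N(x_i);
\qquad
\min_{Y} \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2 = \operatorname{tr}\!\big(Y^\top (I - W)^\top (I - W)\, Y\big) .
```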

Laplacian Eigenmap [M. Belkin and P. Niyogi, 2003] Laplacian Eigenmap finds the subspace that best preserves local neighborhood information in the adjacency graph. Graph Laplacian: given a graph G with weight matrix W, D is the diagonal matrix with D_ii = sum_j W_ij, and L = D - W is the graph Laplacian. Detailed steps: construct the adjacency graph G; weight the edges (e.g. with the heat kernel W_ij = exp(-||x_i - x_j||^2 / t), or simply W_ij = 1 for connected pairs); solve the generalized eigen-decomposition L f = lambda D f; the embedding is given by the eigenvectors corresponding to the smallest m nonzero eigenvalues.
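A small dense-matrix sketch of these steps (the kNN graph, the heat-kernel weights, and the parameter values are illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, k=10, m=2, t=1.0):
    """Laplacian Eigenmap on a symmetrized kNN graph with heat-kernel weights."""
    A = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    A = np.maximum(A, A.T)                          # symmetrize the adjacency graph
    W = np.where(A > 0, np.exp(-A ** 2 / t), 0.0)   # heat-kernel edge weights
    D = np.diag(W.sum(axis=1))                      # degree matrix
    L = D - W                                       # graph Laplacian
    w, V = eigh(L, D)                               # generalized problem L f = lambda D f
    return V[:, 1:m + 1]                            # drop the trivial constant eigenvector
```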

A Unified Framework for Dimension Reduction Algorithms All of these algorithms use an eigendecomposition of a normalized affinity matrix to obtain a lower-dimensional embedding of data lying on a nonlinear manifold. The embedding has two alternative solutions. Solution 1 (MDS and Isomap): the embedding coordinates are the leading eigenvectors scaled by the square roots of their eigenvalues, which gives the best approximation of the normalized affinity matrix in the squared-error sense. Solution 2 (LLE and Laplacian Eigenmap): the embedding coordinates are the eigenvectors themselves.

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning Conclusions

Distance Metric Learning based on SVM: Large Margin Nearest Neighbor Based Distance Metric Learning (objective function; reformulation as SDP); Cast Kernel Margin Maximization into an SDP Problem (maximum margin; cast into an SDP problem; applied to hard margin and soft margin).

Large Margin Nearest Neighbor Based Distance Metric Learning [K. Weinberger et al., 2006] Learns a Mahalanobis distance metric in the kNN classification setting via SDP, enforcing that the k nearest neighbors of each input belong to the same class while examples from different classes are separated by a large margin. [Figure: after training with k = 3, the target neighbors lie within a smaller radius, and differently labeled inputs lie outside this radius with a margin of at least one unit distance.]

Large Margin Nearest Neighbor Based Distance Metric Learning Cost function: penalize large distances between each input and its target neighbors; a hinge loss is incurred by differently labeled inputs whose distance to an input does not exceed, by at least one absolute unit of distance, the distance from that input to any of its target neighbors, so that differently labeled inputs do not invade each other's neighborhoods.
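In the notation of Weinberger et al. (eta_ij = 1 if x_j is a target neighbor of x_i, y_il = 1 iff x_i and x_l share a label, and [z]_+ = max(z, 0)), the cost over the linear map L is, up to the choice of the trade-off constant c:

```latex
\varepsilon(L) = \sum_{ij} \eta_{ij} \|L(x_i - x_j)\|^2
+ c \sum_{ijl} \eta_{ij} (1 - y_{il}) \Big[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \Big]_+ .
```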

Reformulation as SDP
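With M = L^T L and slack variables xi_ijl, the same problem becomes a semidefinite program (again following Weinberger et al.), with the constraints imposed for every triple (i, j, l) such that eta_ij = 1 and y_il = 0:

```latex
\min_{M, \xi} \; \sum_{ij} \eta_{ij} (x_i - x_j)^\top M (x_i - x_j)
+ c \sum_{ijl} \eta_{ij} (1 - y_{il})\, \xi_{ijl}
\quad \text{s.t.} \quad
(x_i - x_l)^\top M (x_i - x_l) - (x_i - x_j)^\top M (x_i - x_j) \ge 1 - \xi_{ijl},
\quad \xi_{ijl} \ge 0, \quad M \succeq 0 .
```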

Cast Kernel Margin Maximization into a SDP Problem [G. R. G. Lanckriet et al., 2004] Maximum margin: the decision boundary maximizes the minimum distance to the closest training point. Hard margin: linearly separable data; soft margin: data that are not linearly separable. The performance measure is generalized from the dual solutions of the different margin-maximization problems.
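As a reference point (the exact scaling in Lanckriet et al. differs only by constants), the hard-margin performance measure is the optimal value of the standard SVM dual, viewed as a function of the kernel matrix K:

```latex
\omega(K) = \max_{\alpha} \; 2\,\alpha^\top e - \alpha^\top \operatorname{diag}(y)\, K\, \operatorname{diag}(y)\, \alpha
\qquad \text{s.t.} \qquad \alpha \ge 0, \;\; \alpha^\top y = 0 ,
```

which is inversely related to the squared margin achieved with K; the 1-norm soft-margin version additionally upper-bounds alpha by the regularization constant C.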

Cast into an SDP problem: hard margin; 1-norm soft margin.

Outline Introduction Supervised Global Distance Metric Learning Supervised Local Distance Metric Learning Unsupervised Distance Metric Learning Distance Metric Learning based on SVM Kernel Methods for Distance Metrics Learning Conclusions

Kernel Methods for Distance Metrics Learning Learning a good kernel is equivalent to learning a distance metric. Topics: Kernel Alignment; Kernel Alignment with SDP; Learning with Idealized Kernel (the ideal kernel and the idealized kernel).

Kernel Alignment [N. Cristianini et al., 2001] A measure of similarity between two kernel functions, or between a kernel and a target function, based on the inner product between the two kernel matrices built from kernels k1 and k2. The alignment of K1 and K2 with respect to the sample S measures the degree of agreement between a kernel and a given learning task.
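Written out, with the Frobenius inner product taken over the sample S:

```latex
\hat{A}(S, K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \, \langle K_2, K_2 \rangle_F}} ,
\qquad
\langle K_1, K_2 \rangle_F = \sum_{i,j} K_1(x_i, x_j)\, K_2(x_i, x_j) .
```

For a two-class task with labels y in {-1, +1}, taking K_2 = y y^T measures how well the kernel agrees with the labels.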

Kernel Alignment with SDP [G. R. G. Lanckriet et al., 2004] Optimizes the alignment between a set of labels and a kernel matrix using SDP in a transductive setting: optimizing the objective over the training-data block of the kernel matrix automatically tunes the testing-data block. Introducing an auxiliary matrix A reduces the problem to a standard SDP.

Learning with Idealized Kernel [J. T. Kwok and I. W. Tsang, 2003] Idealizes a given kernel by making it more similar to the ideal kernel matrix. The alignment of the idealized kernel is greater than that of the original kernel k (the improvement depends on the numbers of positive and negative samples). Under the original distance metric M, the kernel induces a set of pairwise distances, which the idealization then adjusts.
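The ideal kernel, and the general form of the idealized kernel (the precise scaling of the correction follows Kwok and Tsang and is shown here only schematically):

```latex
K^*(x_i, x_j) = \begin{cases} 1, & y_i = y_j \\ 0, & y_i \neq y_j \end{cases},
\qquad
\tilde{K} = K + \frac{\gamma}{2}\, K^* , \quad \gamma \ge 0 .
```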

Idealized Kernel We modify the kernel by searching for a matrix A under which points from different classes are pulled apart by at least a prescribed margin, while points from the same class are brought closer together. Slack variables are introduced for error tolerance.

Conclusions A comprehensive review, covering: supervised distance metric learning; unsupervised distance metric learning; maximum-margin-based distance metric learning approaches; kernel methods for distance metrics. Challenges: unsupervised distance metric learning; going local in a principled manner; learning an explicit nonlinear distance metric in the local sense; efficiency issues.