Dimensionality Reduction and Embeddings


Term Project
Progress Report (April 10, 2017): name, problem definition, progress so far, plan to continue.
Final Report (May 3, 2017): name, title, intro, problem, solution, experimental results, conclusions.

Embeddings
Given a metric distance matrix D, embed the objects in a k-dimensional vector space using a mapping F such that D(i,j) is close to D'(F(i), F(j)), where D' is some Lp measure.
Isometric mapping: exact preservation of distances.
Contractive mapping: D'(F(i), F(j)) <= D(i,j).
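To make these definitions concrete, here is a minimal numpy sketch, assuming D is an n x n matrix of original distances and F an n x k array of embedded coordinates (the function names are only illustrative):

import numpy as np

def embedded_distances(F, p=2):
    """Pairwise Lp distances D'(F(i), F(j)) between embedded points (rows of F)."""
    diff = F[:, None, :] - F[None, :, :]
    return np.linalg.norm(diff, ord=p, axis=-1)

def is_contractive(D, F, p=2, tol=1e-9):
    """Check the contractive property D'(F(i), F(j)) <= D(i, j) for every pair."""
    return bool(np.all(embedded_distances(F, p) <= D + tol))

def max_relative_error(D, F, p=2):
    """How far D' is from D in the worst case (0 would mean an isometric mapping)."""
    Dp = embedded_distances(F, p)
    off = ~np.eye(len(D), dtype=bool)     # ignore the zero diagonal
    return float(np.max(np.abs(Dp[off] - D[off]) / D[off]))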

PCA
Intuition: find the axis that shows the greatest variation, and project all points onto this axis.
(Figure: original axes f1, f2 and principal axes e1, e2.)

SVD: The mathematical formulation
Normalize the dataset by moving the origin to the center of the dataset.
Find the eigenvectors of the data (or covariance) matrix; these define the new space.
Sort the eigenvalues in "goodness" order.
(Figure: original axes f1, f2 and principal axes e1, e2.)
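A minimal numpy sketch of these steps (pca_project is an illustrative name; taking the top right singular vectors of the centered data with np.linalg.svd gives the same axes):

import numpy as np

def pca_project(X, k):
    """Project the rows of X (n points in d dimensions) onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)               # move the origin to the center of the dataset
    C = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in "goodness" (descending) order
    axes = eigvecs[:, order[:k]]          # the new axes e1, ..., ek
    return Xc @ axes, eigvals[order[:k]]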

SVD Cont'd
Advantages: optimal dimensionality reduction (for linear projections).
Disadvantages: computationally expensive, though this can be improved with random sampling; sensitive to outliers and non-linearities.

Multi-Dimensional Scaling (MDS)
Map the items into a k-dimensional space trying to minimize the stress.
Steepest descent algorithm: start with an assignment, then minimize the stress by moving points.
But the running time is O(N^2), and adding a new item takes O(N).
Another method: iterative majorization.
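To illustrate the steepest-descent idea, the sketch below starts from a random assignment and repeatedly moves points against the gradient of the raw stress; the step size and iteration count are arbitrary choices, and in practice a library routine such as sklearn.manifold.MDS(dissimilarity='precomputed'), which uses iterative majorization (SMACOF), is preferable:

import numpy as np

def mds_steepest_descent(D, k=2, iters=500, lr=0.01, seed=0):
    """Crude MDS: start from a random assignment and reduce the raw stress
    sum_{i<j} (||x_i - x_j|| - D[i, j])^2 by moving points down its gradient."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, k))
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]      # pairwise difference vectors
        dist = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(dist, 1.0)               # avoid dividing by zero on the diagonal
        coef = (dist - D) / dist
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad                            # steepest-descent step
    return X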

FastMap
What if we have a finite metric space (X, d)? Faloutsos and Lin (1995) proposed FastMap as a metric analogue to PCA. Imagine that the points are in a Euclidean space.
Select two pivot points xa and xb that are far apart.
Compute a pseudo-projection of the remaining points along the "line" xa-xb.
"Project" the points onto an orthogonal subspace and recurse.

FastMap
We want to find e1 first.
(Figure: original axes f1, f2 and principal axes e1, e2.)

Selecting the Pivot Points
The pivot points should lie along the principal axes, and hence should be far apart.
Select any point x0. Let x1 be the point furthest from x0. Let x2 be the point furthest from x1. Return (x1, x2).

Pseudo-Projections
Given pivots (xa, xb), for any third point y, we use the law of cosines to determine the position of y along the line xa-xb:
d(b,y)^2 = d(a,y)^2 + d(a,b)^2 - 2 c_y d(a,b),
so the pseudo-projection of y is
c_y = (d(a,y)^2 + d(a,b)^2 - d(b,y)^2) / (2 d(a,b)).
This is the first coordinate.

"Project to orthogonal plane"
Given the coordinates c along xa-xb, we can compute distances within the "orthogonal hyperplane" using the Pythagorean theorem:
d'(y', z')^2 = d(y, z)^2 - (c_y - c_z)^2.
Using d'(.,.), recurse until k features have been chosen.

Compute the next coordinate
Now we have projected all objects into a subspace orthogonal to the first dimension (the line xa-xb).
We can apply FastMap recursively on the projected dataset: FastMap(k-1, d', D). A sketch of the full recursion follows.
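Putting the last few slides together, here is a hedged sketch of the recursion (choose_pivots and fastmap are illustrative names; d is the original metric given as a callable, ids is a list of object identifiers):

import numpy as np

def choose_pivots(d, ids, seed=0):
    """Heuristic from the slides: pick any x0, take xa farthest from x0, then xb farthest from xa."""
    rng = np.random.default_rng(seed)
    x0 = ids[rng.integers(len(ids))]
    xa = max(ids, key=lambda o: d(x0, o))
    xb = max(ids, key=lambda o: d(xa, o))
    return xa, xb

def fastmap(k, d, ids):
    """Map each object id to k FastMap coordinates; d(i, j) is the original (metric) distance."""
    coords = {o: [] for o in ids}
    for _ in range(k):
        xa, xb = choose_pivots(d, ids)
        dab = d(xa, xb)
        if dab == 0.0:                    # all remaining distances are zero
            for o in ids:
                coords[o].append(0.0)
            continue
        # pseudo-projection along the "line" xa-xb (law of cosines)
        c = {o: (d(xa, o) ** 2 + dab ** 2 - d(xb, o) ** 2) / (2.0 * dab) for o in ids}
        for o in ids:
            coords[o].append(c[o])
        # distances in the hyperplane orthogonal to xa-xb (Pythagorean theorem)
        def d_next(i, j, d_prev=d, c_prev=c):
            return np.sqrt(max(d_prev(i, j) ** 2 - (c_prev[i] - c_prev[j]) ** 2, 0.0))
        d = d_next                        # recurse with the projected distance d'
    return {o: np.array(v) for o, v in coords.items()}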

Embedding using ML
We can also use learning techniques to "learn" a good mapping. This works for general metric spaces, not only Euclidean spaces.
Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, George Kollios: BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 30(1): 89-104 (2008)

BoostMap Embedding
(Figure: the database objects x1, ..., xn live in the original space; an embedding F maps each of them, and any query q, to a vector in Rd.)
Once everything is embedded, we measure distances between vectors (typically much faster).
Caveat: the embedding must preserve similarity structure.

Reference Object Embeddings
Pick a reference object r in the original space X and define a 1D embedding F onto the real line by F(x) = D(x, r), where D is the distance measure in X.
Then F(r) = D(r, r) = 0, and if a and b are similar, their distances to r are also similar (usually).
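In code this is a one-liner (reference_embedding is an illustrative name; D is assumed to be a callable distance function). Note that by the triangle inequality |D(a, r) - D(b, r)| <= D(a, b), so a reference-object embedding is contractive in the sense defined earlier:

def reference_embedding(D, r):
    """1D embedding F(x) = D(x, r) for a fixed reference object r.
    By the triangle inequality, |F(a) - F(b)| <= D(a, b), so F is a contractive mapping."""
    return lambda x: D(x, r)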

F(x) = D(x, Lincoln)
F(Sacramento)....= 1543
F(Las Vegas).....= 1232
F(Oklahoma City).= 437
F(Washington DC).= 1207
F(Jacksonville)..= 1344

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))
F(Sacramento)....= ( 386, 1543, 2920)
F(Las Vegas).....= ( 262, 1232, 2405)
F(Oklahoma City).= (1345, 437, 1291)
F(Washington DC).= (2657, 1207, 853)
F(Jacksonville)..= (2422, 1344, 141)
Examples (true road distance vs. embedded L2 distance): Oklahoma City - Washington DC: 1320 vs. 1595; Oklahoma City - Jacksonville: 1200 vs. 1827; Oklahoma City - Las Vegas: 1115 vs. 1730. (Jacksonville - Washington DC: 711 miles.)
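The comparison can be reproduced with a small sketch; the coordinates below are copied from the slide, and the true road distances are not computed here:

import numpy as np

# Embedded coordinates taken from the slide: F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando)).
F = {
    "Sacramento":    ( 386, 1543, 2920),
    "Las Vegas":     ( 262, 1232, 2405),
    "Oklahoma City": (1345,  437, 1291),
    "Washington DC": (2657, 1207,  853),
    "Jacksonville":  (2422, 1344,  141),
}

def embedded_distance(a, b):
    """L2 distance between embedded cities; an estimate of the true road distance."""
    return float(np.linalg.norm(np.array(F[a]) - np.array(F[b])))

# e.g. embedded_distance("Oklahoma City", "Washington DC") approximates the true road distance
# of about 1320 miles (it overestimates, as the slide's comparison shows).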

Basic Questions What is a good way to optimize an embedding?

Basic Questions F(x) = (D(x, LA), D(x, Denver), D(x, Boston)) What is a good way to optimize an embedding? What are the best reference objects? What distance should we use in R3?

Key Idea Embeddings can be seen as classifiers. Embedding construction can be seen as a machine learning problem. Formulation is natural. We optimize exactly what we want to optimize.

Ideal Embedding Behavior
Notation: NN(q) is the nearest neighbor of q.
For any q: if a = NN(q), we want F(a) = NN(F(q)).

A Quantitative Measure
If b is not the nearest neighbor of q, F(q) should be closer to F(NN(q)) than to F(b).
For how many triples (q, NN(q), b) does F fail?

A Quantitative Measure
In the example shown on the slide, F fails on five triples.

Embeddings Seen As Classifiers
Classification task: is q closer to a or to b?
Any embedding F defines a classifier F'(q, a, b): F' checks whether F(q) is closer to F(a) or to F(b).

Classifier Definition
Given an embedding F: X → Rd, define F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
F'(q, a, b) > 0 means "q is closer to a"; F'(q, a, b) < 0 means "q is closer to b".
Goal: build an F such that F' has a low error rate on triples of type (q, NN(q), b).
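A small sketch of this classifier and of the error measure on triples (triple_classifier and triple_error_rate are illustrative names; F is assumed to return numpy vectors):

import numpy as np

def triple_classifier(F, q, a, b):
    """F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
    Positive: the embedding puts q closer to a; negative: closer to b."""
    return float(np.linalg.norm(F(q) - F(b)) - np.linalg.norm(F(q) - F(a)))

def triple_error_rate(F, triples):
    """Fraction of (q, NN(q), b) triples on which F puts q closer to b than to its true NN."""
    failures = sum(1 for q, nn_q, b in triples if triple_classifier(F, q, nn_q, b) <= 0)
    return failures / len(triples)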

1D Embeddings as Weak Classifiers
1D embeddings define weak classifiers: better than a random classifier (50% error rate).
We can define lots of different classifiers; every object in the database can be a reference object.
Question: how do we combine many such classifiers into a single strong classifier?
Answer: use AdaBoost. AdaBoost is a machine learning method designed for exactly this problem.

Using AdaBoost
Start with many 1D embeddings F1, ..., Fn, each mapping the original space X to the real line.
Output: H = w1F'1 + w2F'2 + … + wdF'd.
AdaBoost chooses 1D embeddings and weights them. Goal: achieve low classification error.
AdaBoost trains on triples chosen from the database.

From Classifier to Embedding
AdaBoost output: H = w1F'1 + w2F'2 + … + wdF'd.
What embedding and what distance measure should we use?
BoostMap embedding: F(x) = (F1(x), …, Fd(x)).
Distance measure: D((u1, …, ud), (v1, …, vd)) = Σi=1..d wi |ui - vi|.
Claim: Let q be closer to a than to b. H misclassifies the triple (q, a, b) if and only if, under distance measure D, F maps q closer to b than to a.

i=1 i=1 i=1 Proof d d d H(q, a, b) = = wiF’i(q, a, b) = wi(|Fi(q) - Fi(b)| - |Fi(q) - Fi(a)|) = (wi|Fi(q) - Fi(b)| - wi|Fi(q) - Fi(a)|) = D(F(q), F(b)) – D(F(q), F(a)) = F’(q, a, b) i=1 d i=1 d i=1 d

Significance of Proof AdaBoost optimizes a direct measure of embedding quality. We have converted a database indexing problem into a machine learning problem.

Recap of BoostMap Algorithm
Start with a large collection of 1D embeddings; each embedding defines a weak classifier on triples of objects.
AdaBoost combines many weak classifiers into a strong classifier.
The strong classifier defines an embedding and a weighted L1 distance; the classifier is equivalent to that embedding and distance. (A simplified training sketch follows.)
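The sketch below is a heavily simplified, illustrative version of this recipe (not the authors' implementation): every database object r defines the 1D weak classifier from the earlier slides via F_r(x) = D(x, r), and a basic discrete AdaBoost loop picks one per round and assigns it a weight.

import numpy as np

def weak_margins(D, r, triples):
    """Margins of the 1D weak classifier defined by reference object r on all training triples."""
    return np.array([abs(D(q, r) - D(b, r)) - abs(D(q, r) - D(a, r)) for q, a, b in triples])

def boostmap_train(D, database, triples, rounds=20):
    """Simplified BoostMap-style training: pick one reference object per AdaBoost round."""
    w = np.ones(len(triples)) / len(triples)          # AdaBoost weights over training triples
    refs, alphas = [], []
    for _ in range(rounds):
        errs = {r: w[weak_margins(D, r, triples) <= 0].sum() for r in database}
        r = min(errs, key=errs.get)                   # weak classifier with lowest weighted error
        err = min(max(errs[r], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)       # its weight in the strong classifier
        correct = weak_margins(D, r, triples) > 0
        w *= np.exp(np.where(correct, -alpha, alpha)) # reweight triples, emphasizing mistakes
        w /= w.sum()
        refs.append(r)
        alphas.append(alpha)
    return refs, np.array(alphas)

def boostmap_embed(D, refs, x):
    """The learned embedding: F(x) = (D(x, r1), ..., D(x, rd))."""
    return np.array([D(x, r) for r in refs])

def weighted_l1(alphas, u, v):
    """The weighted L1 distance that accompanies the learned embedding."""
    return float(np.sum(alphas * np.abs(u - v)))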

Basic Questions F(x) = (D(x, LA), D(x, Denver), D(x, Boston)) What is a good way to optimize an embedding? What are the best reference objects? What distance should we use in R3?

How Do We Use It?
Filter-and-refine retrieval:
Offline step: compute the embedding F of the entire database.
Given a query object q:
Embedding step: compute the distances from the query to the reference objects to obtain F(q).
Filter step: find the top p matches of F(q) in the vector space.
Refine step: measure the exact distance from q to the top p matches. (See the sketch below.)
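A minimal sketch of this pipeline, assuming the database embeddings emb have been precomputed offline and that refs/weights come from a reference-object embedding such as the one above (all names are illustrative):

import numpy as np

def filter_and_refine(q, database, D, refs, weights, emb, p=50):
    """Filter-and-refine retrieval sketch.
    emb[i] holds the precomputed embedding of database[i] (offline step);
    refs and weights come from the embedding construction (e.g. BoostMap)."""
    fq = np.array([D(q, r) for r in refs])                       # embedding step
    approx = np.sum(weights * np.abs(emb - fq), axis=1)          # weighted L1 in vector space
    candidates = np.argsort(approx)[:p]                          # filter step: top p matches
    best = min(candidates, key=lambda i: D(q, database[i]))      # refine step: exact distances
    return database[best]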

Evaluating Embedding Quality
How often do we find the true nearest neighbor?
Embedding step: compute the distances from the query to the reference objects to obtain F(q).
Filter step: find the top p matches of F(q) in the vector space.
Refine step: measure the exact distance from q to the top p matches.

Random Projections
Based on the Johnson-Lindenstrauss lemma: for any (sufficiently large) set S of M points in Rn and k = O(ε^-2 ln M), there exists a linear map f: S → Rk such that for all s, t in S:
(1 - ε) D(s, t) < D(f(s), f(t)) < (1 + ε) D(s, t).
A random projection achieves this with constant probability.

Random Projection: Application
Set k = O(ε^-2 ln M).
Select k random n-dimensional vectors (one approach: draw Gaussian vectors with mean 0 and variance 1, i.e. N(0,1) entries).
Project the original points onto the k vectors.
The resulting k-dimensional space approximately preserves the distances with high probability. (See the sketch below.)
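A minimal numpy sketch of this recipe (the particular constant used for k is one common form of the JL bound and is only indicative; any k of that order works):

import numpy as np

def random_projection(X, eps=0.1, seed=0):
    """Project the rows of X (M points in n dimensions) onto k random Gaussian directions,
    with k = O(eps^-2 ln M); the constant below is one common JL-lemma bound."""
    rng = np.random.default_rng(seed)
    M, n = X.shape
    k = int(np.ceil(4.0 * np.log(M) / (eps ** 2 / 2.0 - eps ** 3 / 3.0)))
    R = rng.normal(0.0, 1.0, size=(n, k))   # k random N(0,1) vectors as columns
    return (X @ R) / np.sqrt(k)             # scaling preserves distances in expectation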

Database-Friendly Random Projection
For each point (vector) x in d dimensions we need to find its projection y in k dimensions. For n points, the naive approach needs ndk operations, which can be large for large datasets and dimensionalities.
A better approach is the following [Achlioptas 2003]. Create a matrix A such that:
A[i,j] = +1 with probability 1/6, 0 with probability 2/3, -1 with probability 1/6
(scaled by sqrt(3) so that distances are preserved in expectation).
Then we can compute each y as y = xA.
Why is this better? About two thirds of the entries of A are zero and the rest are ±1, so computing y needs only additions and subtractions, roughly a third of the naive work.
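A sketch of this construction (a real implementation would exploit the sparsity of A rather than form a dense matrix; the sqrt(3) and 1/sqrt(k) factors keep squared distances correct in expectation):

import numpy as np

def achlioptas_projection(X, k, seed=0):
    """Database-friendly projection [Achlioptas 2003]: A[i, j] is +1 w.p. 1/6, 0 w.p. 2/3,
    -1 w.p. 1/6 (times sqrt(3)), so y = xA needs only additions and subtractions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return (X @ A) / np.sqrt(k)              # normalization keeps squared distances unbiased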

Random Projection
A very useful technique, especially when used in conjunction with another technique (for example, SVD): use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce the dimensionality further.
References:
[Achlioptas 2003] Dimitris Achlioptas: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4): 671-687 (2003).