Dimensionality Reduction and Embeddings
Term Project
Progress Report, April 10, 2017: name, problem definition, progress so far, plan to continue.
Final Report, May 3, 2017: name, title, intro, problem, solution, experimental results, conclusions.
Embeddings
Given a metric distance matrix D, embed the objects in a k-dimensional vector space using a mapping F such that D(i, j) is close to D'(F(i), F(j)).
Isometric mapping: exact preservation of distances.
Contractive mapping: D'(F(i), F(j)) <= D(i, j).
D' is some Lp measure.
PCA
Intuition: find the axis that shows the greatest variation, and project all points onto this axis.
[Figure: points plotted on the original axes f1, f2, with the principal axes e1, e2.]
SVD: The Mathematical Formulation
Normalize the dataset by moving the origin to the center of the dataset.
Find the eigenvectors of the data (or covariance) matrix; these define the new space.
Sort the eigenvalues in "goodness" order.
[Figure: principal axes e1, e2 for points originally given on axes f1, f2.]
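As a concrete illustration, here is a minimal numpy sketch of these steps (centering, SVD of the data matrix, keeping the top axes). The function name and variables are illustrative, not part of the original slides.

```python
import numpy as np

def pca_svd(X, k):
    """Project the rows of X (n points, d dimensions) onto the top-k principal axes."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # move the origin to the center
    # SVD of the centered data matrix: the rows of Vt are the eigenvectors of the
    # covariance matrix, already sorted by decreasing singular value ("goodness").
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    axes = Vt[:k]                                 # the new axes e1, ..., ek
    eigenvalues = s[:k] ** 2 / (len(X) - 1)       # variance captured by each axis
    return Xc @ axes.T, axes, eigenvalues

# Example: 100 random 5-dimensional points reduced to 2 dimensions.
X = np.random.randn(100, 5)
Y, axes, eigenvalues = pca_svd(X, k=2)
```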
SVD Cont'd
Advantages: optimal dimensionality reduction (for linear projections).
Disadvantages: computationally expensive (but can be improved with random sampling); sensitive to outliers and non-linearities.
Multi-Dimensional Scaling (MDS)
Map the items into a k-dimensional space trying to minimize the stress.
Steepest-descent algorithm: start with an assignment, then minimize the stress by moving points.
But the running time is O(N^2), and adding a new item costs O(N).
Another method: iterative majorization.
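A small illustrative sketch of the steepest-descent idea, assuming D is a symmetric numpy distance matrix with zero diagonal; a fixed learning rate stands in for a proper step-size search, and the names are made up.

```python
import numpy as np

def mds_steepest_descent(D, k=2, iters=500, lr=0.005, seed=0):
    """Minimize the raw stress  sum_{i<j} (||y_i - y_j|| - D_ij)^2  by moving points."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(size=(n, k))                    # start with a random assignment
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]       # pairwise difference vectors
        dist = np.linalg.norm(diff, axis=2) + 1e-12
        factor = (dist - D) / dist                 # how far each pair is off target
        grad = 2 * (factor[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                             # move points to reduce the stress
    return Y

# Note: each iteration touches all O(N^2) pairs, matching the cost noted above.
```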
FastMap
What if we have a finite metric space (X, d)?
Faloutsos and Lin (1995) proposed FastMap as a metric analogue of PCA. Imagine that the points lie in a Euclidean space.
Select two pivot points xa and xb that are far apart.
Compute a pseudo-projection of the remaining points along the "line" xa xb.
"Project" the points onto an orthogonal subspace and recurse.
FastMap
We want to find e1 first.
[Figure: points on axes f1, f2 with principal directions e1, e2.]
Selecting the Pivot Points
The pivot points should lie along the principal axes, and hence should be far apart.
Select any point x0.
Let x1 be the point furthest from x0.
Let x2 be the point furthest from x1.
Return (x1, x2).
Pseudo-Projections
Given pivots (xa, xb), for any third point y, we use the law of cosines to determine the position of y along xa xb.
The pseudo-projection for y is
cy = (da,y^2 + da,b^2 - db,y^2) / (2 da,b).
This is the first coordinate.
"Project to the Orthogonal Plane"
Given the coordinates cy, cz along xa xb, we can compute distances within the "orthogonal hyperplane" using the Pythagorean theorem:
d'(y', z')^2 = d(y, z)^2 - (cz - cy)^2.
Using d'(., .), recurse until k features have been chosen.
Compute the Next Coordinate
Now we have projected all objects into a subspace orthogonal to the first dimension (the line xa xb).
We can apply FastMap recursively on the new projected dataset: FastMap(k-1, d', D).
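A compact sketch of the full FastMap recursion (pivot selection, law-of-cosines pseudo-projection, and recursion in the orthogonal subspace), assuming only a distance function d; all names are illustrative.

```python
import numpy as np

def fastmap(k, d, objects):
    """Embed `objects` into k dimensions given a metric distance function d(x, y)."""
    n = len(objects)
    coords = np.zeros((n, k))

    def dist(i, j, col):
        # Distance in the subspace orthogonal to the first `col` axes
        # (Pythagorean theorem applied to the coordinates computed so far).
        return np.sqrt(max(d(objects[i], objects[j]) ** 2
                           - np.sum((coords[i, :col] - coords[j, :col]) ** 2), 0.0))

    def choose_pivots(col):
        # Heuristic from the slides: pick any point, take the point furthest
        # from it, then the point furthest from that one.
        a = 0
        b = max(range(n), key=lambda j: dist(a, j, col))
        a = max(range(n), key=lambda j: dist(b, j, col))
        return a, b

    for col in range(k):
        a, b = choose_pivots(col)
        d_ab = dist(a, b, col)
        if d_ab == 0:                      # all remaining distances are zero
            break
        for i in range(n):                 # pseudo-projection via the law of cosines
            coords[i, col] = (dist(a, i, col) ** 2 + d_ab ** 2
                              - dist(b, i, col) ** 2) / (2 * d_ab)
    return coords

# Usage with Euclidean points (any metric distance d would work the same way):
pts = list(np.random.randn(50, 10))
emb = fastmap(3, lambda x, y: float(np.linalg.norm(x - y)), pts)
```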
Embedding Using ML
We can use learning techniques to "learn" a good mapping. This works for general metric spaces, not only Euclidean spaces.
Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, George Kollios: BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 30(1), 2008.
BoostMap
[Figure: a database of objects x1, x2, x3, ..., xn in the original space.]
Embeddings
[Figure: the database objects x1, ..., xn and a query q in the original space are mapped by an embedding F to vectors in R^d.]
Measure distances between the vectors instead (typically much faster).
Caveat: the embedding must preserve the similarity structure.
Reference Object Embeddings
[Figure: the original space X mapped by F to the real line.]
r: reference object.
Embedding: F(x) = D(x, r), where D is the distance measure in X.
F(r) = D(r, r) = 0.
If a and b are similar, their distances to r are also similar (usually).
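A minimal sketch of a reference-object embedding, with a toy Euclidean distance standing in for the (possibly expensive) distance measure D of the original space; the helper name is illustrative.

```python
import numpy as np

def reference_embedding(x, refs, d):
    """F(x) = (D(x, r1), ..., D(x, rk)): distances to the chosen reference objects."""
    return np.array([d(x, r) for r in refs])

# Toy example: 2-D "city" coordinates with two reference objects.
d = lambda a, b: float(np.linalg.norm(np.subtract(a, b)))
refs = [(0.0, 0.0), (10.0, 0.0)]
print(reference_embedding((3.0, 4.0), refs, d))   # -> [5.0, ~8.06]
# F(r) = D(r, r) = 0 for a reference object itself:
print(reference_embedding(refs[0], refs, d))      # -> [0.0, 10.0]
```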
F(x) = D(x, Lincoln)
F(Sacramento)....= 1543
F(Las Vegas).....= 1232
F(Oklahoma City).= 437
F(Washington DC).= 1207
F(Jacksonville)..= 1344
F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))
F(Sacramento)....= ( 386, 1543, 2920)
F(Las Vegas).....= ( 262, 1232, 2405)
F(Oklahoma City).= (1345, 437, 1291)
F(Washington DC).= (2657, 1207, 853)
F(Jacksonville)..= (2422, 1344, 141)
Comparing true distances with embedded (L2) distances:
Oklahoma City - Washington DC: 1320 true, 1595 embedded.
Oklahoma City - Jacksonville: 1200 true, 1827 embedded.
Oklahoma City - Las Vegas: 1115 true, 1730 embedded.
Jacksonville - Washington DC: 711 miles.
Basic Questions
What is a good way to optimize an embedding?
Basic Questions
F(x) = (D(x, LA), D(x, Denver), D(x, Boston))
What is a good way to optimize an embedding?
What are the best reference objects?
What distance should we use in R^3?
Key Idea
Embeddings can be seen as classifiers.
Embedding construction can be seen as a machine learning problem.
The formulation is natural: we optimize exactly what we want to optimize.
Ideal Embedding Behavior
[Figure: F maps the original space X to R^d.]
Notation: NN(q) is the nearest neighbor of q.
For any q: if a = NN(q), we want F(a) = NN(F(q)).
A Quantitative Measure
If b is not the nearest neighbor of q, F(q) should be closer to F(NN(q)) than to F(b).
For how many triples (q, NN(q), b) does F fail?
A Quantitative Measure
[Example figure: on this embedding, F fails on five triples.]
Embeddings Seen As Classifiers
Classification task: is q closer to a or to b?
Any embedding F defines a classifier F'(q, a, b): F' checks whether F(q) is closer to F(a) or to F(b).
Classifier Definition
Classification task: is q closer to a or to b?
Given an embedding F: X -> R^d, define
F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
F'(q, a, b) > 0 means "q is closer to a."
F'(q, a, b) < 0 means "q is closer to b."
Goal: build an F such that F' has a low error rate on triples of the type (q, NN(q), b).
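A small sketch of this classifier, assuming F returns numpy vectors; the example uses a 1D reference-object embedding on the real line, with arbitrary values for q, a, b.

```python
import numpy as np

def F_prime(F, dist, q, a, b):
    """F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
    Positive -> the embedding says q is closer to a.
    Negative -> the embedding says q is closer to b."""
    return dist(F(q), F(b)) - dist(F(q), F(a))

# Example with a 1-D reference-object embedding F(x) = d(x, r):
d = lambda x, y: abs(x - y)          # original distance (here: on the real line)
r = 0.0
F = lambda x: np.array([d(x, r)])
print(F_prime(F, lambda u, v: np.linalg.norm(u - v), q=2.0, a=3.0, b=10.0) > 0)  # True
```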
1D Embeddings as Weak Classifiers
1D embeddings define weak classifiers: better than a random classifier (50% error rate).
We can define lots of different classifiers: every object in the database can be a reference object.
Question: how do we combine many such classifiers into a single strong classifier?
Answer: use AdaBoost, a machine learning method designed for exactly this problem.
Using AdaBoost
[Figure: 1D embeddings F1, F2, ..., Fn map the original space X to the real line.]
AdaBoost chooses 1D embeddings and weighs them, training on triples chosen from the database.
Goal: achieve low classification error.
Output: H = w1 F'1 + w2 F'2 + ... + wd F'd.
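A heavily simplified sketch of this training loop: discrete AdaBoost over labeled triples, with reference-object 1D embeddings as the only weak learners. The actual BoostMap method uses richer families of 1D embeddings and several efficiency optimizations, so treat this purely as an illustration; all names are made up.

```python
import numpy as np

def boostmap_train(candidates, d, triples, rounds=10):
    """Pick `rounds` reference objects and weights with discrete AdaBoost.
    Each triple (q, a, b) is assumed to be labeled +1: q is truly closer to a."""
    w = np.full(len(triples), 1.0 / len(triples))     # weights over training triples

    def weak_sign(r, q, a, b):                        # sign of F'_r(q, a, b)
        fq, fa, fb = d(q, r), d(a, r), d(b, r)
        return 1.0 if abs(fq - fb) - abs(fq - fa) > 0 else -1.0

    refs, alphas = [], []
    for _ in range(rounds):
        best_r, best_err, best_preds = None, np.inf, None
        for r in candidates:                          # every object can be a reference
            preds = np.array([weak_sign(r, q, a, b) for (q, a, b) in triples])
            err = float(w[preds < 0].sum())           # weighted error (labels are +1)
            if err < best_err:
                best_r, best_err, best_preds = r, err, preds
        best_err = float(np.clip(best_err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - best_err) / best_err)
        w = w * np.exp(-alpha * best_preds)           # up-weight misclassified triples
        w = w / w.sum()
        refs.append(best_r)
        alphas.append(alpha)
    # H = sum_i alpha_i F'_i; equivalently F(x) = (d(x, refs[0]), ..., d(x, refs[-1]))
    # with a weighted L1 distance using the alphas (see the slides that follow).
    return refs, alphas
```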
From Classifier to Embedding
AdaBoost output: H = w1 F'1 + w2 F'2 + ... + wd F'd.
What embedding should we use? What distance measure should we use?
BoostMap embedding: F(x) = (F1(x), ..., Fd(x)).
Distance measure: D((u1, ..., ud), (v1, ..., vd)) = Σ_{i=1}^{d} wi |ui - vi|.
Claim: let q be closer to a than to b. H misclassifies the triple (q, a, b) if and only if, under the distance measure D, F maps q closer to b than to a.
Proof
H(q, a, b) = Σ_{i=1}^{d} wi F'i(q, a, b)
= Σ_{i=1}^{d} wi (|Fi(q) - Fi(b)| - |Fi(q) - Fi(a)|)
= Σ_{i=1}^{d} (wi |Fi(q) - Fi(b)| - wi |Fi(q) - Fi(a)|)
= D(F(q), F(b)) - D(F(q), F(a))
= F'(q, a, b)
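A quick numeric spot-check of the identity above (not part of the original slides):

```python
import numpy as np

# With the weighted L1 distance, H(q, a, b) = sum_i wi * F'i(q, a, b)
# equals D(F(q), F(b)) - D(F(q), F(a)).
rng = np.random.default_rng(0)
refs = rng.normal(size=5)                       # reference objects on the real line
w = rng.random(5)                               # the AdaBoost weights

F = lambda x: np.abs(x - refs)                  # F(x) = (F1(x), ..., Fd(x))
D = lambda u, v: np.sum(w * np.abs(u - v))      # weighted L1 distance

q, a, b = rng.normal(size=3)
H = sum(w[i] * (abs(F(q)[i] - F(b)[i]) - abs(F(q)[i] - F(a)[i])) for i in range(5))
print(np.isclose(H, D(F(q), F(b)) - D(F(q), F(a))))    # True
```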
Significance of the Proof
AdaBoost optimizes a direct measure of embedding quality. We have converted a database indexing problem into a machine learning problem.
Recap of the BoostMap Algorithm
Start with a large collection of 1D embeddings.
Each embedding defines a weak classifier on triples of objects.
AdaBoost combines many weak classifiers into a strong classifier.
The strong classifier defines an embedding and a weighted L1 distance.
The classifier is equivalent to the embedding plus the distance.
Basic Questions
F(x) = (D(x, LA), D(x, Denver), D(x, Boston))
What is a good way to optimize an embedding?
What are the best reference objects?
What distance should we use in R^3?
How Do We Use It? Filter-and-refine retrieval:
Offline step: compute the embedding F of the entire database.
Given a query object q:
Embedding step: compute F(q), the distances from the query to the reference objects.
Filter step: find the top p matches of F(q) in the vector space.
Refine step: measure the exact distance from q to each of the top p matches.
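A minimal sketch of filter-and-refine retrieval. The plain L1 distance in the filter step is a placeholder assumption; with BoostMap it would be the weighted L1 distance learned above. Names are illustrative.

```python
import numpy as np

def build_index(database, F):
    """Offline step: embed every database object once."""
    return np.array([F(x) for x in database])

def filter_and_refine(q, database, index, F, d_exact, p=10):
    Fq = F(q)                                      # embedding step
    embedded = np.abs(index - Fq).sum(axis=1)      # filter step: cheap (here L1) distances
    candidates = np.argsort(embedded)[:p]          # top-p matches in the vector space
    # Refine step: compute the exact (expensive) distance only for the p candidates.
    return min(candidates, key=lambda i: d_exact(q, database[i]))
```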
Evaluating Embedding Quality
How often do we find the true nearest neighbor?
(Embedding step: compute F(q). Filter step: find the top p matches of F(q) in the vector space. Refine step: measure the exact distance from q to the top p matches.)
Random Projections
Based on the Johnson-Lindenstrauss lemma: for any (sufficiently large) set S of M points in R^n and k = O(ε^-2 ln M), there exists a linear map f: R^n -> R^k such that
(1 - ε) D(u, v) < D(f(u), f(v)) < (1 + ε) D(u, v)  for all u, v in S.
A random projection is good with constant probability.
Random Projection: Application
Set k = O(ε^-2 ln M).
Select k random n-dimensional vectors (one approach: draw the entries i.i.d. from a Gaussian with mean 0 and variance 1, i.e. N(0, 1)).
Project the original points onto the k vectors.
The resulting k-dimensional space approximately preserves the distances with high probability.
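A short numpy sketch of this procedure; the constant used in choosing k and the 1/sqrt(k) scaling are illustrative choices, not taken from the slides.

```python
import numpy as np

def random_projection(X, eps=0.3, seed=0):
    """Project M points in R^n onto k = O(eps^-2 ln M) random Gaussian directions."""
    M, n = X.shape
    k = int(np.ceil(4 * np.log(M) / eps ** 2))    # the constant 4 is illustrative
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0, size=(n, k))         # k random N(0, 1) vectors
    return X @ R / np.sqrt(k)                     # scaling keeps distances comparable

X = np.random.randn(1000, 500)
Y = random_projection(X)    # pairwise distances in Y approximate those in X (w.h.p.)
```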
Database-Friendly Random Projection
For each point (vector) x in d dimensions we need to find its projection y in k dimensions. For n points, the naive approach needs n·d·k operations, which can be large for big datasets and dimensionalities.
A better approach [Achlioptas 2003]: create a matrix A such that
A[i, j] = +1 with probability 1/6, 0 with probability 2/3, -1 with probability 1/6,
and compute each y as y = x A.
Why is this better? Most entries of A are zero and the rest are ±1, so the product needs only additions and subtractions.
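A sketch of the database-friendly projection matrix; the sqrt(3/k) scaling is one common normalization and is an assumption here, not stated on the slide.

```python
import numpy as np

def achlioptas_matrix(d, k, seed=0):
    """Sparse projection matrix: entries +1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6.
    Two thirds of the entries are zero, so y = x A needs mostly additions/subtractions."""
    rng = np.random.default_rng(seed)
    A = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / k) * A      # scaling assumption, see lead-in note

x = np.random.randn(1000)            # one point in d = 1000 dimensions
A = achlioptas_matrix(1000, 50)
y = x @ A                            # its k = 50 dimensional projection
```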
Random Projection
A very useful technique, especially when used in conjunction with another technique (for example, SVD): use random projection to reduce the dimensionality from thousands to hundreds, then apply SVD to reduce it further.
Reference:
[Achlioptas 2003] Dimitris Achlioptas: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 2003.