1 Learning Embeddings for Similarity-Based Retrieval Vassilis Athitsos Computer Science Department Boston University
2 Overview Background on similarity-based retrieval and embeddings. BoostMap. Embedding optimization using machine learning. Query-sensitive embeddings. Ability to preserve non-metric structure.
3 Problem Definition. Database of n objects: x1, x2, x3, …, xn.
4 Problem Definition. Database of n objects: x1, …, xn, and a query q. Goal: find the k nearest neighbors of query q.
5 Problem Definition. Goal: find the k nearest neighbors of query q. Brute-force time is linear in: n (the size of the database), and the time it takes to measure a single distance.
7 Applications. Nearest neighbor classification. Similarity-based retrieval: image/video databases, biological databases, time series, web pages, browsing music or movie catalogs. Example domains pictured: faces, letters/digits, handshapes.
8 Expensive Distance Measures. Comparing d-dimensional vectors (x1, …, xd) and (y1, …, yd) is efficient: O(d) time.
9 Expensive Distance Measures. Comparing d-dimensional vectors is efficient: O(d) time. Comparing strings of length d with the edit distance is more expensive: O(d^2) time. Reason: alignment. Example: "immigration" vs. "imitation".
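The O(d^2) cost comes from filling a dynamic-programming table over all alignments of the two strings. A minimal sketch (standard Levenshtein edit distance; the slides do not specify the exact variant used):

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming: O(len(s) * len(t)) time."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]

print(edit_distance("immigration", "imitation"))  # prints 3
```

The d^2 table is exactly what makes each distance computation expensive, which is what motivates filtering with a cheap embedding.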
11 Matching Handwritten Digits
14 Shape Context Distance Proposed by Belongie et al. (2001). Error rate: 0.63%, with database of 20,000 images. Uses bipartite matching (cubic complexity!). 22 minutes/object, heavily optimized. Result preview: 5.2 seconds, 0.61% error rate.
15 More Examples. DNA and protein sequences: Smith-Waterman. Time series: Dynamic Time Warping. Probability distributions: Kullback-Leibler divergence. These measures are non-Euclidean, sometimes non-metric.
16 Indexing Problem Vector indexing methods NOT applicable. PCA. R-trees, X-trees, SS-trees. VA-files. Locality Sensitive Hashing.
17 Metric Methods Pruning-based methods. VP-trees, MVP-trees, M-trees, Slim-trees,… Use triangle inequality for tree-based search. Filtering methods. AESA, LAESA… Use the triangle inequality to compute upper/lower bounds of distances. Suffer from curse of dimensionality. Heuristic in non-metric spaces. In many datasets, bad empirical performance.
18 Embeddings. An embedding F maps the database objects x1, …, xn to vectors F(x1), …, F(xn) in R^d.
22 Embeddings. The embedding F maps the database x1, …, xn and the query q into R^d. Measure distances between vectors (typically much faster). Caveat: the embedding must preserve similarity structure.
25 Reference Object Embeddings. Pick reference objects r1, r2, r3 from the database. For any object x: F(x) = (D(x, r1), D(x, r2), D(x, r3)).
26 F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando)) F(Sacramento)....= ( 386, 1543, 2920) F(Las Vegas).....= ( 262, 1232, 2405) F(Oklahoma City).= (1345, 437, 1291) F(Washington DC).= (2657, 1207, 853) F(Jacksonville)..= (2422, 1344, 141)
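The table above can be produced mechanically: each object is mapped to its vector of distances to the chosen reference objects. A sketch, where `road_distance` is a stand-in for the black-box distance measure D (values copied from the slide):

```python
# Distances to the three reference objects, as listed on the slide.
road_distance = {
    ("Sacramento", "LA"): 386, ("Sacramento", "Lincoln"): 1543, ("Sacramento", "Orlando"): 2920,
    ("Las Vegas", "LA"): 262, ("Las Vegas", "Lincoln"): 1232, ("Las Vegas", "Orlando"): 2405,
    ("Oklahoma City", "LA"): 1345, ("Oklahoma City", "Lincoln"): 437, ("Oklahoma City", "Orlando"): 1291,
}

reference_objects = ("LA", "Lincoln", "Orlando")

def embed(x):
    """F(x) = (D(x, r1), D(x, r2), D(x, r3)) for reference objects r1, r2, r3."""
    return tuple(road_distance[(x, r)] for r in reference_objects)

print(embed("Sacramento"))  # (386, 1543, 2920)
```

Note the key property: two cities with similar distance vectors (e.g., Sacramento and Las Vegas) are indeed geographically close.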
27 Existing Embedding Methods FastMap, MetricMap, SparseMap, Lipschitz embeddings. Use distances to reference objects (prototypes). Question: how do we directly optimize an embedding for nearest neighbor retrieval? FastMap & MetricMap assume Euclidean properties. SparseMap optimizes stress. Large stress may be inevitable when embedding non-metric spaces into a metric space. In practice often worse than random construction.
28 BoostMap BoostMap: A Method for Efficient Approximate Similarity Rankings. Athitsos, Alon, Sclaroff, and Kollios, CVPR 2004. BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. Athitsos, Alon, Sclaroff, and Kollios, PAMI 2007 (to appear).
29 Key Features of BoostMap Maximizes amount of nearest neighbor structure preserved by the embedding. Based on machine learning, not on geometric assumptions. Principled optimization, even in non-metric spaces. Can capture non-metric structure. Query-sensitive version of BoostMap. Better results in practice, in all datasets we have tried.
30 Ideal Embedding Behavior. F maps the original space X into R^d. For any query q: we want F(NN(q)) = NN(F(q)).
33 Ideal Embedding Behavior. For any query q: we want F(NN(q)) = NN(F(q)). For any database object b besides NN(q), we want F(q) closer to F(NN(q)) than to F(b).
34 Embeddings Seen As Classifiers q a b For triples (q, a, b) such that: - q is a query object - a = NN(q) - b is a database object Classification task: is q closer to a or to b?
35 Any embedding F defines a classifier F’(q, a, b). F’ checks if F(q) is closer to F(a) or to F(b). q a b Embeddings Seen As Classifiers For triples (q, a, b) such that: - q is a query object - a = NN(q) - b is a database object Classification task: is q closer to a or to b?
36 Classifier Definition. Given an embedding F: X → R^d: F'(q, a, b) = ||F(q) – F(b)|| – ||F(q) – F(a)||. F'(q, a, b) > 0 means "q is closer to a." F'(q, a, b) < 0 means "q is closer to b." Triples (q, a, b) are such that: q is a query object, a = NN(q), and b is a database object. Classification task: is q closer to a or to b?
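Under a given embedding, the classifier is just a difference of two vector distances. A sketch, using the L1 norm (the specific norm here is an assumption; the slide only requires some vector distance):

```python
def l1(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

def f_prime(F, q, a, b):
    """F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
    Positive: the embedding says q is closer to a. Negative: closer to b."""
    return l1(F(q), F(b)) - l1(F(q), F(a))

# Toy 1D embedding: identity on numbers.
F = lambda x: (x,)
print(f_prime(F, 0.0, 1.0, 5.0))  # 4.0 > 0: q = 0 is judged closer to a = 1
```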
37 Key Observation. If classifier F' is perfect, then for every q, F(NN(q)) = NN(F(q)). If F(q) is closer to F(b) than to F(NN(q)), then the triple (q, a, b) is misclassified.
38 Key Observation. Classification error on triples (q, NN(q), b) measures how well F preserves nearest neighbor structure.
39 Goal: construct an embedding F optimized for k-nearest neighbor retrieval. Method: maximize accuracy of F’ on triples (q, a, b) of the following type: q is any object. a is a k-nearest neighbor of q in the database. b is in database, but NOT a k-nearest neighbor of q. If F’ is perfect on those triples, then F perfectly preserves k-nearest neighbors. Optimization Criterion
40 1D Embeddings as Weak Classifiers 1D embeddings define weak classifiers. Better than a random classifier (50% error rate).
41 Example: embedding cities (LA, Chicago, Detroit, New York, Cleveland, Lincoln) onto the real line by their distance to a reference city roughly preserves which cities are near each other.
42 1D Embeddings as Weak Classifiers 1D embeddings define weak classifiers. Better than a random classifier (50% error rate). We can define lots of different classifiers. Every object in the database can be a reference object.
43 1D Embeddings as Weak Classifiers 1D embeddings define weak classifiers. Better than a random classifier (50% error rate). We can define lots of different classifiers. Every object in the database can be a reference object. Question: how do we combine many such classifiers into a single strong classifier?
44 1D Embeddings as Weak Classifiers 1D embeddings define weak classifiers. Better than a random classifier (50% error rate). We can define lots of different classifiers. Every object in the database can be a reference object. Question: how do we combine many such classifiers into a single strong classifier? Answer: use AdaBoost. AdaBoost is a machine learning method designed for exactly this problem.
45 Using AdaBoost. AdaBoost chooses 1D embeddings F1, …, Fd (each mapping X to the real line) and weighs the corresponding classifiers. Output: H = w1 F'1 + w2 F'2 + … + wd F'd. Goal: achieve low classification error. AdaBoost trains on triples chosen from the database.
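A greatly simplified sketch of this training loop, under the assumption that each weak classifier is "distance to a single reference object r" (the paper also considers other 1D embeddings). The function names and the toy greedy selection are illustrative, not the authors' implementation:

```python
import math

def train_boostmap(D, database, triples, rounds=5):
    """AdaBoost-style selection of 1D embeddings F_r(x) = D(x, r).
    triples: (q, a, b) where a is known to be closer to q than b is."""
    weights = [1.0 / len(triples)] * len(triples)
    model = []  # list of (reference object r, classifier weight alpha)
    for _ in range(rounds):
        # Pick the reference object whose 1D classifier has lowest weighted error.
        def weighted_error(r):
            return sum(w for w, (q, a, b) in zip(weights, triples)
                       if abs(D(q, r) - D(b, r)) <= abs(D(q, r) - D(a, r)))
        r = min(database, key=weighted_error)
        err = min(max(weighted_error(r), 1e-9), 1 - 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((r, alpha))
        # AdaBoost re-weighting: boost the misclassified triples.
        weights = [w * math.exp(alpha if abs(D(q, r) - D(b, r)) <= abs(D(q, r) - D(a, r))
                                else -alpha)
                   for w, (q, a, b) in zip(weights, triples)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return model
```

The returned (r, alpha) pairs are exactly the 1D embeddings and weights that define H.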
46 From Classifier to Embedding. AdaBoost output: H = w1 F'1 + w2 F'2 + … + wd F'd. What embedding should we use? What distance measure should we use?
47 From Classifier to Embedding. AdaBoost output: H = w1 F'1 + w2 F'2 + … + wd F'd. BoostMap embedding: F(x) = (F1(x), …, Fd(x)).
48 From Classifier to Embedding. AdaBoost output: H = w1 F'1 + w2 F'2 + … + wd F'd. BoostMap embedding: F(x) = (F1(x), …, Fd(x)). Distance measure: D((u1, …, ud), (v1, …, vd)) = Σ_{i=1..d} wi |ui – vi|.
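The output distance is just a weighted Manhattan (L1) distance using the AdaBoost weights. A minimal sketch:

```python
def weighted_l1(w, u, v):
    """D(u, v) = sum_i w_i * |u_i - v_i| -- the BoostMap output distance."""
    return sum(wi * abs(ui - vi) for wi, ui, vi in zip(w, u, v))

w = (2.0, 1.0)
print(weighted_l1(w, (0.0, 0.0), (1.0, 3.0)))  # 2*1 + 1*3 = 5.0
```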
49 From Classifier to Embedding. BoostMap embedding: F(x) = (F1(x), …, Fd(x)). Distance measure: D((u1, …, ud), (v1, …, vd)) = Σ_{i=1..d} wi |ui – vi|. Claim: let q be closer to a than to b. H misclassifies the triple (q, a, b) if and only if, under distance measure D, F maps q closer to b than to a.
50 Proof. H(q, a, b) = Σ_{i=1..d} wi F'i(q, a, b) = Σ_{i=1..d} wi (|Fi(q) – Fi(b)| – |Fi(q) – Fi(a)|) = Σ_{i=1..d} (wi |Fi(q) – Fi(b)| – wi |Fi(q) – Fi(a)|) = D(F(q), F(b)) – D(F(q), F(a)) = F'(q, a, b).
56 Significance of Proof AdaBoost optimizes a direct measure of embedding quality. We optimize an indexing structure for similarity-based retrieval using machine learning. Take advantage of training data.
60 How Do We Use It? Filter-and-refine retrieval. Offline step: compute the embedding F of the entire database. Given a query object q: Embedding step: compute F(q), i.e., the distances from the query to the reference objects. Filter step: find the top p matches of F(q) in the vector space. Refine step: measure the exact distance from q to the top p matches.
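The steps above can be sketched as follows. Names and parameters are illustrative, and for clarity the database is embedded inside the function, although in practice (as the slide says) that step happens offline:

```python
def filter_and_refine(q, database, D, embed, d_embed, p, k=1):
    """Approximate k-NN of q: filter with the cheap embedded distance,
    then refine the top p candidates with the exact (expensive) distance D."""
    Fq = embed(q)                                  # embedding step
    f_db = [embed(x) for x in database]            # precomputed offline in practice
    order = sorted(range(len(database)), key=lambda i: d_embed(Fq, f_db[i]))
    candidates = order[:p]                         # filter step
    best = sorted(candidates, key=lambda i: D(q, database[i]))  # refine step
    return [database[i] for i in best[:k]]

# Toy example: objects are numbers, D is |x - y|, and the embedding is the
# vector of distances to two reference objects (0 and 9).
print(filter_and_refine(4.6, [0.0, 2.0, 5.0, 9.0],
                        lambda x, y: abs(x - y),
                        lambda x: (abs(x - 0.0), abs(x - 9.0)),
                        lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
                        p=2))  # [5.0]
```

Only p exact distances are computed per query instead of n, which is the source of the speed-ups reported later.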
61 Evaluating Embedding Quality Embedding step: Compute distances from query to reference objects F(q). Filter step: Find top p matches of F(q) in vector space. Refine step: Measure exact distance from q to top p matches. How often do we find the true nearest neighbor?
63 Evaluating Embedding Quality Embedding step: Compute distances from query to reference objects F(q). Filter step: Find top p matches of F(q) in vector space. Refine step: Measure exact distance from q to top p matches. How often do we find the true nearest neighbor? How many exact distance computations do we need?
66 Evaluating Embedding Quality Embedding step: Compute distances from query to reference objects F(q). Filter step: Find top p matches of F(q) in vector space. Refine step: Measure exact distance from q to top p matches. What is the nearest neighbor classification error? How many exact distance computations do we need?
67 Results on Hand Dataset. Chamfer distance: 112 seconds per query to find the nearest neighbor in a database of 80,640 images.
68 Results on Hand Dataset. Query set: 710 real images of hands. Database: 80,640 synthetic images of hands. Brute force: accuracy 100%, 80,640 distances, 112 seconds, speed-up 1.
69 Results on Hand Dataset. Query set: 710 real images of hands. Database: 80,640 synthetic images of hands. Accuracy is 100% for brute force and 95% for the other methods.
Brute force: 80,640 distances, 112 seconds, speed-up 1.
BM: 450 distances, 0.6 seconds, speed-up 179.
RLP: 1,444 distances, 2.0 seconds, speed-up 56.
FM: 2,647 distances, 3.7 seconds, speed-up 30.
VP: 5,471 distances, 7.6 seconds, speed-up 15.
70 MNIST: 60,000 database objects, 10,000 queries. Shape context (Belongie 2001): 0.63% error, 20,000 distances, 22 minutes. 0.54% error, 60,000 distances, 66 minutes. Results on MNIST Dataset
71 Results on MNIST Dataset. Method: distances per query, seconds per query, error rate.
Brute force: 60,000 distances, 3,696 seconds, 0.54%.
VP-trees: 21,152 distances, 1,306 seconds, 0.63%.
Condensing: 1,060 distances, 71 seconds, 2.40%.
BoostMap: 800 distances, 53 seconds, 0.58%.
Zhang 2003: 50 distances, 3.3 seconds, 2.55%.
BoostMap: 50 distances, 3.3 seconds, 1.50%.
BoostMap*: 50 distances, 3.3 seconds, 0.83%.
72 Query-Sensitive Embeddings Richer models. Capture non-metric structure. Better embedding quality. References: Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, SIGMOD 2005. Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, TODS, June 2007.
73 Capturing Non-Metric Structure A human is not similar to a horse. A centaur is similar both to a human and a horse. Triangle inequality is violated: Using human ratings of similarity (Tversky, 1982). Using k-median Hausdorff distance.
74 Capturing Non-Metric Structure. Mapping to a metric space presents a dilemma: if D(F(centaur), F(human)) = D(F(centaur), F(horse)) = C, then D(F(human), F(horse)) <= 2C. Query-sensitive embeddings have the modeling power to preserve non-metric structure.
75 Local Importance of Coordinates. How important is each coordinate in comparing embeddings? The embedding F maps each database object xi and the query q to d-dimensional vectors (xi1, …, xid) and (q1, …, qd).
76 F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando)) F(Sacramento)....= ( 386, 1543, 2920) F(Las Vegas).....= ( 262, 1232, 2405) F(Oklahoma City).= (1345, 437, 1291) F(Washington DC).= (2657, 1207, 853) F(Jacksonville)..= (2422, 1344, 141)
77 General Intuition. Classifier: H = w1 F'1 + w2 F'2 + … + wj F'j. Observation: the accuracy of a weak classifier depends on the query. F'1 is perfect for triples (q, a, b) where q = reference object 1. F'1 is good for queries close to reference object 1. Question: how can we capture that?
78 Query-Sensitive Weak Classifiers. V: area of influence (an interval of real numbers). QF,V(q, a, b) = F'(q, a, b) if F(q) is in V; "I don't know" if F(q) is not in V.
79 Query-Sensitive Weak Classifiers. If V includes all real numbers, then QF,V = F'.
80 Applying AdaBoost. AdaBoost forms classifiers QFi,Vi, where Fi is a 1D embedding and Vi is the area of influence for Fi. Output: H = w1 QF1,V1 + w2 QF2,V2 + … + wd QFd,Vd.
81 Applying AdaBoost. Empirical observation: at late stages of training, query-sensitive weak classifiers are still useful, whereas query-insensitive classifiers are not.
82 From Classifier to Embedding. AdaBoost output: H(q, a, b) = Σ_{i=1..d} wi QFi,Vi(q, a, b). What embedding should we use? What distance measure should we use?
83 From Classifier to Embedding. AdaBoost output: H(q, a, b) = Σ_{i=1..d} wi QFi,Vi(q, a, b). BoostMap embedding: F(x) = (F1(x), …, Fd(x)). Distance measure: D(F(q), F(x)) = Σ_{i=1..d} wi SFi,Vi(q) |Fi(q) – Fi(x)|.
84 From Classifier to Embedding. The distance measure is query-sensitive: a weighted L1 distance whose weights depend on q. SF,V(q) = 1 if F(q) is in V, and 0 otherwise. D(F(q), F(x)) = Σ_{i=1..d} wi SFi,Vi(q) |Fi(q) – Fi(x)|.
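A sketch of this query-sensitive distance, under the assumption that each area of influence Vi is stored as an interval (lo, hi):

```python
def query_sensitive_distance(Fq, Fx, weights, areas):
    """D(F(q), F(x)) = sum_i w_i * S_i(q) * |F_i(q) - F_i(x)|,
    where S_i(q) = 1 iff the i-th coordinate of F(q) lies inside V_i."""
    total = 0.0
    for fq, fx, w, (lo, hi) in zip(Fq, Fx, weights, areas):
        if lo <= fq <= hi:           # S_{F_i, V_i}(q) = 1
            total += w * abs(fq - fx)
    return total

# Coordinates outside their area of influence contribute nothing:
areas = [(0.0, 1.0), (0.0, 10.0)]
print(query_sensitive_distance((5.0, 5.0), (0.0, 0.0), (1.0, 1.0), areas))  # 5.0
```

Because the active set of coordinates depends on which argument plays the role of the query, this "distance" need not be symmetric, which is exactly how it escapes metric constraints.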
85 Centaurs Revisited. Reference objects: human, horse, centaur. For centaur queries, use weights (0, 0, 1). For human queries, use weights (1, 0, 0). Query-sensitive distances are non-metric. They combine the efficiency of the L1 distance with the ability to capture non-metric structure.
87 Recap of Advantages Capturing non-metric structure. Finding most informative reference objects for each query. Richer model overall. Choosing a weak classifier now also involves choosing an area of influence.
88 Dynamic Time Warping on Time Series. Query set: 1,000 time series. Database: 31,818 time series. Accuracy 95% for both methods.
Query-sensitive: 1,995 distances, 33 seconds per query, speed-up factor 16.
Query-insensitive: 5,691 distances, 95 seconds per query, speed-up factor 5.6.
89 Dynamic Time Warping on Time Series. Query set: 50 time series. Database: 32,768 time series. Accuracy 100% for both methods.
Query-sensitive: 640 distances, 10.7 seconds per query, speed-up factor 51.2.
Vlachos KDD 2003: over 6,500 distances, over 110 seconds per query, speed-up factor under 5.
90 BoostMap Recap - Theory Machine-learning method for optimizing embeddings. Explicitly maximizes amount of nearest neighbor structure preserved by embedding. Optimization method is independent of underlying geometry. Query-sensitive version can capture non-metric structure.
91 BoostMap Recap - Practice BoostMap can significantly speed up nearest neighbor retrieval and classification. Useful in real-world datasets: Hand shape classification. Optical character recognition (MNIST, UNIPEN). In all four datasets, better results than other methods. In three benchmark datasets, better than methods custom-made for those distance measures. Domain-independent formulation. Distance measures are used as a black box. Application to proteins/DNA matching…
92 END