APPROXIMATE NEAREST NEIGHBOR QUERIES WITH A TINY INDEX
Wei Wang, University of New South Wales
NICTA Machine Learning Research Group Seminar, 26 June 2014
Outline
- Overview of our research
- SRS: c-approximate nearest neighbor with a tiny index [PVLDB 2015]
- Conclusions
Research Projects
- Similarity query processing
- Keyword search on (semi-)structured data
- Graphs
- Succinct data structures
NN and c-ANN Queries
Definitions:
- A set of points D = ∪_{i=1..n} {O_i} in d-dimensional Euclidean space (d is large, e.g., hundreds).
- Given a query point q, an NN query finds the closest point O* in D.
- Relaxed version (c-ANN, aka (1+ε)-ANN): return any point whose distance to q is at most c · Dist(O*, q); a randomized algorithm may return such a point with at least constant probability.
- Example: for D = {a, b, x}, the NN is a, and x is a valid c-ANN answer.
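To make the definitions concrete, here is a minimal Python sketch (mine, not from the talk; the data is synthetic) of an exact NN query by linear scan and a c-ANN check:

```python
import numpy as np

def nearest_neighbor(D, q):
    """Exact NN by linear scan: O(dn) time."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return i, dists[i]

def is_c_ann(D, q, i, c):
    """O_i is a valid c-ANN answer iff Dist(O_i, q) <= c * Dist(O*, q)."""
    _, nn_dist = nearest_neighbor(D, q)
    return np.linalg.norm(D[i] - q) <= c * nn_dist

rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 256))   # n = 1000 points, d = 256 dims
q = rng.standard_normal(256)
idx, dist = nearest_neighbor(D, q)
print(idx, dist, is_c_ann(D, q, idx, c=2.0))  # the NN itself is trivially a c-ANN
```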
Applications and Challenges
Applications:
- Feature vectors in data mining and multimedia databases
- A fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression
- ...
Challenges:
- Curse of dimensionality / concentration of measure: it is hard to find algorithms sub-linear in n and polynomial in d
- Large data size: 1 KB for a single point with 256 dimensions
Existing Solutions
Exact NN:
- O(d^5 · log n) query time with O(n^{2d+δ}) space
- O(d · n^{1−ε(d)}) query time with O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space
(1+ε)-ANN:
- O(log n + 1/ε^{(d−1)/2}) query time with O(n · log(1/ε)) space
- Probabilistic tests remove the exponential dependency on d:
  - Fast JLT: O(d · log d + ε^{−3} · log² n) query time, O(n^{max(2, ε^{−2})}) space
  - LSH-based: Õ(d · n^{ρ+o(1)}) query time, Õ(n^{1+ρ+o(1)} + nd) space, with ρ = 1/(1+ε) + o_c(1)
Takeaways: LSH is the best approach using sub-quadratic space; linear scan is (practically) the best approach using linear space and time.
Approximate NN for Multimedia Retrieval
- Cover-tree, spill-tree
- Reduction to NN search under Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
Locality Sensitive Hashing (LSH)
Equality search:
- Index: store each o in bucket h(o)
- Query: retrieve every o in bucket h(q) and verify whether o = q
LSH: a family of hash functions h :: R^d → Z such that for every h in the family, Pr[h(q) = h(o)] decreases as Dist(q, o) grows (technically, the guarantee depends on a radius r).
Intuition: "near-by" points have a higher chance of colliding with q than "far-away" points; a concrete hash family is sketched below.
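For concreteness, this sketch implements one function from the 2-stable (Euclidean) LSH family of Datar et al., a standard instantiation of the property above; the talk does not commit to a particular family, and the radius r here is illustrative:

```python
import numpy as np

class E2LSH:
    """One function from the 2-stable Euclidean LSH family:
    h(o) = floor((a . o + b) / r), with a ~ N(0, I_d) and b ~ Uniform[0, r).
    Its collision probability decreases monotonically with Dist(q, o)."""
    def __init__(self, d, r, rng):
        self.a = rng.standard_normal(d)
        self.b = rng.uniform(0.0, r)
        self.r = r

    def __call__(self, o):
        return int(np.floor((self.a @ o + self.b) / self.r))

rng = np.random.default_rng(42)
h = E2LSH(d=256, r=4.0, rng=rng)
q = rng.standard_normal(256)
near = q + 0.01 * rng.standard_normal(256)
far = q + 10.0 * rng.standard_normal(256)
print(h(q) == h(near), h(q) == h(far))  # the near point is far more likely to collide
```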
LSH: Indexing & Query Processing
Index (for a fixed radius r):
- sig(o) = (h_1(o), h_2(o), …, h_k(o)); store o in bucket sig(o). Composite keys reduce query cost.
- Build L such tables (boosting, for constant success probability).
- Iteratively increase r.
Query:
- Search with a fixed r: retrieve and "verify" the points in bucket sig(q), repeating over all L tables.
- Use galloping search to find the first good r; this incurs additional cost and yields only a c² quality guarantee.
A table-based sketch follows below.
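Continuing the sketch above (and reusing the E2LSH class from it), a minimal in-memory version of the fixed-radius index and query could look like this; k and L are illustrative knobs:

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(D, k, L, r, rng):
    """L tables keyed by composite signatures of k hash functions each.
    Composite keys reduce query cost (fewer false positives); the L
    repetitions boost the success probability back up to a constant."""
    funcs = [[E2LSH(D.shape[1], r, rng) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for i, o in enumerate(D):
        for t in range(L):
            tables[t][tuple(h(o) for h in funcs[t])].append(i)
    return funcs, tables

def lsh_query(q, D, funcs, tables):
    """Retrieve every point colliding with q in some table, then verify."""
    cand = set()
    for t, table in enumerate(tables):
        cand.update(table.get(tuple(h(q) for h in funcs[t]), []))
    return min(cand, key=lambda i: np.linalg.norm(D[i] - q), default=None)

rng = np.random.default_rng(7)
D = rng.standard_normal((5000, 64))
funcs, tables = build_lsh_index(D, k=4, L=8, r=4.0, rng=rng)
print(lsh_query(D[0] + 0.01 * rng.standard_normal(64), D, funcs, tables))
```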
Locality Sensitive Hashing on External Memory
- Standard LSH solves c²-ANN by binary search over R(c^i, c^{i+1})-NN problems.
- LSB-forest [SIGMOD'09, TODS'10]: a different reduction from c²-ANN to a single R(c^i, c^{i+1})-NN problem; O((dn/B)^{0.5}) query cost, O((dn/B)^{1.5}) space.
- C2LSH [SIGMOD'12]: does not use composite hash keys; instead performs fine-granular counting of collisions across m LSH projections; O(n · log(n)/B) query cost, O(n · log(n)/B) space.
- SRS (ours): O(n/B) query cost, O(n/B) space.
Weakness of External-Memory LSH Methods
- Existing methods use super-linear space: thousands (or more) of hash tables are needed for rigorous guarantees, so practitioners resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval.
- They can only handle c = x² for integer x ≥ 2, to enable reusing the hash tables (merging buckets).
- Valuable information is lost due to quantization.
- Updates are problematic (changes to n and c).
Index size on the Audio dataset (40 MB): LSB-forest 1,500 MB; C2LSH 127 MB; SRS (ours) 2 MB.
SRS: Our Proposed Method
- Solves c-ANN queries in O(n) query time and O(n) space with constant probability; the constants hidden in the O(·) are very small.
- The early-termination condition is provably effective.
- Advantages: small index, rich functionality, simplicity.
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering, by modeling the distribution of m "stable random projections".
2-stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard normal distribution N(0, 1).
- If A ~ D and B ~ D are i.i.d., then x·A + y·B ~ (x² + y²)^{1/2} · D.
- Illustration: projecting a vector v onto directions r_1, r_2 with i.i.d. N(0, 1) entries gives dot products ⟨v, r_1⟩ and ⟨v, r_2⟩, each normally distributed with mean 0 and standard deviation ǁvǁ.
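A quick empirical check of the 2-stability property (my sketch, not from the talk): the projection of any fixed v onto an i.i.d. N(0, 1) direction has standard deviation exactly ǁvǁ:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(256)                     # an arbitrary fixed vector
proj = rng.standard_normal((50_000, 256)) @ v    # <v, r> for 50k random directions r
print(proj.std(), np.linalg.norm(v))             # the two values nearly coincide
```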
Dist(O) and ProjDist(O) and Their Relationship
- Apply m 2-stable random projections to map O in d dimensions to Proj(O) = (z_1, …, z_m) in m dimensions; with v = O − Q, each z_i ≔ ⟨v, r_i⟩ is normal with mean 0 and standard deviation ǁvǁ = Dist(O).
- Hence z_1² + … + z_m² ~ ǁvǁ² · χ²_m, a scaled chi-squared distribution with m degrees of freedom; equivalently, ProjDist²(O) / Dist²(O) ~ χ²_m, where ProjDist(O) = ǁProj(O) − Proj(Q)ǁ.
- Let Ψ_m(x) denote the cdf of the standard χ²_m distribution.
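A numerical check of this relationship and of Ψ_m using SciPy (my sketch; m = 6 matches the talk's running example):

```python
import numpy as np
from scipy.stats import chi2, kstest

m, d, trials = 6, 256, 10_000
rng = np.random.default_rng(2)
v = rng.standard_normal(d)                       # v = O - Q, so ||v|| = Dist(O)
z = np.stack([rng.standard_normal((m, d)) @ v for _ in range(trials)])
ratios = (z ** 2).sum(axis=1) / (v @ v)          # ProjDist^2(O) / Dist^2(O)
print(kstest(ratios, chi2(df=m).cdf).pvalue)     # large p-value: consistent with chi^2_m
print(chi2(df=m).cdf(4.0))                       # Psi_m(4.0), the cdf used in the analysis
```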
LSH-like Property
- Intuitive idea: if Dist(O_1) ≪ Dist(O_2), then ProjDist(O_1) < ProjDist(O_2) with high probability. But the converse is NOT true.
- With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects may be projected before the NN/c-NN objects.
- However, we can bound the expected number of such bad objects (say, by T).
- Solution: perform incremental kNN search in the projected space until T objects have been accessed, plus an early-termination test.
Indexing
Finding the minimum m:
- Input: n, c, and T ≔ the maximum number of points the algorithm may access
- Output: m, the number of 2-stable random projections, and T' ≤ T, a tighter bound on T
- m = O(n/T); we use T = O(n), so m = O(1) suffices for a linear-size index
Building the index (see the sketch below):
- Generate m 2-stable random projections, yielding n projected points in m-dimensional space
- Index the projections with any structure that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m · n) = O(n)
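A minimal sketch of the index construction, with an in-memory SciPy kd-tree standing in for the paper's disk-based R-tree (function and variable names are mine):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(D, m, rng):
    """Project the n d-dim points to m dims with a Gaussian (2-stable)
    matrix and index the m-dim projections for kNN search."""
    R = rng.standard_normal((D.shape[1], m))  # the m 2-stable projections
    P = D @ R                                 # n x m projected points: O(mn) = O(n) space
    return R, cKDTree(P)

rng = np.random.default_rng(3)
D = rng.standard_normal((100_000, 256))
R, tree = build_srs_index(D, m=6, rng=rng)
```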
SRS-αβ(T, c, p_τ)
1. Compute Proj(Q).
2. Perform incremental kNN search from Proj(Q), for k = 1 to T:   // stopping condition α
   a. Compute Dist(O_k).
   b. Maintain O_min = argmin_{1≤i≤k} Dist(O_i).
   c. If the early-termination test (c, p_τ) passes, BREAK.       // stopping condition β
3. Return O_min.
Main Theorem: SRS-αβ returns a c-NN point with probability p_τ − f(m, c), at O(n) I/O cost.
Example parameters: c = 4, d = 256, m = 6, T = 0.00242·n, B = 1024, p_τ = 0.18 yield index size 0.0059·n, query cost 0.0084·n, and success probability 0.13.
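The early-termination test appears on the slide only as an image. The sketch below therefore reconstructs it from the χ²_m fact above rather than quoting the paper: stop once Ψ_m((c · ProjDist(O_k) / Dist(O_min))²) > p_τ, i.e., once O_min is a c-ANN answer with probability above p_τ. It reuses the index built above, and a batch kd-tree query stands in for true incremental kNN search:

```python
import numpy as np
from scipy.stats import chi2

def srs_ab(q, D, R, tree, T, c, p_tau):
    """Sketch of SRS-alpha-beta; the beta test is my reconstruction,
    not the paper's verbatim formula."""
    m = R.shape[1]
    proj_dists, idxs = tree.query(q @ R, k=T)    # projected kNN, ascending ProjDist
    best, best_dist = None, np.inf
    for pd, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(idxs)):
        dist = np.linalg.norm(D[i] - q)          # stopping condition alpha: at most
        if dist < best_dist:                     # T points are ever accessed
            best, best_dist = int(i), dist
        if chi2(df=m).cdf((c * pd / best_dist) ** 2) > p_tau:
            break                                # stopping condition beta
    return best

q = rng.standard_normal(256)                     # reuses D, R, tree, rng from above
T = int(0.00242 * len(D)) + 1                    # the slide's running parameters
print(srs_ab(q, D, R, tree, T=T, c=4, p_tau=0.18))
```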
Variations of SRS-αβ(T, c, p_τ)
The same algorithm with one stopping condition disabled or retuned (see the parameterizations below):
1. SRS-α: stopping condition α only; better quality; query cost O(T).
2. SRS-β: stopping condition β only; best quality; query cost bounded by O(n); handles c = 1.
3. SRS-αβ(T, c', p_τ): runs the test with c' < c; better quality; query cost bounded by O(T).
All variants succeed with probability at least p_τ.
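In terms of the srs_ab sketch above, the variants are just parameterizations (my mapping, under the same assumptions as that sketch):

```python
# SRS-alpha: disable the beta test (a cdf can never exceed p_tau = 1.0); cost O(T).
ans_alpha = srs_ab(q, D, R, tree, T=T, c=4, p_tau=1.0)

# SRS-beta: disable the alpha cap by letting T = n; only the early-termination
# test stops the scan, so cost is bounded by O(n), and even c = 1 works.
ans_beta = srs_ab(q, D, R, tree, T=len(D), c=1, p_tau=0.18)

# SRS-alpha-beta(T, c', p_tau): keep the cap T but run the beta test with a
# smaller ratio c' < c for better quality, at cost still bounded by O(T).
ans_ab = srs_ab(q, D, R, tree, T=T, c=2, p_tau=0.18)
```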
Other Results
- SRS easily extends to top-k c-ANN queries (k > 1); no previous method gives any guarantee on the correctness of the returned results.
- We guarantee correctness with probability at least p_τ whenever SRS-αβ stops due to the early-termination condition, which happens for ≈100% of queries in practice (97% in theory).
Analysis
Stopping Condition α
Setup: call the NN point the "near" point and denote its distance by r; "far" points are points at distance greater than c · r. Because ProjDist²(o) / Dist²(o) ~ χ²_m, for any κ > 0:
- P1: Pr[the near point is projected before κ·r] ≥ Ψ_m(κ²)
- For any far point o: Pr[ProjDist(o) ≤ κ·r] ≤ Ψ_m(κ²/c²)
- P2: Pr[fewer than T far points are projected before κ·r] ≥ 1 − Ψ_m(κ²/c²) · n/T (by Markov's inequality)
Choose κ such that P1 + P2 − 1 > 0; this is feasible thanks to the good concentration bounds for χ²_m.
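The feasibility of choosing κ can be checked numerically; a small sketch (mine) with the slide's running parameters m = 6, c = 4, T = 0.00242·n:

```python
from scipy.stats import chi2

def success_lower_bound(kappa, m, c, n, T):
    """P1 + P2 - 1, with P1 = Psi_m(kappa^2) and, via Markov's inequality,
    P2 = 1 - Psi_m(kappa^2 / c^2) * n / T."""
    P1 = chi2(df=m).cdf(kappa ** 2)
    P2 = 1.0 - chi2(df=m).cdf(kappa ** 2 / c ** 2) * n / T
    return P1 + P2 - 1.0

n = 1_000_000
T = int(0.00242 * n)
best = max((success_lower_bound(k / 100, 6, 4, n, T), k / 100)
           for k in range(50, 1000))
print(best)   # a positive maximum confirms a feasible kappa exists
```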
Choosing κ
[Figure: the two scaled χ²_m densities for c = 4, with scales 4 (blue, near point) and 4·c² = 64 (red, far points); the mode of χ²_m is at m − 2, and the threshold κ·r separates the two curves.]
ProjDist(O_T): Case I
Condition on both P1 and P2 holding (regarding the near and far points), which happens with probability at least P1 + P2 − 1. Case I: ProjDist(O_T) ≥ κ·r. The near point is projected before κ·r, so it is among the T accessed points, and O_min is the exact NN point.
ProjDist(O_T): Case II
Case II: ProjDist(O_T) < κ·r. All T accessed points are projected before κ·r, and fewer than T of them are far points, so at least one accessed point lies within distance c · r; hence O_min is a c-NN point.
Early-termination Condition (β)
- We omit the proof here; it also relies on the fact that the sum of the m squared projected distances follows a scaled χ²_m distribution.
- This condition is key to handling the case c = 1: SRS returns the NN point with guaranteed probability, which is impossible for LSH-based methods.
- It also guarantees the correctness of the top-k c-ANN points whenever the algorithm stops due to this condition; no previous method offers such a guarantee.
Experiment Setup
- Algorithms: LSB-forest [SIGMOD'09, TODS'10], C2LSH [SIGMOD'12], SRS-* [PVLDB 2015]
- Data: see the next slide
- Measures: index size, query cost, result quality, success probability
Datasets
[Dataset table lost in transcription; the surviving figures are 5.6 PB, 369 GB, and 16 GB.]
Tiny Image Dataset (8M points, 384 dims)
- Fastest: SRS-αβ; slowest: C2LSH. Result quality ranks the other way around.
- SRS-α achieves quality comparable to C2LSH at a much lower cost.
- SRS-* dominates LSB-forest.
Approximate Nearest Neighbor
Empirically better than the theoretical guarantee:
- with 15% of the I/Os of a linear scan, SRS returns the NN with probability 71%
- with 62% of the I/Os of a linear scan, it returns the NN with probability 99.7%
Large Dataset (0.45 Billion Points)
Summary
- Reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in low-dimensional space.
- Our index size is approximately m/d of the size of the data file.
- Opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space.
Q&A
Similarity Query Processing project homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html