The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K-List Similarity Search. Evica Milchevski, Avishek Anand and Sebastian Michel.


1 The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K-List Similarity Search. Evica Milchevski (University of Kaiserslautern), Avishek Anand (L3S Research Center) and Sebastian Michel (University of Kaiserslautern). Contact: milchevski@cs.uni-kl.de

2 Dating Portal (illustration) [1] [2]

3 Motivation — users each provide a ranked top-5 list of preferences (figure: two example rankings).

4 Problem Overview. Task: efficiently retrieve all rankings that are close to a query ranking q.

5 Formal Problem Statement
- Set of rankings T = {r1, r2, ..., rn}; Dri denotes the domain of ri
- Distance function d(ri, rj)
- Query ranking q and threshold θ
- Result: all rankings ri with d(q, ri) ≤ θ
- Top-k list = ranking

6 Footrule Distance for Top-k Lists [Fagin et al. '03, SIAM J. Discrete Math]
- For top-k rankings, r1(i) denotes the rank of item i in r1; an item not in the other list's domain is assigned rank k+1 there (i.e., k+1 when i is not in D2, and vice versa)
- F(r1, r2) = sum over all items i of |r1(i) - r2(i)|
- Example (k = 5): r1 = (i1, i2, i3, i4, i7) and r2 = (i3, i2, i4, i9, i1); i7 is not in D1∩D2's overlap (i7 ∉ D2, i9 ∉ D1), so both get rank 6, giving F(r1, r2) = 4 + 0 + 2 + 1 + 1 + 2 = 10
- The Footrule distance is a metric
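The generalized Footrule distance above can be transcribed directly into Python (a minimal sketch; the function and variable names are ours, not the paper's):

```python
def footrule(r1, r2, k):
    """Generalized Spearman's Footrule for top-k lists (Fagin et al. '03).

    r1, r2: lists of items in rank order (position 0 = rank 1).
    Items missing from the other list are assigned rank k + 1.
    """
    rank1 = {item: pos + 1 for pos, item in enumerate(r1)}
    rank2 = {item: pos + 1 for pos, item in enumerate(r2)}
    items = set(rank1) | set(rank2)
    return sum(abs(rank1.get(i, k + 1) - rank2.get(i, k + 1)) for i in items)

# The example from the slide:
r1 = ["i1", "i2", "i3", "i4", "i7"]
r2 = ["i3", "i2", "i4", "i9", "i1"]
footrule(r1, r2, k=5)  # -> 10
```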

7 Outline
- Motivation and Introduction
- Problem Statement and Basic Indexing
- Hybrid Index: Inverted Index vs. Metric Index
- Cost Model for Automated Performance Tuning
- Experimental Evaluation
- Conclusion and Outlook

8 How can we approach the problem?

9 Rankings as Sets — Treat the rankings as plain sets of items and build an inverted index: for each item, a posting list of the rankings containing it. Querying the index efficiently finds all candidate rankings; the distance function is then computed for each candidate.
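The filter step above can be sketched as follows (a minimal illustration; identifiers are ours):

```python
from collections import defaultdict

def build_inverted_index(rankings):
    """rankings: dict mapping ranking id -> list of items in rank order."""
    index = defaultdict(set)
    for rid, items in rankings.items():
        for item in items:
            index[item].add(rid)
    return index

def candidates(index, q):
    """All rankings sharing at least one item with the query ranking q."""
    result = set()
    for item in q:
        result |= index.get(item, set())
    return result

rankings = {"r1": ["i1", "i2"], "r2": ["i3", "i4"], "r3": ["i2", "i5"]}
idx = build_inverted_index(rankings)
candidates(idx, ["i2", "i9"])  # -> {"r1", "r3"}; r2 has no overlap and is never touched
```

Each candidate would then be validated with the Footrule distance, which is exactly the drawback the talk addresses next.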

10 Metric Index Structures — The Footrule distance is a metric, so the rankings can be organized in a metric index structure (figure: rankings grouped around pivots), which reduces the search space.

11 Drawbacks
- Inverted index: every candidate ranking still has to be validated by computing its distance to q
- Metric index structures (e.g., the BK-tree): perform even worse

12 Advantages
- Inverted index: efficiently filters out rankings having no overlap with q
- Metric index structures: distances are precomputed at construction time, and the triangle inequality efficiently prunes the search space
- Idea: combine the advantages of both approaches

13 Approach: Coarse Index — The rankings are partitioned by a metric index structure using a partitioning threshold θc: each partition is represented by a medoid and holds the rankings within distance θc of it. An inverted index is then built over the medoids only, with posting lists mapping items i1, ..., il to the medoids that contain them.
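One simple way to realize such a construction is a greedy, leader-style clustering; this is our simplification for illustration, not necessarily the paper's exact procedure:

```python
from collections import defaultdict

def build_coarse_index(rankings, theta_c, dist):
    """Greedy sketch: assign each ranking to the first medoid within
    theta_c; otherwise open a new partition with the ranking as medoid."""
    partitions = {}  # medoid id -> ids of member rankings
    for rid, r in rankings.items():
        for mid in partitions:
            if dist(rankings[mid], r) <= theta_c:
                partitions[mid].append(rid)
                break
        else:  # no medoid is close enough: rid becomes a new medoid
            partitions[rid] = [rid]
    # Inverted index over the items of the medoids only.
    inv = defaultdict(set)
    for mid in partitions:
        for item in rankings[mid]:
            inv[item].add(mid)
    return partitions, inv

# Toy usage with set symmetric difference as a stand-in distance:
sym_diff = lambda a, b: len(set(a) ^ set(b))
rankings = {"r1": ["a", "b"], "r2": ["a", "b"], "r3": ["x", "y"]}
parts, inv = build_coarse_index(rankings, theta_c=1, dist=sym_diff)
# parts -> {"r1": ["r1", "r2"], "r3": ["r3"]}
```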

14 Approach: Coarse Index — Benefits: use the filtering power of the inverted index, then validate the remaining rankings using the metric index.

15 Coarse Index: Querying — Input: q and θ; result: all rankings r1, ..., rm with F(q, ri) ≤ θ. The inverted index is probed with the items of q to find candidate medoids, and every partition whose medoid lies within θ + θc of q is validated. The enlarged radius θ + θc avoids missing results with d ≤ θ that are represented by a medoid with d > θ.
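The query procedure can be sketched as below (our simplified rendering; the enlarged radius θ + θc follows from the triangle inequality):

```python
def query_coarse(q, theta, theta_c, partitions, inv, rankings, dist):
    """Probe the inverted index with q's items, keep partitions whose
    medoid is within theta + theta_c of q, then validate each member."""
    cand_medoids = set()
    for item in q:
        cand_medoids |= inv.get(item, set())
    results = set()
    for mid in cand_medoids:
        # Triangle inequality: a member within theta of q cannot sit in a
        # partition whose medoid is farther than theta + theta_c from q.
        if dist(rankings[mid], q) <= theta + theta_c:
            results |= {rid for rid in partitions[mid]
                        if dist(rankings[rid], q) <= theta}
    return results

# Toy usage with set symmetric difference as a stand-in distance:
sym_diff = lambda a, b: len(set(a) ^ set(b))
rankings = {"r1": ["a", "b"], "r2": ["a", "c"], "r3": ["x", "y"]}
partitions = {"r1": ["r1", "r2"], "r3": ["r3"]}
inv = {"a": {"r1"}, "b": {"r1"}, "x": {"r3"}, "y": {"r3"}}
query_coarse(["a", "b"], 2, 2, partitions, inv, rankings, sym_diff)  # -> {"r1", "r2"}
```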

16 Performance Sweet Spot — θc controls the trade-off between the inverted index and the metric index: a large θc yields few, large partitions (cheap filtering, expensive validation), while a small θc yields many small partitions and the index degenerates towards a plain inverted index. Which θc results in the best performance?

17 Cost Model — We estimate:
1. the cost for querying the inverted index (filtering cost), from the sizes of the posting lists
2. the cost for validating the partitions (validation cost), from the sizes of the partitions and the number of partitions to be queried
Assumption: the distribution of pairwise distances is known. Which θc results in the best performance?

18 Cost for Validating the Partitions — The number of partitions to be queried is the number of medoids that capture the candidate rankings; it is estimated via the coupon collector problem [Flajolet et al. '92]. [3]
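The occupancy-style estimate behind this slide can be illustrated as follows; assuming, as a simplification of ours, that candidates fall uniformly and independently into the partitions:

```python
def expected_partitions_queried(m, n):
    """Expected number of distinct partitions hit when n candidate
    rankings fall into m partitions, under a uniform-occupancy model
    (coupon-collector family of estimates): m * (1 - (1 - 1/m)^n)."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)

expected_partitions_queried(100, 1)     # ≈ 1: a single candidate hits one partition
expected_partitions_queried(100, 1000)  # close to 100: nearly every partition is hit
```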

19 Cost Model — With the estimated filtering and validation costs we can find the sweet spot θc.
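Given cost estimates for a set of candidate thresholds, finding the sweet spot reduces to a minimization; the sketch below uses a toy cost model (the shapes are illustrative only and do not reproduce the paper's actual estimators):

```python
def find_sweet_spot(theta_c_candidates, cost_model):
    """Pick the partitioning threshold minimizing the estimated total cost.
    cost_model(theta_c) -> (filtering_cost, validation_cost)."""
    return min(theta_c_candidates, key=lambda t: sum(cost_model(t)))

# Toy cost model: filtering gets cheaper and validation more expensive
# as theta_c grows, so the total cost has an interior minimum.
toy = lambda t: (1.0 / t, t * t)
find_sweet_spot([i / 10 for i in range(1, 11)], toy)  # -> 0.8
```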

20 Inverted Index Access & Optimizations — For a query threshold θ and rankings of size k, a ranking r must share at least w items with q; otherwise even the lowest achievable distance L(r, q) exceeds θ.

21 Pruning by Query-Ranking Overlap — Example: for θ = 0.2 and k = 5, the resulting rankings must have an overlap of at least 4 items with q. This resembles the idea of prefix-filtering methods [Wang et al. '12].
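The overlap-based pruning can be sketched as below. Note that the simple lower bound used here (only the non-shared items contribute, each placed as low as possible) may be looser than the bound behind the slide's example, so the exact minimum overlap it computes can differ from the slide's:

```python
def footrule_lower_bound(overlap, k):
    """Lower bound on the (unnormalized) footrule distance of two top-k
    lists sharing `overlap` items: each of the k - overlap non-shared
    items per list contributes at least k + 1 minus its best rank."""
    m = k - overlap
    return m * (m + 1)  # = 2 * (1 + 2 + ... + m)

def min_required_overlap(k, theta_abs):
    """Smallest overlap w whose lower bound still permits a distance of at
    most theta_abs; rankings sharing fewer than w items can be pruned."""
    for w in range(k + 1):
        if footrule_lower_bound(w, k) <= theta_abs:
            return w
    return k
```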

22 Experiments: Datasets
- New York Times (NYT): 1 million rankings, generated by executing keyword queries on the NYT corpus
- Yago: 25,000 rankings, created from the facts in Yago [Ilieva et al. '13, CIKM], e.g., buildings located in New York, ranked by height

23 Experiments: Algorithms
- Baseline approaches: Filter and Validate (F&V), Merge of Id-Sorted Lists (ListMerge)
- Competitors: AdaptSearch [Wang et al. '12, SIGMOD], Minimal Filter and Validate (Minimal F&V)
- Coarse Index (Coarse)
- Coarse Index + Dropping index lists (Coarse+Drop)
- Filter and Validate + Dropping index lists (F&V+Drop)
- Blocked access + Pruning (Blocked+Prune)
- Blocked access + Pruning + Dropping index lists (Blocked+Prune+Drop)

24 Experimental Setup — All algorithms are implemented in Java and run in main memory. We report the wall-clock time needed for processing 1000 queries; d and θ are normalized to [0, 1].

25 Validity of the Theoretical Cost Model — Query threshold θ = 0.2, k = 10; the average difference between estimated and measured cost is 14.82 ms.

26 Validity of the Theoretical Cost Model — Query threshold θ = 0.2, k = 10; the average difference is 2.02 ms.

27 Performance of Algorithms (NYT) — The Coarse+Drop index outperforms the competitor by at least a factor of 34.

28 Performance of Algorithms (Yago) — The Coarse+Drop index again outperforms the competitor.

29 Conclusion
- A new hybrid index structure for similarity search over top-k lists, tunable between an inverted index and a metric index, with a cost model for finding the sweet spot
- Optimizations over the inverted index
- The presented hybrid index beats the competitor AdaptSearch

30 References
[Ilieva et al. '13] E. Ilieva, S. Michel, and A. Stupar. The Essence of Knowledge (Bases) through Entity Rankings. CIKM, 2013.
[Fagin et al. '03] R. Fagin, R. Kumar, and D. Sivakumar. Comparing Top-k Lists. SIAM J. Discrete Math., 17(1), 2003.
[Flajolet et al. '92] P. Flajolet, D. Gardy, and L. Thimonier. Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search. Discrete Applied Mathematics, 39(3), 1992.
[Wang et al. '12] J. Wang, G. Li, and J. Feng. Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search. SIGMOD, 2012.

31 Image References
[1] http://www.clipartpanda.com/clipart_images/business-people-group-in-6212621
[2] http://kkelley.blogspot.de/2011/11/allowing-god-to-take-control.html
[3] http://www.bbc.com/news/magazine-27051215

