Published by Gwen Parrish. Modified over 9 years ago.
1
Efficient Exact Similarity Searches using Multiple Token Orderings
Jongik Kim (1) and Hongrae Lee (2)
(1) Chonbuk National University, South Korea
(2) Google Inc.
2
Introduction
Similarity search is important in many applications:
  Data cleaning
  Record linkage
  Near duplicate detection
  Query refinement
The focus of our work is the efficient evaluation of similarity queries.
  Many applications invoke queries simultaneously.
  Applications usually require fast response times.
  We therefore need to evaluate similarity queries efficiently.
[Figure: simultaneous query requests, including a query with a typo ("angty bird")]
3
Problem Definition
Given a collection of strings (e.g. the names Bill Gates, Linus Torvalds, Steven P. Jobs, Dennis Ritchie, …) and a search query q (e.g. "Steve Jobs"), output each string s that satisfies sim(q, s) ≥ α.
How do we measure the similarity between two strings?
1. Convert each string into a record, where a record is a set of tokens.
  Tokenize each string into a token set containing all q-gram tokens of the string.
  q-gram: a substring of length q of a string
  TS("steve") = {ste, tev, eve} and TS("steven") = {ste, tev, eve, ven}
2. Count the number of common tokens between the two records (or token sets).
The overlap similarity sim("steve", "steven") is defined as |TS("steve") ∩ TS("steven")|.
Why do we use the overlap similarity? It supports many other similarity measures,
e.g. J(x, y) ≥ t ⟺ O(x, y) ≥ t(|x| + |y|)/(1 + t)
J: Jaccard similarity, O: overlap similarity
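The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `qgrams`, `overlap` and `jaccard_to_overlap` are hypothetical and not taken from the slides. The threshold conversion follows from J = O/(|x| + |y| − O).

```python
import math

def qgrams(s, q=3):
    """Return the set of all length-q substrings of s (the token set TS)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(x, y):
    """Overlap similarity: number of tokens shared by two token sets."""
    return len(x & y)

def jaccard_to_overlap(t, size_x, size_y):
    """Smallest integer overlap O such that O / (|x| + |y| - O) >= t.

    The small epsilon guards against floating-point noise just above an
    integer; it is an implementation detail assumed here, not from the slides.
    """
    return math.ceil(t * (size_x + size_y) / (1 + t) - 1e-9)

ts_steve = qgrams("steve")      # {ste, tev, eve}
ts_steven = qgrams("steven")    # {ste, tev, eve, ven}
sim = overlap(ts_steve, ts_steven)
needed = jaccard_to_overlap(0.75, len(ts_steve), len(ts_steven))
```

Here the Jaccard similarity of the two token sets is 3/4, so a Jaccard threshold of 0.75 translates into a required overlap equal to `sim`.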
4
Inverted Lists based Approach
ID   String    Record (token set)
1    area      {ar, re, ea}
2    artisan   {ar, rt, ti, is, sa, an}
3    artist    {ar, rt, ti, is, st}
4    tisk      {ti, is, sk}
…    …         …
Make inverted lists:
  ar: 1, 2, 3
  re: 1
  ea: 1
  rt: 2, 3
  ti: 2, 3, 4
  is: 2, 3, 4
  sa: 2
  an: 2
  st: 3
  sk: 4
Query: "artist" = {ar, rt, ti, is, st}, overlap threshold: 4
Merge the inverted lists of the query tokens to count the occurrences of each record ID.
Answers of the query: 2 ("artisan") and 3 ("artist")
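The merge-and-count evaluation on this slide can be sketched as follows, using the four example records and 2-gram tokens. The helper names (`qgrams`, `search`) are assumptions made for illustration, and a plain counter stands in for whatever list-merging algorithm a real system would use.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Token set of s: all substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

records = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}

# Build the inverted lists: token -> sorted list of record IDs.
index = defaultdict(list)
for rid in sorted(records):
    for tok in qgrams(records[rid]):
        index[tok].append(rid)

def search(query, alpha):
    """Return IDs of records sharing at least alpha tokens with the query."""
    counts = Counter()
    for tok in qgrams(query):
        counts.update(index.get(tok, []))   # one occurrence per matching list
    return sorted(rid for rid, c in counts.items() if c >= alpha)

answers = search("artist", 4)
```

Every inverted list of the query is scanned in full, which is the cost that prefix filtering tries to avoid.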
5
Prefix Filtering based Approach
Query q = "artist" = {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query:
  ar: 1, 2, 3
  is: 2, 3, 4
  rt: 2, 3
  st: 3
  ti: 2, 3, 4
Sort the tokens by their document frequencies, i.e. sort the lists by their sizes (document frequency ordering): st, rt, ar, is, ti
Prefix lists: the first |TS(q)| – α + 1 lists (here st and rt)
Suffix lists: the remaining α – 1 lists (here ar, is and ti)
Filtering phase (the prefix filtering)
  Merge the prefix lists to generate candidates (here records 2 and 3).
Verification phase
  Search the suffix lists for each candidate.
  Each candidate is looked up in each suffix list to check whether the list contains it.
  Binary search is used because suffix lists are usually very long.
  Here the final counts are 4 for record 2 and 5 for record 3, so both are answers.
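The filtering and verification phases can be sketched as below for the example query. This is a minimal sketch under the assumption that every query token has an inverted list (possibly empty); the names `contains` and `prefix_filter_search` are hypothetical.

```python
from bisect import bisect_left

# Inverted lists of the query tokens of "artist" (from the example data).
lists = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4],
         "is": [2, 3, 4], "st": [3]}

def contains(sorted_ids, rid):
    """Binary search for rid in a sorted inverted list."""
    i = bisect_left(sorted_ids, rid)
    return i < len(sorted_ids) and sorted_ids[i] == rid

def prefix_filter_search(lists, alpha):
    # Document frequency ordering: shortest lists first.
    ordered = sorted(lists.values(), key=len)
    n_prefix = len(ordered) - alpha + 1
    prefix, suffix = ordered[:n_prefix], ordered[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = {}
    for lst in prefix:
        for rid in lst:
            counts[rid] = counts.get(rid, 0) + 1
    candidates = sorted(counts)

    # Verification phase: probe each suffix list for each candidate.
    answers = []
    for rid in candidates:
        total = counts[rid] + sum(contains(lst, rid) for lst in suffix)
        if total >= alpha:
            answers.append(rid)
    return candidates, answers

candidates, answers = prefix_filter_search(lists, 4)
```

A record that appears in none of the prefix lists can occur in at most α – 1 lists, so it can never reach the threshold and is safely skipped.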
6
Document Frequency Ordering
Query q = "artist" = {ar, rt, ti, is, st} and overlap threshold α = 4
Sort the tokens by their document frequencies.
Prefix lists: the first |TS(q)| – α + 1 lists
Suffix lists: the remaining α – 1 lists
General goal: minimize the number of candidates by making use of the document frequency ordering.
We can reduce
1. the time for merging, because the prefix lists are short
2. the number of candidates, and hence the time for verifying candidates
[Figure: the inverted lists of the query under two token orderings, with the candidates produced by each]
7
Our Observation
Query q = {w1, w2} and overlap threshold α = 2
Without partitioning: w2 is the prefix list, and the number of candidates is 5.
With partitioning: in one partition w2 is the prefix list and the number of candidates is 0; in the other partition w1 is the prefix list and the number of candidates is 0. The total number of candidates is 0.
Our observation
  By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
  We evaluate a query in each partition and take the union of the results.
  We can reduce the number of candidates by utilizing different token orderings among partitions.
  Because partitions have different token orderings, we need to sort the tokens of a query record in each partition.
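The effect can be reproduced with a small, made-up data set. The sketch below assumes six records containing only w1 and five containing only w2; the record contents and the helper `num_candidates` are hypothetical and chosen only to match the candidate counts quoted on the slide.

```python
def num_candidates(records, query, alpha):
    """Count prefix-filtering candidates for `query` within one set of records."""
    lists = [[rid for rid, toks in records.items() if w in toks] for w in query]
    lists.sort(key=len)                      # per-partition document frequency ordering
    n_prefix = len(query) - alpha + 1
    return len(set().union(*lists[:n_prefix]))

# Hypothetical data: records 1-6 contain w1, records 7-11 contain w2.
records = {rid: {"w1"} for rid in range(1, 7)}
records.update({rid: {"w2"} for rid in range(7, 12)})
query, alpha = ["w1", "w2"], 2

unpartitioned = num_candidates(records, query, alpha)

# Partition so that each token is frequent in one partition and absent in the other.
p1 = {rid: t for rid, t in records.items() if "w1" in t}
p2 = {rid: t for rid, t in records.items() if "w1" not in t}
partitioned = num_candidates(p1, query, alpha) + num_candidates(p2, query, alpha)
```

In each partition the token that is absent there becomes the (empty) prefix list, so no candidates are generated at all.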
8
Generalization of the Observation
I(w): the inverted list of w, wp: a prefix token, ws: a suffix token
Query q = reaby = {re, ea, ab, by} = {w1, w3, ab, by}, overlap threshold α = 2
Without partitioning: the prefix list is w3, # of candidates: 5
Grouping records in I(wp) into P, the number of candidates is reduced by at least |I(wp)| – |I(ws) ∩ P|.
  In P1, the prefix list is w1, # of candidates: 2
  In P2, the prefix list is w3, # of candidates: 0
Grouping records in I(ws) into P, the number of candidates is reduced by at least |I(wp) – P|.
  In P1, the prefix list is w3, # of candidates: 2
  In P2, the prefix list is w1, # of candidates: 0
By grouping records containing a token w into a partition, we can benefit queries containing w.
9
Pivot Set & Partitioning
By grouping records containing a token w into a partition, we can benefit queries containing w.
A pivot set S is a set of tokens such that grouping I(wi) into one partition does not affect grouping I(wj) into another partition.
We can benefit queries containing wi as well as queries containing wj.
[Figure: tokens w1 to w7 linked to the records r1 to r15 that contain them, with partitions P1, P2 and P3]
There are many pivot sets:
  S1 = {w1, w3}
  S2 = {w2, w3, w4}
  S3 = {w3, w5}
  S4 = {w5, w6}
  S5 = {w2, w6}
  S6 = {w3, w7}
Orphan record: randomly select its partition.
Questions:
1. Existence of pivot sets
2. Selection of a good pivot set
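A pivot-set partitioning step might look like the sketch below. It rests on two assumptions that the slide does not spell out: that the inverted lists of the pivot tokens are disjoint (so a record contains at most one pivot token), and that an orphan record is one containing no pivot token. The function name and the sample records are hypothetical.

```python
import random

def partition_by_pivots(records, pivots, rng):
    """Assign each record to the partition of the pivot token it contains.

    records: dict of record ID -> token set
    pivots:  list of pivot tokens, one partition per token
    rng:     random.Random instance used for orphan records
    """
    partitions = [[] for _ in pivots]
    for rid, toks in records.items():
        hits = [i for i, w in enumerate(pivots) if w in toks]
        if hits:
            partitions[hits[0]].append(rid)          # group I(w) together
        else:
            # Orphan record (assumed: contains no pivot token): random partition.
            partitions[rng.randrange(len(pivots))].append(rid)
    return partitions

# Hypothetical records; tokens x, y, z are non-pivot tokens.
records = {1: {"w1", "x"}, 2: {"w1"}, 3: {"w3", "y"}, 4: {"z"}}
parts = partition_by_pivots(records, ["w1", "w3"], random.Random(0))
```

Record 4 is the only orphan here; whichever partition it lands in, the records of I(w1) and I(w3) stay grouped.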
10
Relaxation of a Pivot Set
※ JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)
A pivot set S is a set of tokens such that for any two tokens wi and wj in S, JC(I(wi), I(wj)) ≤ β.
If JC(S1, S2) = 0.1 and S1 is the smaller set, 10% of S1 is shared with S2 and 90% of S1 is not; less than 10% of S2 is shared with S1 and more than 90% of S2 is not.
If β = 0.2, the set S = {w2, w3, w4} is a pivot set.
[Figure: tokens w1 to w7 linked to the records r1 to r15 that contain them]
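The relaxed definition translates directly into a pairwise check. In the sketch below, `jc` implements the JC measure from the slide, while `is_pivot_set` and the sample inverted lists are hypothetical; treating an empty list as having JC 0 is an assumption made to avoid division by zero.

```python
from itertools import combinations

def jc(s1, s2):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|); 0 if either set is empty (assumption)."""
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))

def is_pivot_set(tokens, inverted, beta):
    """True if every pair of tokens has JC of their inverted lists <= beta."""
    return all(jc(inverted[a], inverted[b]) <= beta
               for a, b in combinations(tokens, 2))

# Hypothetical inverted lists (sets of record IDs).
inverted = {
    "a": set(range(1, 11)),     # records 1..10
    "b": set(range(10, 20)),    # records 10..19, shares one record with "a"
    "c": {1, 2, 3, 4, 5, 30},   # shares five records with "a"
}
ok = is_pivot_set(["a", "b"], inverted, 0.2)
not_ok = is_pivot_set(["a", "b", "c"], inverted, 0.2)
```

With β = 0 the check reduces to requiring pairwise disjoint inverted lists.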
11
Pivot Set Selection
The weight of a token w is the number of queries that contain w.
Goodness of a pivot set S: [formula lost in extraction]
  By partitioning using tokens contained in many queries, we can benefit many queries.
Selecting the best pivot set is an NP-hard problem (see the paper).
We use a simple greedy algorithm (simplified version; see the paper for the details):
  Select those tokens first whose weights are high.
[Figure: tokens w1 to w7 linked to the records r1 to r15 that contain them]
Problem: by selecting the high-frequency token w1 first, we lose the chance to divide the records in I(w2) and I(w4). If we divide the records in I(w2) and I(w4), however, we can benefit more queries.
We solve the problem using the partitioning algorithm.
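One way to read the simplified greedy rule is sketched below: visit tokens in decreasing weight and keep a token if it satisfies the JC ≤ β condition against every token already kept. This is an interpretation for illustration only; the paper's actual algorithm may differ, and the weights and inverted lists are made up so that the problem described above shows up.

```python
def jc(s1, s2):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|); 0 if either set is empty (assumption)."""
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))

def greedy_pivot_set(weights, inverted, beta):
    """Pick tokens by decreasing weight, keeping those compatible with earlier picks."""
    selected = []
    for w in sorted(weights, key=lambda t: (-weights[t], t)):
        if all(jc(inverted[w], inverted[s]) <= beta for s in selected):
            selected.append(w)
    return selected

# Hypothetical data: I(w1) covers both I(w2) and I(w4).
inverted = {"w1": {1, 2, 3, 4, 5, 6}, "w2": {1, 2, 3},
            "w4": {4, 5, 6}, "w3": {7, 8, 9}}
weights = {"w1": 10, "w2": 8, "w4": 7, "w3": 5}

picked = greedy_pivot_set(weights, inverted, beta=0.0)
picked_weight = sum(weights[w] for w in picked)
alternative_weight = sum(weights[w] for w in ["w2", "w3", "w4"])
```

Picking w1 first excludes w2 and w4, although {w2, w3, w4} is also a valid pivot set here and has a larger total weight (using the sum of weights as a stand-in score, which is an assumption).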
12
Partitioning Algorithm
Partitioning algorithm (simplified version; see the paper for the details):
  Select a pivot set.
  Partition the records using the pivot set.
  In each partition, recursively partition the records and handle local orphan records.
  Balance the overhead against the benefit of partitioning using a cost model.
Local orphan record: insert it into either P11 or P12.
Note: recursive partitioning does not affect the relative document frequencies of w1 in each partition.
[Figure: tokens w1 to w7 and records r1 to r15, with partitions P1 and P2 and sub-partitions P11 and P12]
13
Experiments
Datasets and statistics
Dataset       # records    Avg # tokens   # partitions
IMDB Actor    1,213,391    16             ED 28, JC 12
IMDB Movie    1,568,891    19             ED 18, JC 12
DBLP Author   2,948,929    15             ED 55, JC 55
Web Corpus    6,000,000    21             ED 54, JC 85
Similarity functions
  Jaccard similarity (thresholds: 0.6, 0.7, 0.8)
  Edit distance (thresholds: 2, 3, 4)
Search algorithms
  Jaccard: SequentialMerge, DivideSkip [Li et al., ICDE `08], PPMerge [Xiao et al., WWW `08]
  Edit distance: SequentialMerge, DivideSkip, EDMerge [Xiao et al., PVLDB `08]
  Size Filtering [Arasu et al., VLDB `06] (for all algorithms)
Partitioned case vs. unpartitioned case
  Elapsed times
  Number of candidates
14
Experiments
Jaccard similarity (DBLP Author)
[Figures: running time; number of candidates]
15
Experiments
Edit distance (Web Corpus)
[Figures: running time; number of candidates]
※ Edit distance: false positives are not removed!
16
Conclusions
We studied how to reduce the number of candidates for efficient similarity searches.
We proposed the concept of the pivot set and a partitioning technique that uses a pivot set.
We showed the benefits of the proposed technique experimentally.
17
THANK YOU!