A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng

Search is Important
Figure: Google searches per year. Source: http://www.internetlivestats.com/google-search-statistics/

Speed Matters Source:

Data is Dirty
Typos:
– Typo in “title”: relaxed / related
– Argyrios Zymnis / Argyris Zymnis
Source: DBLP, Complete Search

Similarity Search
Input: a query string and a string dataset.
Output: all the strings in the dataset that are similar to the query.

Edit Distance
ED(r, s): the minimum number of edit operations (insertion, deletion, substitution) needed to transform r into s.
For example, ED(sigcom, sigmod) = 2:
sigcom → sigmom (substitute c with m) → sigmod (substitute m with d)
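The definition can be checked with a short Python sketch. It uses the textbook two-row dynamic program for edit distance; the function name `edit_distance` is an illustrative choice, not something taken from the slides.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    prev = list(range(len(s) + 1))          # distances from the empty prefix of r
    for i, rc in enumerate(r, start=1):
        curr = [i]                          # turning r[:i] into "" takes i deletions
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # substitute rc with sc, or match
        prev = curr
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))  # 2
```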

Problem Definition
Given a query string s, a threshold τ and a string dataset R, find every string r in R with ED(s, r) ≤ τ.
Example: s = “yotubecom” and τ = 2. Since ED(s, r_4) ≤ 2, r_4 is output as a result.

Applications
– Spell checking
– Copy detection
– Entity linking
– Bioinformatics
– …

Challenge

Filter-and-Verification Framework
Input: dataset R (indexed), threshold τ, query string s.
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r becomes a candidate.
Verify: ED(r, s) ≤ τ? If yes, r is added to the results.

Preliminary: q-gram
A q-gram is a substring of length q.
Example: the 2-grams of youtbecom are yo, ou, ut, tb, be, ec, co, om.
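A minimal Python sketch of q-gram extraction, reproducing the 2-gram example from the slide. The helper name `qgrams` is an assumption made for illustration.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("youtbecom", 2))
# ['yo', 'ou', 'ut', 'tb', 'be', 'ec', 'co', 'om']
```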

Preliminary: q-gram
– One edit operation destroys at most q q-grams.
– τ edit operations destroy at most qτ q-grams.
– If r and s have more than qτ mismatched q-grams, then ED(r, s) > τ.
Example (q = 2): an edit at b in youtbecom destroys only the two grams tb and be; yo, ou, ut, ec, co, om survive.
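The counting argument can be sketched in Python as follows. The sketch treats the q-grams of each string as a multiset and counts the q-grams of r that have no partner in s; the names `mismatch_count` and `count_filter_prunes` are hypothetical, and the second string "sigmodxyz" is made up for the demonstration.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def mismatch_count(r, s, q):
    """Number of q-grams of r (counted as a multiset) with no partner in s."""
    return sum((Counter(qgrams(r, q)) - Counter(qgrams(s, q))).values())


def count_filter_prunes(r, s, q, tau):
    """True means ED(r, s) > tau is certain; False means the pair must be verified."""
    return mismatch_count(r, s, q) > q * tau


print(mismatch_count("youtbecom", "youtecom", 2))           # 2 (tb and be)
print(count_filter_prunes("youtbecom", "youtecom", 2, 1))   # False
print(count_filter_prunes("youtbecom", "sigmodxyz", 2, 2))  # True
```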

Preliminary: Prefix Filter
– Sort all q-grams by a global ordering, such as idf.
– q(r), q(s): the sorted q-gram sets of strings r and s.
– Pre( ) is the prefix of q( ), with |Pre( )| = qτ+1; the remaining q-grams form the suffix.
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.
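A small Python sketch of the prefix filter. As a simplifying assumption it uses plain lexicographic order as the global ordering (any fixed total order keeps the filter sound), and it declines to prune when a string has fewer than qτ+1 q-grams. The function names are hypothetical.

```python
def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def prefix(s, q, tau, key=None):
    """First q*tau+1 q-grams of s under the global order `key`; None if s is too short."""
    grams = sorted(qgrams(s, q), key=key)  # key=None means lexicographic order
    size = q * tau + 1
    return grams[:size] if len(grams) >= size else None


def prefix_filter_prunes(r, s, q, tau, key=None):
    """True means ED(r, s) > tau is certain; False means the pair must be verified."""
    pre_r, pre_s = prefix(r, q, tau, key), prefix(s, q, tau, key)
    if pre_r is None or pre_s is None:
        return False  # too few q-grams: the filter cannot decide
    return set(pre_r).isdisjoint(pre_s)


print(prefix("youtbecom", 2, 2))                             # ['be', 'co', 'ec', 'om', 'ou']
print(prefix_filter_prunes("youtbecom", "yotubecom", 2, 2))  # False
print(prefix_filter_prunes("youtbecom", "sigmodxyz", 2, 2))  # True
```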

Preliminary: Prefix Filter (example)
Figure: the sorted q-gram sets q(r) and q(s) over the grams g1 to g13, with Pre(r), Pre(s) and suffix(r) marked; annotation “> g10”.
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.

Preliminary: disjoint q-gram
– One edit operation destroys at most one disjoint q-gram.
– τ edit operations destroy at most τ disjoint q-grams.
– If r and s have more than τ mismatched disjoint q-grams, then ED(r, s) > τ.
Example: among the disjoint grams yo, ut, be, om of youtbecom, an edit at b destroys only be.
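The slide's example can be replayed in Python. The sketch represents each q-gram as a (gram, start position) pair, checks that the chosen grams do not overlap, and counts how many of them disappear after the edit. The helper names and the choice of deleting b are assumptions made for illustration.

```python
def are_disjoint(grams, q):
    """True if no two positional q-grams share a character position."""
    starts = sorted(pos for _, pos in grams)
    return all(b - a >= q for a, b in zip(starts, starts[1:]))


def destroyed(grams, s):
    """Count the chosen q-grams that no longer occur anywhere in s."""
    return sum(1 for gram, _ in grams if gram not in s)


r, s, q = "youtbecom", "youtecom", 2   # s is r with the b deleted (one edit)
chosen = [("yo", 0), ("ut", 2), ("be", 4), ("om", 7)]

print(are_disjoint(chosen, q))                  # True
print(are_disjoint([("ut", 2), ("tb", 3)], q))  # False, the two grams share the t
print(destroyed(chosen, s))                     # 1, only be is lost
```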

Pivotal Prefix Filter
– Sort all q-grams by a global ordering, such as idf.
– q(r), q(s): the sorted q-gram sets of strings r and s, with prefixes Pre(r) and Pre(s).
– Piv( ) is the pivotal prefix of q( ): |Piv( )| = τ+1 and the q-grams in Piv( ) are disjoint.
If Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.

Pivotal Prefix Filter (example)
Figure: the sorted q-gram sets q(r) and q(s) over the grams g1 to g13, with Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s) and suffix(r) marked; annotation “> g10”.
Pivotal Prefix Filter: if last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ, where last( ) is the last q-gram of Pre( ).

Pivotal Prefix Filter (example)
Figure: the sorted q-gram sets q(r) and q(s) over the grams g1 to g13, with Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s) and suffix(r) marked; annotation “> g10”.
Pivotal Prefix Filter: if last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ, where last( ) is the last q-gram of Pre( ).

Pivotal Prefix Filter
– If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ.
– If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.
Existence: there must exist τ+1 disjoint q-grams in the prefix.
The pivotal prefix is a subset of the prefix:
– The pivotal prefix filter dominates the prefix filter.
– The signature sizes are O(τ) and O(qτ) respectively.
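A hedged Python sketch of the two rules. Several choices here are assumptions rather than the authors' implementation: the global order is plain lexicographic order, the pivotal prefix is picked greedily by start position (any τ+1 disjoint q-grams from the prefix would do), and a tie last(r) = last(s) is handled by the second rule. All function names are hypothetical.

```python
def positional_qgrams(s, q):
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def prefix(s, q, tau):
    """First q*tau+1 positional q-grams in lexicographic order; None if s is too short."""
    grams = sorted(positional_qgrams(s, q))
    size = q * tau + 1
    return grams[:size] if len(grams) >= size else None


def pivotal_prefix(pre, q, tau):
    """Greedily pick tau+1 pairwise disjoint q-grams from a prefix, scanning by position."""
    piv, next_free = [], 0
    for gram, pos in sorted(pre, key=lambda g: g[1]):
        if pos >= next_free:          # does not overlap the previously picked gram
            piv.append((gram, pos))
            next_free = pos + q
    return piv[:tau + 1]


def pivotal_filter_prunes(r, s, q, tau):
    """True means ED(r, s) > tau is certain; False means the pair must be verified."""
    pre_r, pre_s = prefix(r, q, tau), prefix(s, q, tau)
    if pre_r is None or pre_s is None:
        return False                  # too few q-grams: the filter cannot decide
    last_r, last_s = pre_r[-1][0], pre_s[-1][0]
    if last_r > last_s:
        piv, pre = pivotal_prefix(pre_s, q, tau), pre_r   # test Piv(s) against Pre(r)
    else:
        piv, pre = pivotal_prefix(pre_r, q, tau), pre_s   # test Piv(r) against Pre(s)
    return {g for g, _ in piv}.isdisjoint(g for g, _ in pre)


print(pivotal_prefix(prefix("yotubecom", 2, 2), 2, 2))  # [('ot', 1), ('be', 4), ('co', 6)]
print(pivotal_filter_prunes("youtbecom", "yotubecom", 2, 2))  # False
print(pivotal_filter_prunes("youtbecom", "sigmodxyz", 2, 2))  # True
```

Only τ+1 q-grams of one string are compared against the prefix of the other, which is where the smaller O(τ) signature comes from.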

Related Work
Method                  |Sig(r)|   |Sig(s)|
Prefix Filter           O(qτ)      O(qτ)
Mismatch Filter         O(qτ)      O(qτ)
Qchunk Filter           O(τ)       O(l)
Pivotal Prefix Filter   O(τ)       O(qτ)
– Mismatch Filter [Xiao VLDB08]: shortens the prefix length, but it is still O(qτ).
– Qchunk Filter [Qin SIGMOD11]: shortens one signature to O(τ) but increases the other one to O(l).
– Adaptive Prefix [Wang SIGMOD12]: increases the prefix length to reduce the number of candidates. Orthogonal and can be integrated into our method.
– Flamingo [Li ICDE08]: based on the count filter, accelerating the counting process. Orthogonal and can be integrated into our method.

Pivotal Search Algorithm
Indexing
– Build inverted indexes for both the prefix and the pivotal prefix of the data strings.
Querying
– Generate the prefix and the pivotal prefix for the query string.
– Probe the prefix index with the pivotal prefix of the query.
– Probe the pivotal prefix index with the prefix of the query.
– Verify the candidates and output the results.
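The indexing and querying steps can be sketched end to end in Python. This is an illustration under stated assumptions, not the authors' C++ implementation: the index is built for one fixed τ, the global order is lexicographic, pivotal q-grams are picked greedily, strings too short to have a prefix are always verified, and the class name `PivotalIndex` and the sample dataset are made up.

```python
from collections import defaultdict


def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def prefix(s, q, tau):
    """First q*tau+1 positional q-grams in lexicographic order; None if s is too short."""
    grams = sorted((s[i:i + q], i) for i in range(len(s) - q + 1))
    size = q * tau + 1
    return grams[:size] if len(grams) >= size else None


def pivotal_prefix(pre, q, tau):
    """Greedily pick tau+1 pairwise disjoint q-grams from a prefix."""
    piv, next_free = [], 0
    for gram, pos in sorted(pre, key=lambda g: g[1]):
        if pos >= next_free:
            piv.append((gram, pos))
            next_free = pos + q
    return piv[:tau + 1]


class PivotalIndex:
    def __init__(self, strings, q, tau):
        self.strings, self.q, self.tau = list(strings), q, tau
        self.prefix_index = defaultdict(set)    # gram -> ids with the gram in Pre(r)
        self.pivotal_index = defaultdict(set)   # gram -> ids with the gram in Piv(r)
        self.short_ids = set()                  # ids of strings too short to filter
        for sid, r in enumerate(self.strings):
            pre = prefix(r, q, tau)
            if pre is None:
                self.short_ids.add(sid)
                continue
            for gram, _ in pre:
                self.prefix_index[gram].add(sid)
            for gram, _ in pivotal_prefix(pre, q, tau):
                self.pivotal_index[gram].add(sid)

    def search(self, query):
        pre = prefix(query, self.q, self.tau)
        if pre is None:
            candidates = set(range(len(self.strings)))   # cannot filter at all
        else:
            candidates = set(self.short_ids)
            for gram, _ in pivotal_prefix(pre, self.q, self.tau):
                candidates |= self.prefix_index.get(gram, set())   # Piv(s) vs Pre(r)
            for gram, _ in pre:
                candidates |= self.pivotal_index.get(gram, set())  # Pre(s) vs Piv(r)
        return [self.strings[i] for i in sorted(candidates)
                if edit_distance(self.strings[i], query) <= self.tau]


data = ["youtubecom", "youtbecom", "sigmodconf", "tsinghuaedu"]  # hypothetical dataset
index = PivotalIndex(data, q=2, tau=2)
print(index.search("yotubecom"))  # ['youtubecom', 'youtbecom']
```

A string is skipped only when both probes miss it, so the sketch never loses a true result; unrelated strings such as "sigmodconf" may still become candidates through a shared q-gram and are then rejected by verification.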

Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we probe, the more candidates we may have.
For query string:
For data string:

Optimal Pivotal Prefix Selection
Objective: select m = τ+1 optimal pivotal q-grams from the first n = qτ+1 q-grams in the prefix.
Dynamic programming, case by case on the last pivotal q-gram:
– Select m−1 optimal pivotal q-grams from the first n−1 q-grams, then select the last pivotal q-gram.
– Select m−1 optimal pivotal q-grams from the first n−2 q-grams, then select the last pivotal q-gram.
– …
– Select m−1 optimal pivotal q-grams from the first m−1 q-grams, then select the last pivotal q-gram.
Recursive formula:

Filter-and-Verification Framework
Input: dataset R (indexed), threshold τ, query string s.
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r becomes a candidate.
Verify: does the candidate pass the alignment filter? If yes, check ED(r, s) ≤ τ; if yes, r is added to the results.

Alignment Filter

Substring edit distance (sed)

Alignment Filter

Experiments
Settings:
– C++, compiled with g++ 4.8.2 and the -O3 flag
– 64-bit Ubuntu Server 12.04 LTS
– Intel Xeon E5-2650 2.00GHz processor and 16GB memory

Evaluating Pivotal Prefix Filter: Average Search Time
– Mismatch: from EDJoin
– CrossFilter: cross filter
– PivotalFilter: pivotal prefix filter
– CrossSelect: CrossFilter + pivotal prefix selection
– PivotalSearch: PivotalFilter + pivotal prefix selection

Evaluating Pivotal Prefix Filter: Candidate Number
– Mismatch: from EDJoin
– CrossFilter: cross filter
– PivotalFilter: pivotal prefix filter
– CrossSelect: CrossFilter + pivotal prefix selection
– PivotalSearch: PivotalFilter + pivotal prefix selection

Evaluating Alignment Filter: Average Search Time
– NoFilter: without any filter
– ContentFilter: from EDJoin
– AlignFilter: alignment filter

Evaluating Alignment Filter: Candidate Number
– NoFilter: without any filter
– ContentFilter: from EDJoin
– AlignFilter: alignment filter
– Real: number of results

Comparison with State-of-the-Art Methods
– PivotalSearch: our method
– Adaptive: [Wang SIGMOD12]
– Flamingo: [Li ICDE08]
– Qchunk: [Qin SIGMOD11]

Scalability

Conclusion
– Pivotal prefix filter
– Pivotal search algorithm
– Optimal pivotal prefix selection
– Alignment filter

Thank You
Q & A
Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html

Outline
– Problem Definition
– Pivotal Prefix Filter
– The Similarity Search Algorithm
– Alignment Filter
– Experiment
– Conclusion

Outline
– Motivation and Problem Definition
– Pivotal Prefix Filter
– The Similarity Search Algorithm
– Alignment Filter
– Experiment
– Conclusion

Complexity

Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we scan, the larger the filtering cost and the smaller the pruning power.
For query string:
For data string:
Existence of pivotal prefix: there must exist at least τ+1 disjoint q-grams in the prefix Pre(r) for any string r.

Complexity

Preliminary: Prefix Filter (example)
Figure: the sorted q-gram sets q(r) and q(s) over the grams g1 to g13, with Pre(r) and Pre(s) marked; annotation “> g10”.
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ.

Alignment Filter
Non-consecutive errors: youtubecom vs. yoytupecxm. With q = 3, the 3 non-consecutive errors destroy 8 q-grams.
Consecutive errors: youtubecom vs. youtzpxcom. With q = 3, the 3 consecutive errors destroy only 5 q-grams.
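The two counts on this slide can be reproduced with a few lines of Python. The sketch only counts destroyed q-grams for equal-length strings (substitution errors); it is not the alignment filter itself, whose definition does not appear in this transcript. The function name is hypothetical.

```python
def destroyed_qgrams(r, s, q):
    """Count start positions i where the q-gram r[i:i+q] differs from s[i:i+q].

    Assumes len(r) == len(s), i.e. the errors are substitutions only.
    """
    if len(r) != len(s):
        raise ValueError("this sketch only handles equal-length strings")
    return sum(r[i:i + q] != s[i:i + q] for i in range(len(r) - q + 1))


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # 8, errors spread out
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # 5, errors adjacent
```

The same number of errors therefore leaves more surviving q-grams when the errors are consecutive.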

Indexing
Fix a global gram order. We use gram frequency ascending order.
Global gram order (gram: frequency):
im:1 my:1 te:1 bu:1 un:1 nt:1 uc:1 bb:1 tb:1 oy:1 yt:1 ca:2 om:2 yo:3 ou:3 ut:3 ub:3 co:3 tu:3 be:3 ec:4
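A minimal Python sketch of building such an order. It assumes that frequency means the number of data strings containing the q-gram and that ties are broken alphabetically; neither detail is stated on the slide. The three sample strings are made up and are not the dataset behind the slide.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def global_gram_order(strings, q):
    """q-grams ranked by ascending frequency, ties broken alphabetically."""
    freq = Counter(g for s in strings for g in set(qgrams(s, q)))
    return sorted(freq, key=lambda g: (freq[g], g))


order = global_gram_order(["youtube", "youtbe", "tube"], 2)
print(order)  # ['tb', 'ou', 'tu', 'ub', 'ut', 'yo', 'be']

rank = {gram: i for i, gram in enumerate(order)}
print(sorted(qgrams("youtbe", 2), key=rank.get))  # ['tb', 'ou', 'ut', 'yo', 'be']
```

Putting rare q-grams first keeps the inverted lists of prefix q-grams short.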

Indexing
Build inverted indexes for the prefix and the pivotal prefix.
Figure: the global gram order and the pivotal prefix Piv(r_i) of each data string.

Indexing
Build inverted indexes for the prefix and the pivotal prefix.
Figure: the pivotal prefix index and the prefix index, with Piv(r_i) marked.

Querying
Generate the prefix and the pivotal prefix for the query string.
Figure: the global gram order.

Querying
– Probe the prefix index with the pivotal prefix of the query.
– Probe the pivotal prefix index with the prefix of the query.

Querying Verify the candidates and output results

Related Work
– EDJoin [Xiao VLDB08]: shortens the prefix length, but it is still O(qτ).
– Qchunk [Qin SIGMOD11]: shortens one signature to O(τ) but increases the other one to O(l).
– Adaptive Prefix [Wang SIGMOD12]: increases the prefix length to reduce the number of candidates. Orthogonal and can be integrated into our method.
– Flamingo [Li ICDE08]: based on the count filter, accelerating the counting process. Orthogonal and can be integrated into our method.

Optimal Pivotal Prefix Selection
Recursive formula:
Dynamic programming:
1. First sort all the q-grams in the prefix by their start positions and denote the k-th q-gram as g_k.
2. Let f(m, n) denote the optimal sum of inverted list lengths when selecting n disjoint q-grams from the first m q-grams in the prefix.
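One way to realize such a dynamic program is sketched below in Python. The recurrence is an assumption (a standard weighted selection over position-sorted q-grams) and is not necessarily the formula shown on the slide: with the q-grams sorted by start position, f[i][j] is the smallest total inverted list length for picking i disjoint q-grams among the first j, and q-gram j is either skipped or taken together with the best choice among the q-grams that end before it starts. The function name, the prefix and the list lengths in the example are made up.

```python
from bisect import bisect_right


def select_pivotal(prefix, q, tau, list_len):
    """Pick tau+1 disjoint q-grams from `prefix` with minimum total list length.

    prefix:   list of (gram, start position) pairs
    list_len: dict mapping a gram to the length of its inverted list
    """
    grams = sorted(prefix, key=lambda g: g[1])
    starts = [pos for _, pos in grams]
    n, m = len(grams), tau + 1
    inf = float("inf")
    # before[j]: how many of the sorted grams end before gram j+1 starts
    before = [bisect_right(starts, pos - q) for pos in starts]
    # f[i][j]: best cost of choosing i disjoint grams among the first j grams
    f = [[0] * (n + 1)] + [[inf] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            take = f[i - 1][before[j - 1]] + list_len[grams[j - 1][0]]
            f[i][j] = min(f[i][j - 1], take)   # skip gram j, or take it
    if f[m][n] == inf:
        raise ValueError("prefix does not contain tau+1 disjoint q-grams")
    chosen, i, j = [], m, n                    # walk back through the table
    while i > 0:
        if f[i][j] == f[i][j - 1]:
            j -= 1                             # gram j was skipped
        else:
            chosen.append(grams[j - 1])        # gram j was taken
            i, j = i - 1, before[j - 1]
    return chosen[::-1]


# Hypothetical prefix of "youtbecom" (q = 2, tau = 2) with made-up list lengths.
pre = [("tb", 3), ("om", 7), ("be", 4), ("co", 6), ("ou", 1)]
lengths = {"tb": 1, "om": 2, "be": 3, "co": 3, "ou": 3}
print(select_pivotal(pre, 2, 2, lengths))  # [('ou', 1), ('tb', 3), ('om', 7)]
```

In the example the cheapest disjoint triple costs 3 + 1 + 2 = 6; picking co instead of om would cost 7.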
