Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity.

Slides:

Advertisements

Similar presentations

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Space-Constrained Gram-Based Indexing for Efficient.

Advertisements

String Similarity Measures and Joins with Synonyms

Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008.

Efficient Top-k Search across Heterogeneous XML Data Sources Jianxin Li 1 Chengfei Liu 1 Jeffrey Xu Yu 2 Rui Zhou 1 1 Swinburne University of Technology.

Indexing DNA Sequences Using q-Grams

Computer Science and Engineering Inverted Linear Quadtree: Efﬁcient Top K Spatial Keyword Search Chengyuan Zhang 1,Ying Zhang 1,Wenjie Zhang 1, Xuemin.

Reference-based Indexing of Sequence Databases Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine University of Florida-Gainesville.

Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.

Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.

The Flamingo Software Package on Approximate String Queries Chen Li UC Irvine and Bimaple

1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.

1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.

Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search Alexander Behm 1, Shengyue Ji 1, Chen Li 1, Jiaheng.

A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.

Speaker: Sattam Alsubaiee Supporting Location-Based Approximate-Keyword Queries Sattam Alsubaiee, Alexander Behm, and Chen Li University of California,

Efficient Type-Ahead Search on Relational Data: a TASTIER Approach Guoliang Li 1, Shengyue Ji 2, Chen Li 2, Jianhua Feng 1 1 Tsinghua University, Beijing,

Quantile-Based KNN over Multi- Valued Objects Wenjie Zhang Xuemin Lin, Muhammad Aamir Cheema, Ying Zhang, Wei Wang The University of New South Wales, Australia.

SST:an algorithm for finding near- exact sequence matches in time proportional to the logarithm of the database size Eldar Giladi Eldar Giladi Michael.

Reza Sherkat ICDE061 Reza Sherkat and Davood Rafiei Department of Computing Science University of Alberta Canada Efficiently Evaluating Order Preserving.

Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.

Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore VLDB’2005 * Liang Jin and.

1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring.

Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently Xiaochun Yang, Bin Wang Chen Li Northeastern.

Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Ontology Alignment/Matching Prafulla Palwe. Agenda ► Introduction  Being serious about the semantic web  Living with heterogeneity  Heterogeneity problem.

Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,

DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.

Similarity join problem with Pass- Join-K using Hadoop ---BY Yu Haiyang.

Click to edit Present’s Name Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2 HmSearch: An Efficient Hamming Distance Query Processing.

Filter Algorithms for Approximate String Matching Stefan Burkhardt.

Experiments An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Entity Extraction A Document An Efficient Filter.

Reference-Based Indexing of Sequence Databases (VLDB ’ 06) Jayendra Venkateswaran Deepak Lachwani Tamer Kahveci Christopher Jermaine Presented by Angela.

文本挖掘简介邹权博士，助理教授. Outline  Introduction  TF-IDF  Similarity.

Experiments Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction Entity Extraction A Document An Efficient Filter.

The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search Evica Milchevski , Avishek Anand ★ and Sebastian Michel.

VGRAM:Improving Performance of Approximate Queries on String Collections Using Variable- Length Grams VLDB 2007 Chen Li (UC, Irvine) Bin Wang (Northeastern.

Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.

Presented by: Aneeta Kolhe. Named Entity Recognition finds approximate matches in text. Important task for information extraction and integration, text.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

University of Macau Discovering Longest-lasting Correlation in Sequence Databases Yuhong Li Department of Computer and Information Science.

The Similarity Graph: Analyzing Database Access Patterns Within A Company PhD Preliminary Examination Robert Searles Committee John Cavazos and Stephan.

Output Sensitive Algorithm for Finding Similar Objects Jul/2/2007 Combinatorial Algorithms Day Takeaki Uno Takeaki Uno National Institute of Informatics,

Some string optimization tips Haiyang Yu /14 Outline  Background  Tips for dealing with strings.

Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

On the Intersection of Inverted Lists Yangjun Chen and Weixin Shen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba,

Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.

1 Double-Patterning Aware DSA Template Guided Cut Redistribution for Advanced 1-D Gridded Designs Zhi-Wen Lin and Yao-Wen Chang National Taiwan University.

Lecture 4: Data Integration and Cleaning CMPT 733, SPRING 2016 JIANNAN WANG.

Ning Jin, Wei Wang ICDE 2011 LTS: Discriminative Subgraph Mining by Learning from Search History.

EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad.

Efficient Approximate Search on String Collections Part I

Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China)

TT-Join: Efficient Set Containment Join

Pass-Join: A Partition based Method for Similarity Joins

Chuan Xiao, Wei Wang, Xuemin Lin

Top-k String Similarity Search with Edit-Distance Constraints

Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China)

Efficient Record Linkage in Large Data Sets

Jongik Kim1, Dong-Hoon Choi2, and Chen Li3

An Efficient Partition Based Method for Exact Set Similarity Joins

Dong Deng+, Yu Jiang+, Guoliang Li+, Jian Li+, Cong Yu^

Dong Deng, Guoliang Li, He Wen, H. V. Jagadish, Jianhua Feng

Presentation transcript:

Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Search is Important Source: Google Searches per Year

Speed Matters Source:

Data is Dirty Typos Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search

Similarity Search Query String Dataset All the strings similar to the query

ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s. For example: ED(sigcom, sigmod) = 2 Edit Distance sigcom sigmom sigmod substitute c with m substitute m with d

Problem Definition Query string s = “yotubecom” and τ = 2 string dataset R ed(s, r 4 ) <= 2 output r 4 as a result

Application Spell Checking Copy Detection Entity Linking Bioinformatic ….

Challenge

No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ ? Verify: ED(r,s) ≤ τ? Yes Index

Preliminary: q-gram q-gram of the substring with length q yo ou ut tb be ec co om youtbecom 2-gram

d d d Preliminary: q-gram 1 edit operation destroies at most q grams. τ edit operations destroy at most qτ grams. if r and s have more than qτ mismatch grams, ED(r, s)> τ. yout ecom yo ou ut t e ec co om

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) q(r) : The sorted q-gram set of string r Pre(r) q(s): The sorted q-gram set of string s Pre( ) is the prefix of q( ) |Pre( )|= qτ+1 Prefix Filter: If pre(r) ∩ pre(s) = ϕ, ED(r,s) > τ suffix(r)

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) g5g5 g6g6 g 11 g 12 g 13 g1g1 g2g2 g7g7 g8g8 g9g9 g 10 g 12 g3g3 g4g4 q(r) : The sorted q-gram set of string r Pre(r) q(s): The sorted q-gram set of string s Pre( ) is the prefix of q( ) |Pre( )|= qτ+1 Prefix Filter: If pre(r) ∩ pre(s) = ϕ, ED(r,s) > τ >g 10 suffix(r)

d d Preliminary: disjoint q-gram One edit operation destroies at most 1 disjoint gram. τ edit operations destroy at most τ disjoint grams. if r and s have more than τ mismatch disjoint grams, ED(r, s)> τ yout ecom e yo ut om

q(s): The sorted q-gram set of string s Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) q(r) : The sorted q-gram set of string r Pre(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= τ+1 and the q-grams in Piv( ) are disjoint Piv(r) Piv(s) suffix(r) If piv(s) ∩ pre(r) = ϕ and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ

q(s): The sorted q-gram set of string s Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) g8g8 g 10 g5g5 g6g6 g9g9 g 11 g 13 g1g1 g3g3 q(r) : The sorted q-gram set of string r Pivotal Prefix Filter: If last(s)> last(r) and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ Pre(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= τ+1 and the q-grams in Piv( ) are disjoint Piv(r) Piv(s) >g 10 last(r) last(s) suffix(r)

q(s): The sorted q-gram set of string s Pivotal Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) g6g6 g9g9 g 12 g 13 g1g1 g4g4 g7g7 g 10 g 11 g3g3 q(r) : The sorted q-gram set of string r Pivotal Prefix Filter: If last(r)> last(s) and piv(s) ∩ pre(r) = ϕ, ED(r,s) > τ Pre(r) Piv( ) is the pivotal prefix of q( ) |Piv( )|= τ+1 and the q-grams in Piv( ) are disjoint Piv(r) Piv(s) >g 10 last(r) last(s) suffix(r)

Pivotal Prefix Filter If last(r)> last(s) and piv(s) ∩ pre(r) = ϕ, ED(r,s) > τ If last(s)> last(r) and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ Existence: There must exist τ+1 disjoint grams in the prefix The Pivotal Prefix is a subset of the Prefix – The pivotal prefix filter dominates the prefix filter – Signature size are O(τ) and O(qτ) respectively

Related Work Method|Sig(r)||Sig(s)| Prefix FilterO(qτ) Mismatch FilterO(qτ) Qchunk FilterO(τ)O( l ) Pivotal Prefix FilterO(τ)O(qτ) Mismatch Filter [Xiao VLDB08] : Shorten prefix length, but still O(qτ) Qchunk Filter[Qin SIGMOD11] : Shorten one to O(τ) but increased the other one to O( l ) Adaptive Prefix[Wang SIGMOD12] – Increase prefix length to reduce candidate number – Orthogonal and can be integrated into our method Flamingo[Li ICDE08] – Based on count filter. Accelerating counting process. – Orthogonal and can be integrated into our method

Pivotal Search Algorithm Indexing – Build inverted indexes for both the prefix and the pivotal prefix of the data strings Querying – Generate prefix and pivotal prefix for the query string – Probe the prefix index with the pivotal prefix of the query – Probe the pivotal prefix index with the prefix of the query – Verify the candidates and output results

Pivotal Prefix Selection Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have. For query string: For data string:

Optimal Pivotal Prefix Selection Dynamic Programming: Select m-1 optimal pivotal q-grams from the first n-1 q-grams in prefix Select as last pivotal q-gram Object: Select m= τ+1 optimal pivotal q-grams from the first n=qτ+1 grams in the prefix

Optimal Pivotal Prefix Selection Dynamic Programming: Select m-1 optimal pivotal q-grams from the first n-2 q-grams Select as last pivotal q-gram

Optimal Pivotal Prefix Selection Dynamic Programming: Select m-1 optimal pivotal q-grams from the first m-1 q-grams Select as last pivotal q-gram Recursive formula:

No Filter-and-Verification Framework Dataset R Threshold τ Query string s Results Filter: Signature(s) ∩ Signature(r) = ϕ ? Verify: alignment filter? If yes, ED(r,s) ≤ τ? Yes Index

Alignment Filter

Substring edit distance (sed)

Alignment Filter

Experiments Settings: C++, g with -O3 flags 64bit Ubuntu Server LTS version Intel Xeon E GHz processor and 16GB memory.

Evaluating Pivotal Prefix Filter Average Search Time Mismatch: From EDJoin CrossFiler: Cross Filter PivotalFilter: PivotalFilter CrossSelect: CrossFilter + Pivotal Prefix Selection PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Pivotal Prefix Filter Candidate Number Mismatch: From EDJoin CrossFiler: Cross Filter PivotalFilter: PivotalFilter CrossSelect: CrossFilter + Pivotal Prefix Selection PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Alignment Filter Average Search Time NoFilter: without any filter ContentFilter: From EDJoin AlignFilter: Alignment Filter

Evaluating Alignment Filter Candidate Number NoFilter: without any filter ContentFilter: From EDJoin AlignFilter: Alignment Filter Real: Number of results

Comparison with State-of-the-arts PivotalSearch: Our method Adaptive: [Wang2012] Flamingo: [Li2008] Qchunk: [Qin 2011]

Scalability

Conclusion Pivotal prefix filter Pivotal search algorithm Optimal pivotal prefix selection Alignment filter

THANK YOU Q & A Project hompage:

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Motivation and Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Outline Problem Definition Pivotal Prefix Filter The Similarity Search Algorithm Alignment Filter Experiment Conclusion

Complexity

Pivotal Prefix Selection Evaluating Different Pivotal Prefixes: T he longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is. For query string: For data string: Existence of Pivotal Prefix: There must exist at least τ+1 disjoint q-grams in the prefix pre(r) for any string r

Complexity

Preliminary: Prefix Filter Sort all q-grams by global ordering, such as idf Pre(s) g5g5 g6g6 g9g9 g 10 g 11 g1g1 g2g2 g7g7 g8g8 g 12 g 13 g3g3 g4g4 q(r) : The sorted q-gram set of string r Pre(r) q(s): The sorted q-gram set of string s Pre( ) is the prefix of q( ) |Pre( )|= qτ+1 Prefix Filter: If pre(r) ∩ pre(s) = ϕ, ED(r,s) > τ >g 10

Alignment Filter non-consecutive errors: youtubecom yoytupecxm q=3, the 3 non-consecutive errors destroy 8 q-grams youtubecom youtzpxcom q=3, the 3 consecutive errors only destroy 5 q-grams consecutive errors:

Indexing Fix a global gram order We use gram frequency ascending order Global gram order immytebuunntucbbtboyytcaomyoouutubcotubeec

Indexing Build inverted indexes for prefix and pivotal prefix Global gram order immytebuunntucbbtboyytcaomyoouutubcotubeec Piv(r i )

Indexing Build inverted indexes for prefix and pivotal prefix Pivotal Prefix Index Prefix Index Piv(r i )

Querying Generate prefix and pivotal prefix for the query string Global gram order immytebuunntucbbtboyytcaomyoouutubcotubeec

Querying Probe the prefix index with the pivotal prefix of the query Probe the pivotal prefix index with the prefix of the query

Querying Verify the candidates and output results

Related Work EDJoin [Xiao VLDB08] – Shorten prefix length, but still O(qτ) Qchunk[Qin SIGMOD11] – Shorten one to O(τ) but increased the other one to O( l ) Adaptive Prefix[Wang SIGMOD12] – Increase prefix length to reduce candidate number – Orthogonal and can be integrated into our method Flamingo[Li ICDE08] – Based on count filter. Accelerating counting process. – Orthogonal and can be integrated into our method

Optimal Pivotal Prefix Selection Recursive formula: Dynamic Programming: 1. First sort all the q-grams in prefix by their start positions and denote the k-th q-gram as g k 2. Let f(m,n) denote the optimal sum inverted list lengths to select n disjoint grams from the first m grams in the prefix.