Efficient Approximate Search on String Collections, Part I. Marios Hadjieleftheriou, Chen Li



DBLP Author Search

Try their names (good luck!): Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research)

 4

Better system?

People Search at UC Irvine

Web Search
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
(Actual queries gathered by Google)

Data Cleaning
Relation R: informix, microsoft, …
Relation S: infromix, …, mcrosoft, …

Problem Formulation: find strings similar to a given string: dist(Q,D) <= δ. Example: find strings similar to “hadjeleftheriou”. Performance is important!
- 10 ms per query: 100 queries per second (QPS)
- 5 ms per query: 200 QPS

Outline (Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion

Next: Preliminaries

Similarity Functions. “Similar to”: a domain-specific function that returns a similarity value between two strings. Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE. See [KSS06] for an excellent survey.

Edit Distance: a widely used metric to define string similarity. Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2. Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2.
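The definition on this slide can be checked with the textbook dynamic program. The sketch below is a minimal illustration, not code from the tutorial; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # the slide's example pair
```

The two-row formulation keeps memory proportional to the length of `s2`, which is enough when only the distance (not the alignment) is needed.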

Next: Gram-based algorithms. List-merging algorithms [LLL08]; variable-length grams (VGRAM) [LWY07, YWL08]

“q-grams” of strings. Example: the 2-grams of “universal” are un, ni, iv, ve, er, rs, sa, al.
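As a small illustration of this slide, overlapping q-grams can be produced with a sliding window. The helper name `qgrams` is an assumption, and no padding characters are added at the string boundaries.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n has n - q + 1 such grams, which is the quantity that appears in the count bound used later in the deck.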

Edit operation’s effect on grams: with fixed length q, k operations could affect k * q grams (example string: universal). If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q.
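The count bound on this slide is simple arithmetic: a string of length |s1| has |s1| - q + 1 grams, and each of the k operations can destroy at most q of them. A minimal sketch, with the hypothetical helper name `min_common_grams`:

```python
def min_common_grams(length: int, q: int, k: int) -> int:
    """Lower bound on shared q-grams for strings within edit distance k.

    Clamped at 0: a non-positive bound means the count filter prunes nothing.
    """
    return max(0, (length - q + 1) - k * q)


# Query "shtick" (length 6) with edit distance threshold 1:
print(min_common_grams(6, q=2, k=1))
print(min_common_grams(6, q=3, k=1))
```

For the query “shtick” with k = 1, this gives the thresholds quoted on the later search slides for 2-grams and 3-grams.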

q-gram inverted lists. Strings (one id each): rich, stick, stich, stuck, static. 2-grams with an inverted list: at, ch, ck, ic, ri, st, ta, ti, tu, uc.
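A sketch of how such an index could be built for the five strings on this slide. The string ids did not survive in the transcript, so the 0-based ids used here are an assumption, as is the function name `build_index`.

```python
from collections import defaultdict


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):   # ids assumed to be 0, 1, 2, ...
        for gram in set(qgrams(s, q)):  # one posting per (gram, string)
            index[gram].append(sid)
    return dict(index)


index = build_index(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list is already sorted, which the merging algorithms later in the deck rely on.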

Searching using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. Strings: rich, stick, stich, stuck, static. 2-grams in the index: at, ch, ck, ic, ri, st, ta, ti, tu, uc. Query 2-grams: sh, ht, ti, ic, ck; matching lists: ti, ic, ck. # of common grams >= 3.
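The search on this slide can be sketched as filter-then-verify: count how often each string id appears on the query's gram lists, keep ids that reach the count threshold, and confirm survivors with a real edit-distance computation. All names below are assumptions, and the counting step uses a plain `Counter` rather than any of the tuned merging algorithms.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    # Build the inverted index (in practice this is done once, offline).
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)

    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = range(len(strings))  # count filter cannot prune anything
    else:
        counts = Counter()
        # Every query gram occurrence votes; this may overcount, which only
        # adds false positives that verification removes.
        for gram in qgrams(query, q):
            counts.update(index.get(gram, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Verification step: keep only true matches.
    return [strings[sid] for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


print(search(["rich", "stick", "stich", "stuck", "static"], "shtick", k=1))
```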

T-occurrence Problem: find elements whose number of occurrences is ≥ T. The inverted lists are kept in ascending order and merged.

Example (figure): T = 4.

List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

Heap-based Algorithm: push the frontier of each list onto a min-heap and count the # of occurrences of each element as elements are popped.
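A minimal sketch of the heap-based merge, assuming each list is sorted ascending and contains no duplicates. The function name `heap_merge` and the example lists are illustrative, not taken from the slides.

```python
import heapq


def heap_merge(lists, T):
    """Return elements that occur on at least T of the sorted input lists."""
    # Heap entries: (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        # Push the next element of the list we just consumed from.
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13]]
print(heap_merge(lists, T=3))
```

Every element of every list passes through the heap once, so the cost grows with the total list length times the log of the number of lists; the skipping algorithms on the following slides try to avoid touching most of those elements.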

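MergeOpt Algorithm [SK04]: the T-1 longest lists are treated as long lists and the rest as short lists. Candidates found in the short lists are looked up in the long lists by binary search.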

Example of MergeOpt: count threshold ≥ 4 (T = 4); long lists: 3; short lists: 2.
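The long/short split can be sketched as follows. An element that reaches T occurrences must appear on at least one short list, because there are only T-1 long lists. This is a simplified illustration: the short lists are counted with a `Counter` instead of a heap, and the name `merge_opt` is an assumption.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """T-occurrence merge that only binary-searches the T-1 longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Every possible answer occurs at least once on a short list.
    counts = Counter(x for lst in short_lists for x in lst)

    result = []
    for x in sorted(counts):
        total = counts[x]
        for lst in long_lists:
            i = bisect_left(lst, x)  # binary search in a sorted long list
            if i < len(lst) and lst[i] == x:
                total += 1
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13]]
print(merge_opt(lists, T=3))
```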

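ScanCount: keep one counter (# of occurrences) per string id; scan every list and increment the counter of each id by 1. Count threshold ≥ 4 (T = 4): an id whose counter reaches the threshold is a result.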
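ScanCount needs no heap at all. The sketch below assumes string ids are integers in the range 0 to `num_strings` - 1 and that no id repeats within a list; the function name is illustrative.

```python
def scan_count(lists, T, num_strings):
    """Return ids that appear on at least T lists, using one counter per id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


# Lists for the grams ti, ic, ck over rich, stick, stich, stuck, static
# (ids assumed to be 0..4 in that order).
print(scan_count([[1, 2, 4], [0, 1, 2, 4], [1, 3]], T=3, num_strings=5))
```

The trade-off is an array as large as the collection in exchange for a single sequential pass over the lists.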

List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

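MergeSkip algorithm [BK02, LLL08]: keep the list frontiers in a min-heap. When the top element occurs fewer than T times, pop T-1 elements in total, then jump in each of those T-1 lists to the first element greater than or equal to the new top of the heap.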
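One way to realize the pop-and-jump step is sketched below, assuming sorted lists without duplicates. It follows the description on the slide; the name `merge_skip` and the implementation details are assumptions rather than the authors' code.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """T-occurrence merge that skips elements which cannot reach T occurrences."""
    pos = [0] * len(lists)  # current frontier position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists: no more answers
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:  # advance each popped list by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop more frontiers until T-1 lists are out of the heap.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            # Anything below target lives on at most T-1 lists: jump past it.
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13]]
print(merge_skip(lists, T=3))
```

The jump is safe because any value smaller than the new heap top can only occur on the T-1 popped lists, so it can never reach T occurrences.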

Example of MergeSkip (figure): count threshold ≥ 4 (T = 4), min-heap, jump.

DivideSkip Algorithm [LLL08]: split the lists into long lists and short lists; run MergeSkip on the short lists and binary search the long lists.

How many lists are treated as long lists?

Length Filtering: Ed(s,t) ≤ 2. s has length 19 and t has length 10; the lengths differ by more than 2, so the pair is pruned by length only!
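Length filtering follows from the fact that each edit operation changes the length of a string by at most one. A minimal sketch, with a hypothetical function name and placeholder strings of the two lengths from the slide:

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


s = "x" * 19  # placeholder strings with the lengths shown on the slide
t = "y" * 10
print(passes_length_filter(s, t, k=2))
```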

Positional Filtering: Ed(s,t) ≤ 2. s contains (ab,1) and t contains (ab,12); the two occurrences of the gram ab are too far apart to count as a common gram.
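Positional filtering can be sketched by attaching a position to each gram and only matching equal grams whose positions differ by at most k. The helper names are assumptions, and positions here are 0-based offsets.

```python
def positional_qgrams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """q-grams paired with their starting position."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def can_match(g1: tuple[str, int], g2: tuple[str, int], k: int) -> bool:
    """Two positional grams can correspond only if equal and within k positions."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(can_match(("ab", 1), ("ab", 12), k=2))  # the pair from the slide
print(positional_qgrams("stick"))
```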

Combine filters with list-merging algorithms [LLL08]: a filter tree

Next: Variable-length grams (VGRAM) [LWY07, YWL08]

2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?) ≤ 1. Strings: rich, stick, stich, stuck, static. 3-grams in the index: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck. Query 3-grams: sht, hti, tic, ick; matching lists: tic, ick. # of common grams >= 1.

Observation 1: dilemma of choosing “q”. Increasing “q” causes: longer grams -> shorter lists; a smaller # of common grams between similar strings.

Observation 2: skewed distributions of gram frequencies. DBLP: 276,699 article titles. Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea. Grams with variable lengths (between q_min and q_max). Examples: zebra - ze(123); corrasion - co(5213), cor(859), corr(171). Advantages: reduce index size; reduce running time; adoptable by many algorithms.

Challenges: (1) Generating variable-length grams? (2) Constructing a high-quality gram dictionary? (3) Relationship between string similarity and their gram-set similarity? (4) Adopting VGRAM in existing algorithms?

Challenge 1: String -> variable-length grams? Fixed-length 2-grams of “universal” vs. variable-length grams of “universal” under a [2,4]-gram dictionary: ni, ivr, sal, uni, vers.
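One plausible way to decompose a string with such a dictionary is a greedy left-to-right scan: at each position take the longest dictionary gram that starts there, fall back to the q_min-gram when none matches, and drop any gram that lies entirely inside one already emitted. The sketch below is written in the spirit of [LWY07] but its details are assumptions, as is the function name `vgen`; a set lookup stands in for the trie.

```python
def vgen(s, dictionary, q_min, q_max):
    """Greedy decomposition of s into variable-length positional grams."""
    grams = []
    max_end = -1  # rightmost end position covered by an emitted gram
    for p in range(len(s) - q_min + 1):
        gram = s[p:p + q_min]  # default: the q_min-gram at this position
        # Try the longest candidate first, down to length q_min.
        for length in range(min(q_max, len(s) - p), q_min - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram) - 1
        # Earlier grams start at or before p, so a gram ending at or before
        # max_end is contained in one of them and is skipped.
        if end > max_end:
            grams.append((p, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, q_min=2, q_max=4))
```

With this dictionary, “universal” yields far fewer grams than the eight fixed-length 2-grams, which is where the index-size saving comes from.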

Representing the gram dictionary as a trie: ni, ivr, sal, uni, vers

Challenge 2: Constructing a gram dictionary (q_min = 2, q_max = 4). Frequency-based [LWY07]; cost-based [YWL08].

Challenge 3: Edit operation’s effect on grams. With fixed length q, k operations could affect k * q grams (example string: universal).

Deletion affects variable-length grams: a deletion at position i can affect only grams within [i - q_max + 1, i + q_max - 1]; grams outside that window are not affected.

Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute the number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
Example (vector of s): with 2 edit operations, at most 4 grams can be affected. Use this number to do count filtering.

Summary of VGRAM index

Challenge 4: adopting VGRAM. Easily adoptable by many algorithms. Basic interfaces: String s -> grams; Strings s1, s2 such that ed(s1,s2) <= k -> min # of their common grams.

Lower bound on # of common grams, if ed(s1,s2) <= k. Fixed length (q): (|s1| - q + 1) - k * q (example string: universal). Variable lengths: # of grams of s1 - NAG(s1,k).

Example: algorithm using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1, over the strings rich, stick, stich, stuck, static.
2-grams: inverted lists …, ck, ic, …, ti, …; lower bound = 3.
2-4 grams: query grams sh, ht, tick; inverted lists …, ck, ic, ich, …, tic, tick, …; matching list: tick; lower bound = 1.

End of Part I. Outline (Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion

References
[AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. VLDB 2006
[ACGK08] Incorporating String Transformations in Record Matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008
[BK02] Adaptive Intersection and t-Threshold Problems. Jérémy Barbay, Claire Kenyon. SODA 2002
[BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. ICDE 2009
[BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998
[CGG+05] Data Cleaning in Microsoft SQL Server. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE

[HYK+08] Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB
[JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB
[JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, Jianhua Feng. WWW 2009
[JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDB Journal 2008
[KSS06] Record Linkage: Similarity Measures and Algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD
[LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, Yiming Lu. ICDE
[LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007
[LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li, Bin Wang, Xiaochun Yang. VLDB 2007
[MBK+07] Estimating the Selectivity of Approximate String Queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS

[SK04] Efficient Set Joins on Similarity Predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004
[XWL08] Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008
[XWL+08] Efficient Similarity Joins for Near Duplicate Detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008
[YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. SIGMOD