Efficient Approximate Search on String Collections Part I

Presentation transcript:

Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li

DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Try their names (good luck!): Yannis Papakonstantinou (UCSD), Meral Ozsoyoglu (Case Western), Marios Hadjieleftheriou (AT&T Research). http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html



Better system? http://dblp.ics.uci.edu/authors/

People Search at UC Irvine http://psearch.ics.uci.edu/

Web Search: errors in queries, errors in data. Bring query and meaningful results closer together. Actual queries gathered by Google: http://www.google.com/jobs/britney.html

Data Cleaning: relation R contains informix, microsoft, …; relation S contains the misspelled variants infromix, …, mcrosoft.

Problem Formulation: find strings similar to a given string, i.e., dist(Q,D) <= δ. Example: find strings similar to “hadjeleftheriou”. Performance is important! 10 ms per query: 100 queries per second (QPS); 5 ms per query: 200 QPS.

Outline (Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion

Next… Preliminaries

Similarity Functions. “Similar to”: a domain-specific function returns a similarity value between two strings. Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE. See [KSS06] for an excellent survey.

Edit Distance: a widely used metric to define string similarity. ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2. Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2.
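The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration rather than code from the tutorial; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] = distance between the first i-1 characters of s1 and the first j of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning i characters of s1 into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, matching the slide's example
```

The table is kept as two rows, so memory is proportional to the length of the second string.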

Next… Gram-based algorithms: list-merging algorithms [LLL08]; variable-length grams (VGRAM) [LWY07, YWL08]

“q-grams” of strings. Example: universal has the 2-grams un, ni, iv, ve, er, rs, sa, al.
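A minimal sketch of gram extraction, assuming a sliding window with no padding characters (the helper name `qgrams` is hypothetical):

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, which is the quantity the count-filter bound later in the deck starts from.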

Edit operation’s effect on grams. Fixed length q (example string: universal): k operations could affect k * q grams. If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q.
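The bound can be evaluated directly. The sketch below only restates the slide's formula; the function name is an assumption.

```python
def min_common_grams(length, q, k):
    """Lower bound on shared q-grams for a string of this length within edit distance k."""
    return (length - q + 1) - k * q


# The query "shtick" (length 6) with k = 1:
print(min_common_grams(6, q=2, k=1))  # 3
print(min_common_grams(6, q=3, k=1))  # 1
```

When the bound drops to zero or below, counting common grams cannot prune any string, which is one reason the choice of q matters.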

q-gram inverted lists. Strings (with ids): rich, stick, stich, stuck, static. 2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc. Each 2-gram points to the list of ids of the strings that contain it.

Searching using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. Query 2-grams: sh, ht, ti, ic, ck; of these, ti, ic, ck have inverted lists. # of common grams >= 3. Indexed 2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc. Strings: rich, stick, stich, stuck, static.
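The search on this slide can be sketched end to end as follows. The 0-based string ids, the function names, and the simple counting loop are assumptions made for illustration; survivors of the count filter are only candidates and would still need an edit-distance check.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # each id appears once per list
            index[gram].append(sid)
    return index


def candidates(query, index, num_strings, q=2, k=1):
    """Ids whose number of grams shared with the query reaches the count bound."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return list(range(num_strings))  # the bound cannot prune anything
    counts = defaultdict(int)
    for gram in qgrams(query, q):
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, q=2)
print([strings[i] for i in candidates("shtick", index, len(strings), q=2, k=1)])
# ['stick']
```

Only “stick” shares three grams (ti, ic, ck) with the query; the other strings share at most two.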

T-occurrence Problem: merge the inverted lists (each in ascending order) and find elements whose occurrences ≥ T.

Example: T = 4. Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13, 15). Result: 13.

List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

Heap-based Algorithm: push the current element of each list to a min-heap; count # of occurrences of each element using the heap.
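A minimal sketch of the heap-based merge, assuming each list is strictly ascending. The function name and the tuple layout in the heap are illustrative choices, not taken from the slides.

```python
import heapq


def heap_merge(lists, T):
    """Return elements that occur in at least T of the ascending lists."""
    # Heap entries are (element, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every copy of the smallest element and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.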

MergeOpt Algorithm [SK04]: treat the T-1 longest lists as long lists and the rest as short lists; merge the short lists, then use binary search on the long lists.

Example of MergeOpt. Count threshold ≥ 4. Long lists: 3; short lists: 2. Elements visible in the figure: 1 3 5 10 13, 10 13 15, 5 7 13, 13 15.
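The long/short split can be sketched as below. This is a simplified illustration: the short lists are counted with a `Counter` rather than merged with a heap, the function name is an assumption, and the four lists are the ones recoverable from the earlier T-occurrence example (the exact five-list layout of the figure is not recoverable from the transcript).

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return elements occurring in at least T lists, probing the T-1 longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # An answer needs T occurrences but there are only T-1 long lists,
    # so every answer must appear in at least one short list.
    counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for value in sorted(counts):
        count = counts[value]
        for lst in long_lists:
            pos = bisect_left(lst, value)  # binary search in a long list
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_opt(lists, T=4))  # [13]
```

The long lists are never scanned in full; each candidate costs one binary search per long list.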

ScanCount. Count threshold ≥ 4. Keep an array of # of occurrences indexed by string id; scan every list and increment the entry of each id by 1. Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13, 15). Id 13 reaches count 4: result! Id 15 reaches only count 2.
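A minimal sketch of ScanCount, assuming string ids are small non-negative integers so they can index an array directly (`num_ids` and the function name are assumptions):

```python
def scan_count(lists, T, num_ids):
    """Return ids that occur in at least T lists, using one counter per id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(scan_count(lists, T=4, num_ids=16))  # [13]
```

No heap or sorting of the lists is needed; the trade-off is a counter array as large as the id space.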

List-Merging Algorithms: HeapMerger, MergeOpt [SK04]; ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

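MergeSkip algorithm [BK02, LLL08]: keep the current element of each list in a min-heap; when the top element cannot reach the threshold, pop T-1 elements and jump each popped list to its first element greater than or equal to the new heap top.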

Example of MergeSkip. Count threshold ≥ 4. [Figure: a min-heap over the current list elements; popped lists jump ahead, skipping elements that cannot reach the threshold.]
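The pop-and-jump step can be sketched as follows, assuming strictly ascending lists. This is an illustrative reading of the slides, with hypothetical names, not the authors' code.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return elements occurring in at least T lists, skipping hopeless elements."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []  # (list index, position) of every popped entry
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            popped.append((i, pos))
        if len(popped) >= T:
            result.append(value)
            for i, pos in popped:  # advance each of these lists by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Too few copies: pop more entries until T-1 lists have been popped.
        while heap and len(popped) < T - 1:
            _, i, pos = heapq.heappop(heap)
            popped.append((i, pos))
        if not heap:
            break  # fewer than T lists remain, so no further answers exist
        target = heap[0][0]
        # Anything below target can occur only in the T-1 popped lists: skip it.
        for i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_skip(lists, T=4))  # [13]
```

With T = 4 the first round pops 1, 5, and 10 and jumps all three lists straight to 13, so elements 3 and 7 are never visited.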

DivideSkip Algorithm [LLL08]: split the lists into long lists and short lists; run MergeSkip on the short lists and binary search on the long lists.

How many lists are treated as long lists?

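Length Filtering: if Ed(s,t) ≤ 2, the lengths of s and t can differ by at most 2, so a candidate t can be pruned by length only! (Slide example: a string of length 10.)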
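A minimal sketch of the length filter. It relies on the fact that each edit operation changes a string's length by at most one; the function name and the sample strings are assumptions.

```python
def passes_length_filter(s, t, k):
    """False means ed(s, t) > k for certain; True means t is still a candidate."""
    return abs(len(s) - len(t)) <= k


query = "shtick"
data = ["rich", "stick", "stich", "stuck", "static", "it"]
print([t for t in data if passes_length_filter(query, t, k=1)])
# ['stick', 'stich', 'stuck', 'static']
```

The check costs nothing beyond two length lookups, so it is typically applied before any gram-based filter.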

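Positional Filtering: Ed(s,t) ≤ 2. The gram ab occurs at position 1 in s, written (ab,1), and at position 12 in t, written (ab,12). The positions differ by more than 2, so these two grams cannot be counted as a match.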
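A minimal sketch of the positional test on (gram, position) pairs; the helper names are hypothetical and positions are 0-based in the generated grams.

```python
def positional_qgrams(s, q=2):
    """Return (gram, start position) pairs."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def can_match(pg1, pg2, k):
    """Two positional grams may correspond only if equal and at most k positions apart."""
    (g1, p1), (g2, p2) = pg1, pg2
    return g1 == g2 and abs(p1 - p2) <= k


print(can_match(("ab", 1), ("ab", 12), k=2))  # False: positions are 11 apart
print(can_match(("ab", 1), ("ab", 3), k=2))   # True
```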

A filter tree Combine filters with list-merging algorithms [LLL08]

Next… Variable-length grams (VGRAM) [LWY07, YWL08]

2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?) ≤ 1. Query 3-grams: sht, hti, tic, ick; of these, tic and ick have inverted lists. # of common grams >= 1. Indexed 3-grams: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck. Strings: rich, stick, stich, stuck, static.

Observation 1: dilemma of choosing “q”. Increasing “q” causes: longer grams → shorter lists; smaller # of common grams of similar strings.

Observation 2: skewed distribution of gram frequencies. DBLP: 276,699 article titles. Popular 5-grams: ation (>114K times), tions, ystem, catio.

VGRAM: Main idea. Grams with variable lengths (between qmin and qmax). zebra: ze (123). corrasion: co (5213), cor (859), corr (171). Advantages: reduces index size; reduces running time; adoptable by many algorithms.

Challenges: Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms?

Challenge 1: String → variable-length grams? Fixed-length 2-grams of universal: un, ni, iv, ve, er, rs, sa, al. Variable-length grams: decompose universal using a [2,4]-gram dictionary: ni, ivr, sal, uni, vers.
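One way to decompose a string with such a dictionary is sketched below: at each position take the longest dictionary gram starting there, fall back to the qmin-gram when none matches, and keep a gram only if it is not contained in a gram already kept. This follows the general idea of VGRAM but is an assumption-laden illustration; it uses a plain set instead of a trie, and `vgen` is a hypothetical name.

```python
def vgen(s, dictionary, qmin, qmax):
    """Return positional variable-length grams (start, gram) with 0-based starts."""
    grams = []
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no dictionary gram starts at p
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:  # longest match first
                gram = s[p:p + length]
                break
        end = p + len(gram)
        # Skip a gram that lies entirely inside an already selected gram.
        subsumed = any(start <= p and end <= start + len(g) for start, g in grams)
        if not subsumed:
            grams.append((p, gram))
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under these assumptions the string yields four grams instead of the eight fixed-length 2-grams.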

Representing gram dictionary as a trie: ni, ivr, sal, uni, vers

Step 2: Constructing a gram dictionary (qmin = 2, qmax = 4). Frequency-based [LWY07]; cost-based [YWL08].

Challenge 3: Edit operation’s effect on grams. Fixed length q (example string: universal): k operations could affect k * q grams.

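Deletion affects variable-length grams: for a deletion at position i, only grams overlapping the window from i-qmax+1 to i+qmax-1 can be affected; grams entirely to the left or right of that window are not affected.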

Main idea: For a string, for each position, compute the number of grams that could be destroyed by an operation at this position. Compute the number of grams possibly destroyed by k operations. Store these numbers (for all data strings) as part of the index. Example: vector of s = <2,4,6,8,9>; with 2 edit operations, at most 4 grams can be affected. Use this number to do count filtering.

Summary of VGRAM index

Challenge 4: adopting VGRAM. Easily adoptable by many algorithms. Basic interfaces: String s → grams; strings s1, s2 such that ed(s1,s2) <= k → min # of their common grams.

Lower bound on # of common grams. Fixed length (q), example string: universal. If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q. Variable lengths: # of grams of s1 - NAG(s1,k), where NAG(s1,k) is the number of grams of s1 that k edit operations can affect.
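The variable-length bound can be sketched as a lookup into the stored vector. The vector <2,4,6,8,9> comes from the earlier slide; the gram count of 9 and the function name are hypothetical values chosen for illustration.

```python
def vgram_lower_bound(num_grams, nag_vector, k):
    """num_grams minus NAG(s, k), where nag_vector[k-1] stores NAG(s, k)."""
    if k == 0:
        return num_grams
    if k > len(nag_vector):
        raise ValueError("NAG vector does not cover this many edit operations")
    return num_grams - nag_vector[k - 1]


nag_vector = [2, 4, 6, 8, 9]  # from the slide: 2 operations affect at most 4 grams
print(vgram_lower_bound(9, nag_vector, k=2))  # 5, for a hypothetical string with 9 grams
```

Because the vector is stored per data string in the index, the bound is available at query time without recomputing which grams an edit could touch.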

Example: algorithm using inverted lists. Query: “shtick”, ED(shtick, ?) ≤ 1. With 2-grams (lists …, ck, ic, ti): lower bound = 3. With [2,4]-grams (query grams sh, ht, tick; lists …, ck, ic, ich, tic, tick): lower bound = 1. Strings: rich, stick, stich, stuck, static.

End of Part I. Outline (Part I and Part II): Motivation; Preliminaries; Trie-based approach; Gram-based algorithms; Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion