VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams VLDB 2007 Chen Li (UC, Irvine) Bin Wang (Northeastern.

Presentation transcript:

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
VLDB 2007
Chen Li (UC Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University)
Presented by Jae-won Lee

Copyright © 2006 by CEBT
Introduction
• Many applications have an increasing need to support approximate string queries on data collections.
• Examples of approximate string queries:
  - Data cleaning: the same entity can be represented in slightly different forms, e.g. “PO BOX 23” and “P.O. Box 23”.
  - Query relaxation: errors in the query, inconsistencies in the data, or limited knowledge about the data, e.g. “Steven Spielburg” and “Steve Spielberg”.
  - Spellchecking: find potential candidates for a possibly mistyped word.
IDS Lab. Seminar, Center for E-Business Technology

Introduction
• Dilemma of choosing the gram length: the gram length can greatly affect the performance of string matching.
• Increasing the gram length:
  - makes the inverted lists shorter, which may decrease the time needed to merge them;
  - lowers the threshold on the number of common grams, which makes the filter less selective.
• Example strings: rich, stick, stich, stuck, static
  - 2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc (# of common grams >= 3)
  - 3-grams: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck (# of common grams >= 1)
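
The trade-off above can be checked on the five example strings. The sketch below is illustrative only: the function names and the string ids 0..4 are assumptions, not part of the slides. It builds one inverted index per gram length and compares the average list length.

```python
from collections import defaultdict

def qgrams(s, q):
    """Return the fixed-length q-grams of s, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_index(strings, q):
    """Map each q-gram to the sorted ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

def average_list_length(index):
    """Average number of string ids per inverted list."""
    return sum(len(ids) for ids in index.values()) / len(index)

# Ids 0..4 are assigned by position here; the slide does not state them.
strings = ["rich", "stick", "stich", "stuck", "static"]
index2 = build_inverted_index(strings, 2)
index3 = build_inverted_index(strings, 3)

print(sorted(index2))  # the ten 2-grams listed on the slide
print(sorted(index3))  # the eleven 3-grams listed on the slide
print(average_list_length(index2), average_list_length(index3))
```

Longer grams give more, shorter lists, so merging is cheaper, but each string then shares fewer grams with a query.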

VGRAM: Main Idea
• We analyze the frequencies of variable-length grams in the strings and select a set of grams, called the gram dictionary.
• For a string, we generate a set of grams of variable lengths using the gram dictionary.
• Challenges:
  - How to generate variable-length grams?
  - How to construct a high-quality gram dictionary?
  - What is the relationship between the similarity of two strings and the similarity of their gram sets?
  - How to adopt VGRAM in existing algorithms?

Challenge 1: Generating Variable-Length Grams
• Example: string s = universal, D = {ni, ivr, sal, uni, vers}, q_min = 2, q_max = 4
  - Start at position p = 1 with VG = {}.
  - The longest substring starting at u that appears in D is uni, so the algorithm inserts (1, uni).
  - Move to the next character n: the longest substring is ni. However, the candidate (2, ni) is subsumed by the previous gram, so the algorithm does not insert it into VG.
  - Move to the next character i: no substring starting at this character matches a gram in D, so the algorithm produces (3, iv), of length q_min = 2.
  - Final set: VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
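
The walk-through above can be sketched in a few lines of Python. This is a minimal reading of the slide, not the authors' code: the function name `vgen` is hypothetical, positions are 1-based as on the slide, and subsumption is checked by tracking the furthest end position seen so far, which is sufficient because start positions only increase.

```python
def vgen(s, dictionary, q_min, q_max):
    """Generate positional variable-length grams of s using a gram dictionary."""
    grams = []
    max_end = 0  # 1-based end position of the furthest-reaching gram kept so far
    for p in range(1, len(s) - q_min + 2):
        gram = None
        # Try the longest candidate first, without running past the end of s.
        for length in range(min(q_max, len(s) - p + 1), q_min - 1, -1):
            candidate = s[p - 1:p - 1 + length]
            if candidate in dictionary:
                gram = candidate
                break
        if gram is None:
            # No dictionary gram starts here: fall back to the q_min-gram.
            gram = s[p - 1:p - 1 + q_min]
        end = p + len(gram) - 1
        # A new gram is subsumed if an earlier gram already reaches its end.
        if end > max_end:
            grams.append((p, gram))
            max_end = end
    return grams

D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", D, q_min=2, q_max=4))
```

On the slide's input this yields the same four positional grams as the final set VG(s).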

Challenge 2: Constructing Gram Dictionary
• Step 1: Collecting the frequencies of grams with length in [q_min = 2, q_max = 4]
[Figure: frequency trie. Recoverable entries: st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3. Label: Leaf node]
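
As a rough illustration of Step 1, the sketch below counts every gram with length in [q_min, q_max]. It swaps the trie from the slide for a flat counter, which is an assumption made for brevity; the function name is hypothetical, and the input strings are the five example strings from the gram-length slide.

```python
from collections import Counter

def collect_gram_frequencies(strings, q_min, q_max):
    """Count occurrences of every gram whose length lies in [q_min, q_max]."""
    freq = Counter()
    for s in strings:
        for start in range(len(s)):
            for length in range(q_min, q_max + 1):
                if start + length <= len(s):
                    freq[s[start:start + length]] += 1
    return freq

strings = ["rich", "stick", "stich", "stuck", "static"]
freq = collect_gram_frequencies(strings, q_min=2, q_max=4)
for gram in ["st", "sti", "stic", "stu", "stuc"]:
    print(gram, freq[gram])
```

A gram can never be more frequent than any of its prefixes, which is the property a trie exposes directly along each root-to-leaf path.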

Challenge 2: Constructing Gram Dictionary
• Step 2: Selecting high-quality grams
  - If a gram g has a low frequency, we eliminate all the extended grams of g from the tree.
  - If a gram is very frequent, we keep some of its extended grams.

Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
  - Case: the frequency of a node (one that has a leaf node) is ≤ T
[Figure: trie with the pruned nodes marked "removed"]

Challenge 2: Constructing Gram Dictionary
• Pruning the tree using a frequency threshold T = 2
  - Case: the frequency of a node (one that has a leaf node) is > T
  - A pruning policy selects a maximal subset of children to remove:
    - SmallFirst: choose children with the smallest frequencies
    - LargeFirst: choose children with the largest frequencies
    - Random: randomly choose children so that L.freq is not greater than T
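
The three policies differ only in the order in which children are considered. The sketch below is one possible reading, under the assumption that the frequency of each removed child is added to the leaf L and that the resulting L.freq must stay at or below T. The function name, the dictionary of child frequencies, and the sample numbers are all hypothetical.

```python
import random

def select_children_to_prune(child_freqs, leaf_freq, threshold,
                             policy="SmallFirst", rng=None):
    """Greedily pick children to remove while leaf_freq + removed stays <= threshold."""
    items = list(child_freqs.items())
    if policy == "SmallFirst":
        items.sort(key=lambda kv: kv[1])
    elif policy == "LargeFirst":
        items.sort(key=lambda kv: kv[1], reverse=True)
    elif policy == "Random":
        (rng or random).shuffle(items)
    else:
        raise ValueError(f"unknown policy: {policy}")

    removed, total = [], leaf_freq
    for label, freq in items:
        # Skip children that would push the leaf frequency above the threshold.
        if total + freq <= threshold:
            removed.append(label)
            total += freq
    return removed, total

children = {"a": 1, "b": 2, "c": 4}  # made-up frequencies for illustration
print(select_children_to_prune(children, 0, 5, "SmallFirst"))
print(select_children_to_prune(children, 0, 5, "LargeFirst"))
print(select_children_to_prune(children, 0, 5, "Random", random.Random(0)))
```

Because the running total only grows, any child skipped during the pass would still not fit afterwards, so the selected subset is maximal in the sense that no further child can be added.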

Challenge 3: Similarity of Gram Sets
• Analyzing the effect of an edit operation on the i-th character on the positional grams
  - These effects are stored in the NAG vector (the vector of the number of affected grams).
  - Category 1: positional gram (p, g) with p < i - q_max + 1 or p + |g| - 1 > i + q_max - 1
  - Category 2: p ≤ i ≤ p + |g| - 1
  - Category 3: positional gram (p, g) on the left of the i-th character
  - Category 4: positional gram (p, g) on the right of the i-th character
[Figure: string s with a deletion at position i and the window from i - q_max + 1 to i + q_max - 1; regions labelled Category 1, Category 3, Category 2, Category 4, Category 1]

Challenge 3: Similarity of Gram Sets
• Example: s = universal, D = {ni, ivr, sal, uni, vers}, q_min = 2, q_max = 4
  - VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
  - Consider a deletion of the 5th character, e, in the string s: i - q_max + 1 = 2 and i + q_max - 1 = 8.
  - Positional grams (1, uni) and (7, sal) are in Category 1: the starting position of (1, uni) is before 2, and the ending position of (7, sal) is after 8. These grams are not affected by the deletion.
  - (4, vers) is in Category 2.
  - (3, iv) is in Category 3: since D contains an extension of iv (ivr), (3, iv) could be affected by the deletion (potentially affected).
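
The categories in this example can be reproduced with a small classifier. This is a sketch under one assumption: Category 1 is read as "the gram starts before i - q_max + 1 or ends after i + q_max - 1", which is the reading that matches the example's remark about positions 2 and 8. The function name is hypothetical.

```python
def deletion_category(p, gram, i, q_max):
    """Classify positional gram (p, gram) relative to a deletion at 1-based position i."""
    end = p + len(gram) - 1
    if p < i - q_max + 1 or end > i + q_max - 1:
        return 1  # outside the window around i: not affected
    if p <= i <= end:
        return 2  # contains the deleted character: affected
    if end < i:
        return 3  # left of i, inside the window: potentially affected
    return 4      # right of i, inside the window: potentially affected

vg = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
for p, g in vg:
    print((p, g), deletion_category(p, g, i=5, q_max=4))
```

For Categories 3 and 4 the classifier only says "potentially affected"; deciding whether a gram really changes requires looking up its extensions in the dictionary, as the (3, iv) case shows.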

Challenge 3: Similarity of Gram Sets
• Number of grams affected by each operation
  - We want to transform string s into string s' with 2 edit operations: at most 4 grams can be affected.
[Figure: the string universal with gaps between characters (_ u _ n _ i _ v _ e _ r _ s _ a _ l _); deletions/substitutions apply to characters and insertions apply to gaps; a table relates the number of edit operations to the number of affected grams]

Challenge 4: Adopting VGRAM Technique
• Example of an algorithm based on inverted lists
  - Query: EditDistance(shtick, ?) ≤ 1
  - Strings: rich, stick, stich, stuck, static
  - VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted using the gram dictionary
  - With 2-grams (inverted lists: …, ck, ic, …, ti, …): # of common grams = (|s1| - q + 1) - k * q = (6 - 2 + 1) - 1 * 2 = 3
  - With 2-4 grams (inverted lists: …, ck, ic, ich, …, tic, tick, …): # of common grams = |VG(q)| - NAG(q, k) = 3 - 2 = 1
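
The two lower bounds on this slide can be written as small functions. The sketch below also applies the fixed-length bound as a count filter on the five example strings. It counts common grams by scanning each string rather than merging inverted lists, and it ignores gram positions; both are simplifications. The function names are hypothetical, and NAG(q, k) = 2 is taken from the slide rather than computed.

```python
from collections import Counter

def fixed_gram_lower_bound(query_len, q, k):
    """Lower bound on common q-grams for edit distance <= k: (|s| - q + 1) - k * q."""
    return (query_len - q + 1) - k * q

def vgram_lower_bound(num_query_grams, nag):
    """Lower bound with variable-length grams: |VG(q)| - NAG(q, k)."""
    return num_query_grams - nag

def gram_bag(s, q):
    """Multiset of the fixed-length q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def count_filter(query, strings, q, k):
    """Keep the strings that share at least the lower bound of q-grams with the query."""
    bound = fixed_gram_lower_bound(len(query), q, k)
    query_grams = gram_bag(query, q)
    candidates = []
    for s in strings:
        common = sum((query_grams & gram_bag(s, q)).values())
        if common >= bound:
            candidates.append(s)
    return candidates

strings = ["rich", "stick", "stich", "stuck", "static"]
print(fixed_gram_lower_bound(len("shtick"), q=2, k=1))
print(count_filter("shtick", strings, q=2, k=1))
print(vgram_lower_bound(num_query_grams=3, nag=2))
```

Strings that pass the filter are only candidates; a real search would still verify each one by computing the edit distance to the query.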

Experiments
• Data sets
  - Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
  - Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
  - Data set 3: DBLP Bibliography. 277K titles, average length = 62.

VGRAM Overhead
• Data set 3
[Figure panels: Index Size; Construction Time]

Benefits of Using Variable-Length Grams
• Data set 1
[Figure panels: Construction Time/Size; Query Time]

Effect of q_max
• Data set 1
[Figure panels: Construction Time / Query Time; Query Performance]

Effect of Frequency Threshold
• Data set 1
[Figure panels: Construction Time; Index Size; Query Time]

Conclusion
• We developed VGRAM to improve the performance of approximate string queries.
  - Variable-length grams, high-quality grams
• We gave a full specification of the technique:
  - Index structure
  - How to generate grams for a string using the index structure
  - Relationship between the similarity of two strings and the similarity of their grams
• We showed how to adopt this technique in a variety of existing algorithms.