Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China)

Slides:



Advertisements
Similar presentations
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Space-Constrained Gram-Based Indexing for Efficient.
Advertisements

String Similarity Measures and Joins with Synonyms
Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008.
Indexing DNA Sequences Using q-Grams
Efficient Interactive Fuzzy Keyword Search Shengyue Ji 1, Guoliang Li 2, Chen Li 1, Jianhua Feng 2 1 University of California, Irvine 2 Tsinghua University.
Database Group – CSE - UNSW 1 Efficient Error-tolerant Query Autocompletion Chuan Xiao 1, Jianbin Qin 2, Wei Wang 2, Yoshiharu Ishikawa 1, Koji Tsuda 3,
Computer Science and Engineering Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search Chengyuan Zhang 1,Ying Zhang 1,Wenjie Zhang 1, Xuemin.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Reference-based Indexing of Sequence Databases Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine University of Florida-Gainesville.
Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)
Searching on Multi-Dimensional Data
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.
1 NNH: Improving Performance of Nearest- Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research) Chen Li (UC Irvine)
Dong Deng, Guoliang Li, Jianhua Feng Database Group, Tsinghua University Present by Dong Deng A Pivotal Prefix Based Filtering Algorithm for String Similarity.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
ViST: a dynamic index method for querying XML data by tree structures Authors: Haixun Wang, Sanghyun Park, Wei Fan, Philip Yu Presenter: Elena Zheleva,
Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search Alexander Behm 1, Shengyue Ji 1, Chen Li 1, Jiaheng.
Efficient Type-Ahead Search on Relational Data: a TASTIER Approach Guoliang Li 1, Shengyue Ji 2, Chen Li 2, Jianhua Feng 1 1 Tsinghua University, Beijing,
Reza Sherkat ICDE061 Reza Sherkat and Davood Rafiei Department of Computing Science University of Alberta Canada Efficiently Evaluating Order Preserving.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore VLDB’2005 * Liang Jin and.
Liang Jin and Chen Li VLDB’2005 Supported by NSF CAREER Award IIS Selectivity Estimation for Fuzzy String Predicates in Large Data Sets.
1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring.
Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently Xiaochun Yang, Bin Wang Chen Li Northeastern.
Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.
Efficient Exact Set-Similarity Joins Arvind Arasu Venkatesh Ganti Raghav Kaushik DMX Group, Microsoft Research.
Subgraph Containment Search Dayu Yuan The Pennsylvania State University 1© Dayu Yuan9/7/2015.
Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.
Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,
DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
1 CPS216: Advanced Database Systems Notes 04: Operators for Data Access Shivnath Babu.
Experiments An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Entity Extraction A Document An Efficient Filter.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang Univ. of New South Wales, Austrailia ICDE ’09 9 Feb 2011 Taewhi Lee Based.
Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:
文本挖掘简介 邹权 博士,助理教授. Outline  Introduction  TF-IDF  Similarity.
Experiments Faerie: Efficient Filtering Algorithms for Approximate Dictionary-based Entity Extraction Entity Extraction A Document An Efficient Filter.
Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.
QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates Changqing Li,Tok Wang Ling.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore * Liang Jin and Chen Li:
Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.
Efficient Processing of Updates in Dynamic XML Data Changqing Li, Tok Wang Ling, Min Hu.
Improving Search for Emerging Applications * Some techniques current being licensed to Bimaple Chen Li UC Irvine.
Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.
Spatial Approximate String Search. Abstract This work deals with the approximate string search in large spatial databases. Specifically, we investigate.
CPS216: Data-intensive Computing Systems
RE-Tree: An Efficient Index Structure for Regular Expressions
Probabilistic Data Management
TT-Join: Efficient Set Containment Join
Pass-Join: A Partition based Method for Similarity Joins
Entity Matching : How Similar Is Similar?
Chuan Xiao, Wei Wang, Xuemin Lin
Top-k String Similarity Search with Edit-Distance Constraints
Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China)
Weighted Exact Set Similarity Join
Structure and Content Scoring for XML
Efficient Record Linkage in Large Data Sets
Jongik Kim1, Dong-Hoon Choi2, and Chen Li3
Structure and Content Scoring for XML
Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)
Relax and Adapt: Computing Top-k Matches to XPath Queries
An Efficient Partition Based Method for Exact Set Similarity Joins
Dong Deng+, Yu Jiang+, Guoliang Li+, Jian Li+, Cong Yu^
Dong Deng, Guoliang Li, He Wen, H. V. Jagadish, Jianhua Feng
Presentation transcript:

Trie-Join : Efficient Trie-based String Similarity Joins with Edit Distance Constraints Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)

Outline Motivation Preliminaries Trie-based Framework Trie-based Algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Real-world Data is Rather Dirty! Microsoft Academic Search Kenneth De Jong Kenneth Dejong PK http://academic.research.microsoft.com/Author/2037349.aspx http://academic.research.microsoft.com/Author/3054641.aspx 2018/9/17 Trie-Join @ VLDB2010

Real-world Data is Rather Dirty! DBLP Complete Search Typo in “author” Typo in “title” Argyrios Zymnis Argyris Zymnis relaxed 2018/9/17 Trie-Join @ VLDB2010 related

Similarity Joins The similarity join is an essential operation for data integration and cleaning Perform a similarity join on Name attribute (find all record pairs whose Name attributes are similar) Output: (2037349, 3054641), … Id Name Univ. 2037349 Kenneth De Jong George … … 3054641 Kenneth Dejong R 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based Algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Similarity Evaluation We use edit distance to quantify the string similarity ED(r,s) is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform r to s Dynamic Programming For example: ED(“kobe”, “ebey”)=3 1. kobe → obe (delete ‘k’ at the beginning) 2. obe → ebe (substitute ‘o' with ‘e') 3. ebe → ebey (insert ‘y' at the end) Verify ED(r, s) ≤ 1 Compute ED(r, s) Given an edit-distance threshold τ , we say r and s are similar if ED(r, s) ≤ τ 2018/9/17 Trie-Join @ VLDB2010

Problem Formulation For example: Suppose R=S. ------------------------------------------------------------ Input: S = {“bag”, “ebay”, “bay”, “kobe”, “koby”, “beagy”} τ = 1 Output: {<“kobe”, “koby”>, <“ebay”, “bay”>, <“bag”, “bay”>} String Similarity Join ----------------------------------------------- Input: two sets of strings R and S an edit-distance threshold τ Output: <r, s> ∈ R×S s.t. ED(r, s) ≤ τ Naïve Solution Enumerate all pairs <r, s> ∈ R×S and verify ED(r, s) ≤ τ --------------------------------------------------------------------------- |R|×|S| verifications are rather expensive!! For |R|=|S|=1m, it needs 1012 verifications. 2018/9/17 Trie-Join @ VLDB2010

Prior Work Signature-based methods Disadvantages Basic idea Framework ED(r, s) ≤ τ Sig(r) ∩Sig(s) ≠ φ Framework Filter pairs <r, s> s.t. Sig(r) ∩Sig(s) = φ Verify the survived pairs Disadvantages Large index Inverted index for signatures is large Low efficiency for short strings Can not select high-quality signatures for short strings As least one parameter need to be tuned E.g. tune the parameter q for q-gram signatures 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based Algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Trie Index Trie is a tree structure, in which Each path from the root to a leaf represents a string in the data set Every node on the path has a label of a character in the string Node 2 / Node “ba” / String “ba” 2018/9/17 Trie-Join @ VLDB2010

Subtrie Pruning Observation I Subtrie Pruning × Verify ED(r, s) ≤ 1 Whatever * is, Δ (≥2) is larger than τ =1. Given τ =1 , all strings starting with “ko” can not be similar with “ebay”. Given τ =1 , for the string “ebay”, we can prune the subtrie rooted at “ko”. 2018/9/17 Trie-Join @ VLDB2010

Computing Active-Node Set Node u is an active node of string s if ED(u, s) ≤ τ. E.g. Node “bay” is an active node of string “ebay”, but node “ko” is not. Active-Node Set For a string s, As is a set consisting of all the active nodes in the trie. 12 8 7 5 b a g y e 1 2 4 6 9 10 11 Incremental algorithm (www’09) 13 k A“ko”={13,14,15} 14 o 3 15 b A“kob”={14,15,16,17} 16 17 e y 2018/9/17 Trie-Join @ VLDB2010

Trie-Search Algorithm Observation II Share prefix “eb”.  Should we do subtrie pruning on S! R Trie T S bag bay beagy ebay kobe koby … ebags ebay ebey ebook … , A“ebags”={} , A“ebay”={4,11,12} , A“ebey”={12} , A“ebooks”={} , … <“ebay”, 4>, <“ebay”, 12> <“ebey”, 12> 2018/9/17 Trie-Join @ VLDB2010

Dual Subtrie Pruning Construct a trie for stings in both R and S Do subtrie pruning for strings in both R and S For example: Given τ =1 , all strings starting with “ko” can not be similar with the strings starting with “eb” 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based Algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Trie-Traverse We focus on self-join, that is R = S Algorithm Description Construct a trie index T for all strings in S (R) Traverse the trie T in pre-order Compute the active-node set of each visited node incrementally When reaching a leaf node, find leaf nodes in its active-node set and output the similar string pairs Benefits of pre-order trie traversal Compute active-node sets incrementally Discard active-node sets of the nodes whose descendants have been visited 2018/9/17 Trie-Join @ VLDB2010

Illustration of Trie-Traverse Consider τ = 1 and S = {“bag”, “ebay”, “bay”, “kobe”, “koby”, “beagy”}. {0,1,9,13} Depth ≤ τ 1 {0,1,2,5,9,10,13} 9 13 b e k 2 5 {1,2,3,4,5,6,11} 10 14 a e b o 3 4 6 11 15 {2,3,4,7} g y a a b {2,3,4,12} 7 Output: <3,4> 12 16 17 g Output: <4,12> y e y 8 y 2018/9/17 Trie-Join @ VLDB2010

Trie-Dynamic Basic idea: utilize active-node symmetry property Illustration of Trie-Dynamic Trie-Dynamic maintains the active-node sets of all trie nodes that involve large space!! Consider τ = 1 and S = {“bag”, “ebay”, “bay”, “kobe”, “koby”, “beagy”}. {0,1} b {1,2,3,6,8} {1,2,3,6} a 8 {2,3,8} {2,3} y {2,3,7,8} {6,7,8} {6,7} 2018/9/17 Trie-Join @ VLDB2010

Trie-PathStack b a g y e k o Motivation Basic Idea Trie-Traverse uses little memory space but involves unnecessary active-node computation Trie-Dynamic avoids repeated active-node computation but involves large memory space Basic Idea Virtual partial subtrie Runtime stack from current node to root node (τ = 1) 12 8 7 15 5 b a g y e k o 1 2 3 4 6 9 10 11 13 14 16 17 Virtual Partial Subtrie 2018/9/17 Trie-Join @ VLDB2010

Bi-Trie-PathStack Motivation Basic Idea Algorithm It is expensive to compute active-node sets for large edit-distance thresholds Basic Idea Consider a string r = “arnold schwarzeneger” and τ = 5. If a string s is similar to r within τ=5 (i.e. ed(r, s) ≤5), then either r’s first part “arnold sch” is similar to a prefix of s within 5/2=2 or r’s second part “warzeneger” is similar to a suffix of s within 5/2=2. Algorithm Perform Trie-PathStack twice within the half threshold Verify the survived pairs “arnold sch” “warzeneger” 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Pruning Techniques Length pruning Single-branch pruning Count pruning τ=1 Length pruning Prune v from Au if the difference of v’s range and u’s range is larger than τ Single-branch pruning Prune v from Au if u is the only child node of v Count pruning Prune v from Au if there’s only one string have both u and v as prefixes 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Support Data Update Incremental Similarity Join Input: a set of strings S, a set of strings ΔS, an edit-distance threshold τ Output: <r ∈ ΔS, s ∈ S∪ΔS > s.t ED(r, s) ≤ τ Illustration of incremental similarity join Consider τ = 1 and S = {“bag”, “ebay”, “bay”, “kobe”, “koby”, “beagy”} and ΔS = {“eby”}. 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Experiment Setup Data sets Existing algorithms Environment English Dict: English words from the Aspell spellchecker for Cygwin DBLP Author: Author names from DBLP dataset AOL Query Log: Queries from AOL dataset Existing algorithms All-Pairs-Ed[www’07] Ed-Join [vldb’08] Part-Enum [vldb’06] Environment C++ , GCC 4.2.3, Ubuntu Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory 2018/9/17 Trie-Join @ VLDB2010

Comparison of Four Trie-Based Algorithms Our trie-join algorithms outperform Trie-Search by 1~2 orders of magnitude Trie-PathStack performs the best 2018/9/17 Trie-Join @ VLDB2010

Trie-PathStack VS Ed-Join, All-Pairs-Ed, Part-Enum Index Size (MB) 3~5 times smaller 2018/9/17 Trie-Join @ VLDB2010

Trie-PathStack VS Ed-Join, All-Pairs-Ed, Part-Enum Efficiency (Log-Scale) Trie-PathStack performs the best Existing methods need to tune parameters 2018/9/17 Trie-Join @ VLDB2010

Algorithm Selection Trie-based algorithms VS Ed-Join 2018/9/17 Trie-Join @ VLDB2010

Trie-based algorithms outperform Ed-Join for all string lengths (τ ≤ 3) 2018/9/17 Trie-Join @ VLDB2010

Trie-based algorithms outperform Ed-Join for short strings (avg Trie-based algorithms outperform Ed-Join for short strings (avg. len ≤ 30 and τ >3) 2018/9/17 Trie-Join @ VLDB2010

Ed-Join outperforms Trie-based algorithms for long strings (avg Ed-Join outperforms Trie-based algorithms for long strings (avg. len >30 and τ >3) 2018/9/17 Trie-Join @ VLDB2010

Outline Motivation Preliminaries Trie-based Framework Trie-based algorithms Pruning Techniques Support Data Update Experiment Conclusion 2018/9/17 Trie-Join @ VLDB2010

Conclusion Trie-based similarity-join framework Trie-based algorithms Pruning techniques Trie-based algorithms have many advantages small index no need to tune parameters efficient for short strings support dynamic data update Trie-based algorithms significantly outperform state-of-the-art methods on data sets with short strings (Avg. Length ≤ 30) 2018/9/17 Trie-Join @ VLDB2010

Reference [1] Arasu et al. Efficient exact set-similarity joins. VLDB 2006. [2] Bayardo et al. Scaling up all pairs similarity search. WWW 2007. [3] Chaudhuri et al. A primitive operator for similarity joins in data cleaning. ICDE 2006. [4] Gravano et al. Approximate string joins in a database (almost) for free. VLDB 2001. [5] Sarawagi et al. Efficient set joins on similarity predicates. SIGMOD 2004. [6] Xiao et al. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008. [7] Ji et al. Efficient interactive fuzzy keyword search. WWW 2009. [8] Chaudhuri et al. Extending autocompletion to tolerate errors. SIGMOD 2009. 2018/9/17 Trie-Join @ VLDB2010

Thanks! Q&A http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/triejoin/ 2018/9/17 Trie-Join @ VLDB2010