Dong Deng, Guoliang Li, He Wen, H. V. Jagadish, Jianhua Feng

Presentation transcript:

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion Dong Deng, Guoliang Li, He Wen, H. V. Jagadish, Jianhua Feng DCST, Tsinghua University EECS, University of Michigan

Problems with Typing Tedious and slow Prone to error

Autocompletion

Error-Tolerant Autocompletion

Speed is Very Important Autocompletion must be done in real-time

Edit Distance
ED(r, s): the minimum number of edit operations (insertion, deletion, substitution) needed to transform r into s.
For example, ED(sso, solve) = 4: sso → so (delete s) → solve (insert l, v, e).
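As a minimal sketch of the definition above, the edit distance can be computed with the textbook dynamic program. The function name `edit_distance` is illustrative and not taken from the slides, and this is the classic quadratic algorithm, not the method the talk proposes.

```python
def edit_distance(r: str, s: str) -> int:
    """Textbook dynamic program for ED(r, s) with unit-cost edits."""
    # prev[j] holds ED(r[:i-1], s[:j]); the first row is ED("", s[:j]) = j.
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]  # ED(r[:i], "") = i deletions
        for j, sc in enumerate(s, start=1):
            curr.append(min(
                prev[j] + 1,               # delete rc from r
                curr[j - 1] + 1,           # insert sc into r
                prev[j - 1] + (rc != sc),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("sso", "solve"))  # the slide's example pair
```

Only two rows of the table are kept, which is enough when just the distance is needed and not the edit script.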

Prefix Edit Distance
PED(r, s): the minimum number of edit operations (insertion, deletion, substitution) needed to transform r into any prefix of s.
For example, PED(sso, solve) = 1: sso → so (delete s), and so is a prefix of solve.
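A small sketch of the prefix edit distance, assuming the same unit-cost dynamic program as for ED: the last row of the table holds ED(r, s[1..j]) for every prefix of s, so PED is the minimum of that row. The name `prefix_edit_distance` is illustrative only.

```python
def prefix_edit_distance(r: str, s: str) -> int:
    """PED(r, s): the smallest ED between r and any prefix of s."""
    prev = list(range(len(s) + 1))  # ED("", s[:j]) = j
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(
                prev[j] + 1,               # delete rc
                curr[j - 1] + 1,           # insert sc
                prev[j - 1] + (rc != sc),  # match or substitute
            ))
        prev = curr
    # prev[j] is now ED(r, s[:j]); the best prefix gives the answer.
    return min(prev)


print(prefix_edit_distance("sso", "solve"))  # the slide's example pair
```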

Problem Definition
Input query q, with results for each prefix q1, ..., q4:
Threshold-based, τ = 1:
q1 = 's' → s1, s2, s3, s4, s5, s6
q2 = 'ss' → s1, s2, s3, s4, s5, s6
q3 = 'sso' → s1, s2, s3, s4, s5
q4 = 'ssol' → s1, s2, s3, s4, s5
Top-k, k = 3:
q1 = 's' → s1, s2, s3
q2 = 'ss' → s1, s2, s3
q3 = 'sso' → s1, s2, s3
q4 = 'ssol' → s2, s3, s4
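The two query types can be illustrated with a brute-force baseline that recomputes PED for every string. This is only a reference sketch: the dataset below is hypothetical (the slide's strings s1 to s6 are not given in the transcript), the function names are illustrative, and breaking top-k ties alphabetically is an assumption.

```python
def ped(q: str, s: str) -> int:
    """Prefix edit distance via the standard dynamic program."""
    prev = list(range(len(s) + 1))
    for i, qc in enumerate(q, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (qc != sc)))
        prev = curr
    return min(prev)


def threshold_query(q, strings, tau):
    """All strings whose prefix edit distance to q is at most tau."""
    return [s for s in strings if ped(q, s) <= tau]


def top_k_query(q, strings, k):
    """The k strings with the smallest prefix edit distance to q.
    Ties are broken alphabetically (an assumption, not from the slides)."""
    return sorted(strings, key=lambda s: (ped(q, s), s))[:k]


# Hypothetical dataset, one query per keystroke.
data = ["so", "solve", "sun", "tea"]
for prefix in ["s", "ss", "sso"]:
    print(prefix, threshold_query(prefix, data, 1), top_k_query(prefix, data, 3))
```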

Related Work Error-Tolerant Auto-completion String Similarity Search Cannot share computation between continuous queries

Contributions Matching-based Framework Reduce redundant computations between consecutive keystrokes Support top-k queries

Edit Distance Calculation
If q[i] = s[j] is the last matching in the optimal transformation, then
ED(q, s) = ED(q[1..i], s[1..j]) + max(|q| − i, |s| − j)
Enumerate every matching q[i] = s[j]:
ED(q, s) = min( ED(q[1..i], s[1..j]) + max(|q| − i, |s| − j) )

Prefix Edit Distance Calculation
If q[i] = s[j] is the last matching in the optimal transformation, then
PED(q, s) = ED(q[1..i], s[1..j]) + (|q| − i)
Enumerate every matching q[i] = s[j]:
PED(q, s) = min( ED(q[1..i], s[1..j]) + (|q| − i) )
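The two matching-based formulas can be checked numerically against the plain dynamic program. The sketch below enumerates every matching q[i] = s[j] by brute force. It also assumes a virtual matching at position (0, 0) between the two empty prefixes, so that strings with no character in common are covered. That convention is an assumption here, not something stated on the slides, and all function names are illustrative.

```python
def ed(r: str, s: str) -> int:
    """Standard edit distance, used as the reference value."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def matchings(q: str, s: str):
    """Yield (i, j) with q[i] = s[j] (1-based), plus the virtual matching (0, 0)."""
    yield 0, 0  # assumed convention: empty prefixes always match
    for i in range(1, len(q) + 1):
        for j in range(1, len(s) + 1):
            if q[i - 1] == s[j - 1]:
                yield i, j


def ed_by_matchings(q: str, s: str) -> int:
    return min(ed(q[:i], s[:j]) + max(len(q) - i, len(s) - j)
               for i, j in matchings(q, s))


def ped_by_matchings(q: str, s: str) -> int:
    return min(ed(q[:i], s[:j]) + (len(q) - i)
               for i, j in matchings(q, s))


print(ed_by_matchings("sso", "solve"), ped_by_matchings("sso", "solve"))
```

The brute-force enumeration recomputes ED for every matching, so it is only useful for checking the identities, not as an efficient algorithm.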

Matching-based Framework
Active nodes: matching nodes that can deduce results
n.char = q[|q|] && ED(q, n) + (|q| − |n|) ≤ τ
Find new active nodes from existing ones by binary searching the matching nodes.
Example with τ = 2:
q = '' → n1
q = 's' → n2
q = 'ss' →
q = 'sso' → n3, n5, n11, n12
q = 'ssol' → n6
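One way to find matching nodes below an existing node by binary search is sketched here. It assumes trie nodes are numbered in preorder, so that every subtree is a contiguous id interval, and that each character keeps a sorted list of the node ids carrying it. The class `PreorderTrie`, its fields, and the example strings are all hypothetical; the slides do not specify the index layout, and the sketch does not apply the edit-distance test that decides whether a matching node is active.

```python
from bisect import bisect_right
from collections import defaultdict


class PreorderTrie:
    """Trie whose node ids follow preorder, so a subtree is an id interval."""

    def __init__(self, strings):
        self.char = [""]    # char[n]: label of node n (the root, id 0, has none)
        self.depth = [0]    # depth[n]: length of the prefix spelled by node n
        self.parent = [-1]
        children = [{}]
        # Inserting distinct strings in sorted order creates nodes in preorder.
        for s in sorted(set(strings)):
            node = 0
            for c in s:
                if c not in children[node]:
                    children[node][c] = len(self.char)
                    self.char.append(c)
                    self.depth.append(self.depth[node] + 1)
                    self.parent.append(node)
                    children.append({})
                node = children[node][c]
        # last[n]: largest node id inside the subtree rooted at n.
        self.last = list(range(len(self.char)))
        for n in range(len(self.char) - 1, 0, -1):
            p = self.parent[n]
            self.last[p] = max(self.last[p], self.last[n])
        # by_char[c]: ids of all nodes labelled c, in increasing id order.
        self.by_char = defaultdict(list)
        for n in range(1, len(self.char)):
            self.by_char[self.char[n]].append(n)

    def matching_descendants(self, n, c):
        """Proper descendants of n labelled c, located by two binary searches."""
        ids = self.by_char.get(c, [])
        lo = bisect_right(ids, n)             # skip ids up to and including n
        hi = bisect_right(ids, self.last[n])  # stop at the end of n's subtree
        return ids[lo:hi]


trie = PreorderTrie(["sea", "so", "sol", "sun"])
print(trie.matching_descendants(0, "s"), trie.matching_descendants(1, "o"))
```

Each lookup costs two binary searches plus the size of the answer, regardless of how large the subtree under `n` is.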

Compact Tree Based Framework
Motivation: reduce redundant computations. Nodes marked in blue may lead to redundant computation, so we need a way to skip all overlapped depths.

Compact Tree Based Framework
We address the problem by building a compact tree over only the active nodes.
Nodes in the compact tree keep their relative order in the trie.
The depth information stored in each compact node is used to skip overlapped nodes.
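A rough sketch of how a tree over only the active nodes can preserve their relative order in the trie: if each active node is described by a preorder interval (its own id and the largest id in its subtree), a single pass with a stack links every node to its nearest active ancestor. The interval representation and the function `compact_tree` are assumptions made for illustration. The sketch omits the depth-based skipping mentioned on the slide and is not presented as the authors' construction.

```python
def compact_tree(active):
    """Map each active node to its nearest active ancestor (or None).

    `active` is an iterable of (pre, last) pairs, where `pre` is the node's
    preorder id in the trie and `last` is the largest id in its subtree.
    """
    parent = {}
    stack = []  # chain of active ancestors of the node being processed
    for pre, last in sorted(active):
        # Pop nodes whose subtree ends before this node starts.
        while stack and stack[-1][1] < pre:
            stack.pop()
        parent[pre] = stack[-1][0] if stack else None
        stack.append((pre, last))
    return parent


# Hypothetical active nodes given as (preorder id, last id in subtree).
print(compact_tree([(1, 7), (4, 5), (5, 5), (7, 7)]))
```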

Compact Tree Based Framework Maintain the compact tree: add active nodes

Supporting Top-K Queries
Observation: for two consecutive queries, the largest prefix edit distance to their results either remains unchanged or increases by 1.
For each query, we can group all the active nodes by their edit distance b to the query; each group is called the b-matching set.

Supporting Top-K Queries
Extend the matching-based framework. Once the current matching set can no longer produce K results, we:
(1) First deduce matchings with threshold b until K results are reached.
(2) If all active nodes with threshold b have been deduced, set b = b + 1 and repeat (1).
Note that the second deduction is guaranteed to always succeed.
Figure: after a new keystroke, the b-matching set is kept if τ = b still produces at least K results; if τ = b produces fewer than K results, the (b+1)-matching set is used.
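The threshold-escalation idea can be sketched without any index: keep the threshold b from the previous keystroke and raise it only while fewer than K strings fall within it. The code below is a brute-force stand-in that recomputes PED for every string instead of maintaining matching sets. The dataset, the function names, and the alphabetical tie-breaking are assumptions.

```python
def ped(q: str, s: str) -> int:
    """Prefix edit distance via the standard dynamic program."""
    prev = list(range(len(s) + 1))
    for i, qc in enumerate(q, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (qc != sc)))
        prev = curr
    return min(prev)


def top_k_step(q, strings, k, b=0):
    """Answer one keystroke, starting from the previous threshold b."""
    dist = {s: ped(q, s) for s in strings}
    while True:
        hits = sorted((d, s) for s, d in dist.items() if d <= b)
        # PED(q, s) never exceeds |q|, so b = |q| admits every string.
        if len(hits) >= k or b >= len(q):
            return [s for _, s in hits[:k]], b
        b += 1  # threshold b cannot supply K results: move on to b + 1


def type_query(query, strings, k):
    """Simulate typing `query` one character at a time."""
    b, trace = 0, []
    for i in range(1, len(query) + 1):
        results, b = top_k_step(query[:i], strings, k, b)
        trace.append((query[:i], b, results))
    return trace


for step in type_query("sso", ["so", "solve", "sun", "tea"], 3):
    print(step)
```

Carrying b forward is safe because extending the query never lowers a string's prefix edit distance, so the threshold needed for K results cannot shrink between keystrokes.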

Experiments
Datasets:
State-of-the-art methods: IncNGTrie, ICAN, IPCAN
String similarity search: HSTree, Pivotal
Setup: C++, GCC 4.8.2 with -O3; Intel Xeon E3650 2.00GHz; 48 GB memory

Matching vs Compact Tree On AOL QueryLog Compact Tree outperforms Matching by avoiding the redundant binary searches

Comparison with the State of the Art
Panels: top-k query; threshold-based query

Scalability
On AOL QueryLog. Panels: threshold-based; top-k

Conclusion Matching-based Framework Compact Tree Based Method Supporting Top-K Queries

Thank you Q & A