Pass-Join: A Partition based Method for Similarity Joins

Presentation transcript:

Pass-Join: A Partition based Method for Similarity Joins Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China) Jiannan Wang (Tsinghua, China) Jianhua Feng (Tsinghua, China) Good afternoon ladies and gentlemen. Thank you for this opportunity to talk to you today. My name is Dong Deng and I am from Tsinghua University. Today I am going to present a partition based method for similarity joins. I will first give our motivation.

Real-world Data is Rather Dirty! DBLP Complete Search. Typo in “author”: Argyrios Zymnis vs. Argyris Zymnis. Typo in “title”. As we all know, data in the real world is very dirty. Here we can see typos in both the author field and the title field of the DBLP data. When you integrate data from different datasets, you may miss results due to the relaxed ... related. 11/13/2018 PassJoin @ VLDB2012

Similarity Join: Equal Join. Dataset R (Conference): VLDB, SIGMOD, ICDE. Dataset S (Conference): CIDR, SIGMOD, PVLDB. When we integrate two datasets using an equal join, only exactly matching values are paired. In this example, the equal join finds only SIGMOD as an answer, but we also want VLDB and PVLDB as a result.

Similarity Join: Similarity Join. Dataset R (Conference): VLDB, SIGMOD, ICDE. Dataset S (Conference): CIDR, SIGMOD, PVLDB. A similarity join finds all pairs of values that are similar to each other. Besides the SIGMOD pair, we also find the pair VLDB and PVLDB as an answer, even though the two values do not match exactly.

Applications: Data Cleaning and Integration; Near Duplicate Object Detection; Collaborative Filtering; ... Similarity joins have many applications, such as data cleaning and integration, near-duplicate object detection, and collaborative filtering. We need a similarity function to quantify the similarity between two strings. In this paper we use edit distance, which is a well-known similarity function.

Edit Distance. ED(r, s): the minimum number of single-character edit operations (insertion/deletion/substitution) needed to transform r to s. For example, ED(hilton, huston) = 2: hilton → hulton (substitute i with u) → huston (substitute l with s). Property: ED(r, s) ≥ ||r|-|s||. The edit distance between two strings is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform one string into the other. For example, the edit distance between marios and maras is 2: to transform marios into maras, we only need a deletion and a substitution. Edit distance has many interesting properties. One of them is that the edit distance between two strings is never smaller than their length difference.
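The definition above can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration and not the verification algorithm of the talk; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    # prev[j] holds the distance between the processed prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]  # turning r[:i] into the empty string costs i deletions
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("hilton", "huston"))  # 2
print(edit_distance("marios", "maras"))   # 2
```

Both calls also respect the length property: the result is never smaller than the difference of the two string lengths.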

Problem Formulation. Given threshold τ=3. Takes a string set as input and outputs the similar pairs; there are N^2 pairs to compare: ED(s1,s2)=5, ED(s1,s3)=13, ED(s1,s4)=12, ED(s1,s5)=12, ED(s1,s6)=14, ED(s2,s3)=12, ED(s2,s4)=12, ED(s2,s5)=12, ED(s2,s6)=14, ED(s3,s4)=5, ED(s3,s5)=4, ED(s3,s6)=8, ED(s4,s5)=4, ED(s4,s6)=3, ED(s5,s6)=8.

Filter-and-refine Methods. Basic idea: filter out a large number of dissimilar string pairs, then verify the remaining potentially similar pairs. Existing methods propose various filtering conditions.

Filter-and-refine Methods. Given threshold τ=3. Pruning condition: ||si| - |sj|| > 3. To judge whether two strings are similar, we also need an edit-distance threshold. The problem of similarity joins is thus: given a set of strings and an edit-distance threshold, find all string pairs whose edit distance is within the given threshold. String similarity joins take two string sets and an edit-distance threshold as input and output all similar pairs between the two sets. Hereinafter we focus on the self join, where the two string sets are the same. In the running example, which is used throughout these slides, we have a string set with 6 records, each with an ID on its left side. Suppose the edit-distance threshold is τ = 3; the highlighted pair is an example answer. A naive method enumerates all string pairs and calculates their edit distances. Because ED(r, s) ≥ ||r|-|s||, any pair whose length difference exceeds τ can be pruned without computing its edit distance.
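As a rough sketch of the naive method combined with the length-based pruning condition, the following enumerates all pairs of the running example. The six strings are reassembled from the segments shown on the framework slides (an assumption about their exact spelling), and `naive_join` is a hypothetical helper name.

```python
from itertools import combinations


def edit_distance(r: str, s: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def naive_join(strings, tau):
    """Return index pairs (i, j), i < j, whose edit distance is at most tau."""
    results = []
    for (i, r), (j, s) in combinations(enumerate(strings), 2):
        if abs(len(r) - len(s)) > tau:   # length filter: ED >= length difference
            continue
        if edit_distance(r, s) <= tau:   # verification
            results.append((i, j))
    return results


# s1..s6 of the running example (0-based indices 0..5), spelling assumed.
STRINGS = ["vankatesh", "avataresha", "kaushic chaduri",
           "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]

print(naive_join(STRINGS, 3))  # [(3, 5)], i.e. the pair <s4, s6>
```

The length filter alone already removes every pair that mixes the short strings (lengths 9 and 10) with the long ones (lengths 15 and 17).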

Filter-and-refine Methods. Basic idea: filter out a large number of dissimilar string pairs, then verify the remaining potentially similar pairs. Drawbacks: need to tune parameters to achieve high performance; bad for short strings; cannot select a high-quality pruning condition. Short strings are very common in real applications, for example names and records. Two state-of-the-art methods, Faerie [17] and NGPP [19], have been proposed to address this problem. The basic idea of NGPP is to first split entities into partitions and to guarantee that an entity and a substring are similar only if they have two similar partitions with edit distance no larger than 1. Faerie proposes a unified framework that supports various similarity functions: an entity and a substring are similar if their overlap similarity is larger than a threshold. However, both have limitations. Firstly, they need parameter tuning to achieve high performance, which is a tedious and troublesome process. Secondly, Faerie uses gram-based index structures and NGPP indexes all 1-variants of the partitions, so both involve a large index size. Lastly, they are inefficient for large edit-distance thresholds.

Outline: Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion. Here is the outline of today’s presentation. I’ll give the motivation of our project first.

Our Filter Condition. Given threshold τ=1: hilton vs. huston. The minimum number of edit operations is 2 > τ. Prune!

Our Filter Condition. Given threshold τ, string r and string s: split r into τ+1 disjoint segments. If string s is similar to string r, s must have a substring matching a segment of r, because at most τ edit operations can touch at most τ of the τ+1 segments. Is there any substring of s matching a segment of r? Yes: <r, s> is a candidate. No: we prune <r, s>.
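The filter condition can be illustrated with a small sketch on the hilton/huston pair with τ = 1. The cut positions used below are assumptions chosen for illustration (the transcript does not preserve where the slides cut the string), and `split_at` and `is_candidate` are hypothetical helper names.

```python
def split_at(r: str, cuts: list[int]) -> list[str]:
    """Split r into disjoint segments at the given 0-based cut offsets."""
    bounds = [0, *cuts, len(r)]
    return [r[a:b] for a, b in zip(bounds, bounds[1:])]


def is_candidate(segments: list[str], s: str) -> bool:
    """<r, s> survives the filter if some segment of r occurs in s."""
    return any(seg in s for seg in segments)


tau = 1  # so r is split into tau + 1 = 2 segments

# Cut after "hi": neither "hi" nor "lton" occurs in "huston", so every
# segment needs at least one edit, i.e. at least 2 > tau edits in total.
print(is_candidate(split_at("hilton", [2]), "huston"))  # False -> prune

# Cut after "hil": the segment "ton" occurs in "huston".
print(is_candidate(split_at("hilton", [3]), "huston"))  # True -> candidate
```

The two outcomes show why the choice of partition matters: the filter is always safe, but its pruning power depends on where the string is cut.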

How to partition? Given threshold τ=1: hilton vs. huston. Match. Candidate!

Partition Scheme. Even partition scheme: τ+1 partitions, each with nearly the same length. Example: τ = 3, “avataresha” → {“av”, “at”, “are”, “sha”}. Other schemes: select good partition strategies, e.g. the adaptive partition scheme [Deng et al. 2012a].
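A minimal sketch of the even partition scheme follows. It assumes that when the length is not divisible by τ+1, the longer segments are placed last; this choice is inferred from the “avataresha” example and from the segments shown on the framework slides. The name `even_partition` is hypothetical.

```python
def even_partition(r: str, tau: int) -> list[str]:
    """Split r into tau + 1 segments whose lengths differ by at most one."""
    n = tau + 1
    base, extra = divmod(len(r), n)
    segments, pos = [], 0
    for i in range(n):
        # The last `extra` segments get one additional character (assumption).
        length = base + (1 if i >= n - extra else 0)
        segments.append(r[pos:pos + length])
        pos += length
    return segments


print(even_partition("avataresha", 3))       # ['av', 'at', 'are', 'sha']
print(even_partition("kaushic chaduri", 3))  # ['kau', 'shic', ' cha', 'duri']
```

Strings shorter than τ+1 would produce empty segments, so the sketch is only meaningful for strings with at least τ+1 characters.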

Partition-based Framework. 1. Group all the strings by length into sets S_l: S_9, S_10, S_15, S_17. We use the basic idea to solve the self-join problem.

Partition-based Framework. 2. For each S_l, partition the strings into segments and build τ+1 inverted indexes L_l^i. Example for S_15 (segments 1 | 2 | 3 | 4): s3 = kau | shic | _cha | duri; s4 = kau | shik | _cha | krab; s5 = kau | shuk | _cha | dhui.

Partition-based Framework. 3. Select substrings and generate candidates. Example: s6 = caushik_chakrabar. Candidates: <s3, s6>; <s4, s6>; <s5, s6>.

Partition-based Framework. 4. Verify the candidates. Candidates: <s3, s6>; <s4, s6>; <s5, s6>. ED(s3, s6) > 3, ED(s4, s6) = 3, ED(s5, s6) > 3.
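The four steps can be put together in a compact sketch. It is a simplified illustration under stated assumptions: strings are visited in increasing length, substrings are selected with the simple rule "same length as the segment" rather than the refined selection methods of the talk, verification uses a plain edit-distance routine, and all function names are hypothetical.

```python
from collections import defaultdict


def edit_distance(r: str, s: str) -> int:
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def segment_lengths(l: int, tau: int) -> list[int]:
    """Even partition: tau + 1 lengths, the longer ones last (assumption)."""
    base, extra = divmod(l, tau + 1)
    return [base + (1 if i >= tau + 1 - extra else 0) for i in range(tau + 1)]


def pass_join(strings: list[str], tau: int) -> list[tuple[int, int]]:
    index = defaultdict(list)  # (l, i, segment) -> ids of indexed strings
    indexed_lengths = set()
    results = []
    # Step 1: process strings grouped (here: sorted) by length.
    for sid in sorted(range(len(strings)), key=lambda k: len(strings[k])):
        s = strings[sid]
        candidates = set()
        # Step 3: probe the indexes of all lengths within tau of |s|.
        for l in range(len(s) - tau, len(s) + 1):
            if l not in indexed_lengths:
                continue
            for i, seg_len in enumerate(segment_lengths(l, tau)):
                for p in range(len(s) - seg_len + 1):
                    candidates.update(index.get((l, i, s[p:p + seg_len]), ()))
        # Step 4: verify the candidates.
        for rid in candidates:
            if edit_distance(strings[rid], s) <= tau:
                results.append((min(rid, sid), max(rid, sid)))
        # Step 2: partition s and add its segments to the inverted indexes.
        pos = 0
        for i, seg_len in enumerate(segment_lengths(len(s), tau)):
            index[(len(s), i, s[pos:pos + seg_len])].append(sid)
            pos += seg_len
        indexed_lengths.add(len(s))
    return sorted(results)


STRINGS = ["vankatesh", "avataresha", "kaushic chaduri",
           "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]
print(pass_join(STRINGS, 3))  # [(3, 5)], i.e. the pair <s4, s6>
```

Indexing each string only after it has probed the existing indexes means every pair is considered once, with the shorter (or earlier) string in the index.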

Challenge: decrease the size of the selected substring set; accelerate the verification.

Outline: Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion.

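Naive Method. For each L_l^i, put all the substrings of s into W(s, l). Example: s = avataresha is probed against the indexes L_9^1, L_9^2, L_9^3, L_9^4 built from vankatesh = va | nk | at | esh. Every index is probed with all substrings of s: a, v, a, t, a, r, e, s, h, a; av, va, at, ta, ar, re, es, sh, ha; ava, vat, ata, tar, are, res, esh, sha; ……; avataresh, vataresha; avataresha.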

Naive Method. For each L_l^i, put all the substrings of s into W(s, l). The size of W(s, l) is (τ+1)·|s|(|s|+1)/2. For example, with 4 segments and |s| = 10, the size of W(s, l) is 220.
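The count of 220 can be checked with a few lines. This sketch simply enumerates every substring of s once per inverted index; the function names are assumptions made for the example.

```python
def all_substrings(s: str) -> list[str]:
    """Every non-empty substring of s, one entry per (start, end) pair."""
    return [s[a:b] for a in range(len(s)) for b in range(a + 1, len(s) + 1)]


def naive_selection_size(s: str, tau: int) -> int:
    """The naive method probes each of the tau + 1 indexes with all substrings."""
    return (tau + 1) * len(all_substrings(s))


print(len(all_substrings("avataresha")))           # 55
print(naive_selection_size("avataresha", tau=3))   # 220
```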

Length-based Method. For each L_l^i, only select the substrings of s that have the same length as the indexed segment. Example: s = avataresha against vankatesh = va | nk | at | esh. L_9^1: av va at ta ar re es sh ha. L_9^2: av va at ta ar re es sh ha. L_9^3: av va at ta ar re es sh ha. L_9^4: ava vat ata tar are res esh sha.

Length-based Method. The size of W(s, l) is (τ+1)(|s|+1) - l. For τ = 3, |s| = 10 and l = 9, the size of W(s, l) is 35.
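A short sketch of the length-based selection reproduces the count of 35 for s = avataresha and l = 9. The even-partition helper and its "longer segments last" rule are assumptions carried over from the partition example; the names are hypothetical.

```python
def segment_lengths(l: int, tau: int) -> list[int]:
    """Even partition: tau + 1 lengths, the longer ones last (assumption)."""
    base, extra = divmod(l, tau + 1)
    return [base + (1 if i >= tau + 1 - extra else 0) for i in range(tau + 1)]


def length_based(s: str, l: int, tau: int) -> list[list[str]]:
    """For each index L_l^i, select the substrings of s with the segment's length."""
    return [[s[p:p + n] for p in range(len(s) - n + 1)]
            for n in segment_lengths(l, tau)]


selected = length_based("avataresha", l=9, tau=3)
print(selected[3])                      # substrings probed against L_9^4
print(sum(len(w) for w in selected))    # 35
```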

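Shift-based Method. For each inverted index L_l^i whose segment has start position p_i, select all substrings with start position in [p_i - τ, p_i + τ]. Pruning condition: ||s_l| - |r_l|| > τ, where r_l and s_l are the parts of r and s to the left of the matching segment (first transform r_l to s_l).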

Shift-based Method. The size of W(s, l) is at most (τ+1)(2τ+1). For τ = 3 and s = avataresha, the size of W(s, l) is 22. Example against vankatesh = va | nk | at | esh. L_9^1: av va at ta. L_9^2: av va at ta ar re. L_9^3: va at ta ar re es sh. L_9^4: tar are res esh sha.
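The shift-based window can be sketched as follows, using 1-based start positions as on the slides. The helper `segments` and its partition rule are assumptions; windows are clipped to positions that exist in s, which is why the example yields 22 rather than the upper bound of 28.

```python
def segments(l: int, tau: int) -> list[tuple[int, int]]:
    """(1-based start position, length) of each even-partition segment."""
    base, extra = divmod(l, tau + 1)
    out, start = [], 1
    for i in range(tau + 1):
        n = base + (1 if i >= tau + 1 - extra else 0)
        out.append((start, n))
        start += n
    return out


def shift_based(s: str, l: int, tau: int) -> list[list[str]]:
    """Select substrings starting in [p_i - tau, p_i + tau] for each segment."""
    selected = []
    for p_i, n in segments(l, tau):
        lo = max(1, p_i - tau)
        hi = min(len(s) - n + 1, p_i + tau)
        selected.append([s[p - 1:p - 1 + n] for p in range(lo, hi + 1)])
    return selected


selected = shift_based("avataresha", l=9, tau=3)
for i, w in enumerate(selected, start=1):
    print(f"L_9^{i}:", " ".join(w))
print(sum(len(w) for w in selected))  # 22
```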

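Position-aware Method. With r_l, s_l the parts to the left of the matching segment and r_r, s_r the parts to its right: ||s_l| - |r_l|| + ||s_r| - |r_r|| = 2 + 3 > 3, so the pair is pruned.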

Position-aware Method. For each inverted index L_l^i whose segment has start position p_i, select all substrings with start position in [p_i - ⌊(τ-Δ)/2⌋, p_i + ⌊(τ+Δ)/2⌋], where Δ = |s| - |r| = |s| - l. Transform r_l to s_l and then transform r_r to s_r. Pruning condition: ||s_l| - |r_l|| + ||s_r| - |r_r|| > τ.

Position-aware Method. The size of W(s, l) is at most (τ+1)^2. For τ = 3 and s = avataresha, the size of W(s, l) is 14. Example against vankatesh = va | nk | at | esh. L_9^1: av va at. L_9^2: va at ta ar. L_9^3: ta ar re es. L_9^4: res esh sha.
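The following sketch applies a position-aware window to the same example. The window [p_i - ⌊(τ-Δ)/2⌋, p_i + ⌊(τ+Δ)/2⌋] with Δ = |s| - l is an assumption insofar as the transcript does not preserve the formula itself; it is used here because it reproduces exactly the substrings listed on the slide. Helper names are hypothetical.

```python
def segments(l: int, tau: int) -> list[tuple[int, int]]:
    """(1-based start position, length) of each even-partition segment."""
    base, extra = divmod(l, tau + 1)
    out, start = [], 1
    for i in range(tau + 1):
        n = base + (1 if i >= tau + 1 - extra else 0)
        out.append((start, n))
        start += n
    return out


def position_aware(s: str, l: int, tau: int) -> list[list[str]]:
    delta = len(s) - l  # length difference between the probe and indexed strings
    selected = []
    for p_i, n in segments(l, tau):
        lo = max(1, p_i - (tau - delta) // 2)
        hi = min(len(s) - n + 1, p_i + (tau + delta) // 2)
        selected.append([s[p - 1:p - 1 + n] for p in range(lo, hi + 1)])
    return selected


selected = position_aware("avataresha", l=9, tau=3)
for i, w in enumerate(selected, start=1):
    print(f"L_9^{i}:", " ".join(w))
print(sum(len(w) for w in selected))  # 14
```

A start position outside this window forces the left and right length differences to add up to more than τ, which is the pruning condition stated on the slide.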

Multi-match-aware Method: Left-side Perspective. Example: r_l = “”, s_l = “a”, so ||s_l| - |r_l|| = 1. That leaves at most 2 errors for the 3 undetected partitions, so there must still be another matching segment.

Multi-match-aware Method: Left-side Perspective. For each inverted index L_l^i whose segment has start position p_i, select all substrings with start position in [p_i - (i-1), p_i + (i-1)]. Pruning condition: ||s_l| - |r_l|| + (# undetected parts) > τ, where the number of undetected parts is τ+1-i.

Multi-match-aware Method: Left-side Perspective. The size of W(s, l) is τ^2 + 2τ. For τ = 3 and s = avataresha, the size of W(s, l) is 14. Example against vankatesh = va | nk | at | esh. L_9^1: av. L_9^2: va at ta. L_9^3: at ta ar re es. L_9^4: tar are res esh sha.

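Multi-match-aware Method: Right-side Perspective. For each inverted index L_l^i whose segment has start position p_i, select all substrings with start position in [p_i + Δ - (τ+1-i), p_i + Δ + (τ+1-i)]. Pruning condition: (# undetected parts) + ||s_r| - |r_r|| > τ, where the number of undetected parts is i-1.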

Multi-match-aware Method. We can combine the conclusions from the left side and the right side. For each inverted index L_l^i whose segment has start position p_i, select all substrings with start position in [max(p_i - (i-1), p_i + Δ - (τ+1-i)), min(p_i + (i-1), p_i + Δ + (τ+1-i))].

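Multi-match-aware Method. The size of W(s, l) is ⌊(τ^2 - Δ^2)/2⌋ + τ + 1. For τ = 3 and s = avataresha (Δ = 1), the size of W(s, l) is 8. Example against vankatesh = va | nk | at | esh. L_9^1: av. L_9^2: va at ta. L_9^3: ar re es. L_9^4: sha.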
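The left-side, right-side and combined windows can be sketched together. The window formulas below are assumptions in the sense that the transcript lost them; they are derived from the two pruning conditions (i-1 segments lie to the left of segment i, τ+1-i to its right) and they reproduce the substrings listed on the slides. All names are hypothetical.

```python
def segments(l: int, tau: int) -> list[tuple[int, int]]:
    """(1-based start position, length) of each even-partition segment."""
    base, extra = divmod(l, tau + 1)
    out, start = [], 1
    for i in range(tau + 1):
        n = base + (1 if i >= tau + 1 - extra else 0)
        out.append((start, n))
        start += n
    return out


def multi_match(s: str, l: int, tau: int, side: str = "both") -> list[list[str]]:
    """Select substrings per segment; side is 'left', 'right' or 'both'."""
    delta = len(s) - l
    selected = []
    for i, (p_i, n) in enumerate(segments(l, tau), start=1):
        lo, hi = 1, len(s) - n + 1  # positions that exist in s
        if side in ("left", "both"):
            lo, hi = max(lo, p_i - (i - 1)), min(hi, p_i + (i - 1))
        if side in ("right", "both"):
            lo = max(lo, p_i + delta - (tau + 1 - i))
            hi = min(hi, p_i + delta + (tau + 1 - i))
        selected.append([s[p - 1:p - 1 + n] for p in range(lo, hi + 1)])
    return selected


left = multi_match("avataresha", 9, 3, side="left")
both = multi_match("avataresha", 9, 3)
print(sum(len(w) for w in left))  # 14
print(both)                       # [['av'], ['va', 'at', 'ta'], ['ar', 're', 'es'], ['sha']]
print(sum(len(w) for w in both))  # 8
```

Intersecting the two windows is what brings the example from 14 selected substrings down to 8.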

Theoretical Results. The number of substrings selected by the multi-match-aware method is minimum. For strings longer than 2(τ+1), our selection is the only one that attains this minimum.

Number of Selected Substrings

Outline: Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion.

Improving Verification Length-aware Verification Extension-based Verification Sharing Computations

Length-aware Verification

Length-aware Verification. Length difference of the left parts: 3. Length difference of the right parts: 1. The total difference is 4 > τ, so there is no need to process M[2][5].

Length-aware Verification
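A minimal sketch of a length-aware verification follows. It only fills the cells M[i][j] whose left and right length differences can still fit within τ, namely i - ⌊(τ-Δ)/2⌋ ≤ j ≤ i + ⌊(τ+Δ)/2⌋ with Δ = |s| - |r|, and stops early once every cell of a row is already too expensive. The band formula and the early-termination test are assumptions consistent with the slide's M[2][5] example, not a transcription of the authors' code; the function name is hypothetical.

```python
def length_aware_ed(r: str, s: str, tau: int) -> int:
    """Return ED(r, s) if it is at most tau, otherwise tau + 1."""
    delta = len(s) - len(r)
    if abs(delta) > tau:
        return tau + 1
    inf = tau + 1                 # stands for "more than tau"
    below = (tau - delta) // 2    # how far j may fall below i
    above = (tau + delta) // 2    # how far j may exceed i
    prev = {j: j for j in range(min(len(s), above) + 1)}  # row 0
    for i in range(1, len(r) + 1):
        curr = {}
        for j in range(max(0, i - below), min(len(s), i + above) + 1):
            if j == 0:
                best = i
            else:
                best = min(prev.get(j, inf) + 1,           # delete r[i-1]
                           curr.get(j - 1, inf) + 1,       # insert s[j-1]
                           prev.get(j - 1, inf) + (r[i - 1] != s[j - 1]))
            curr[j] = min(best, inf)
        # Early termination: cost so far plus remaining length difference.
        if all(v + abs((len(s) - j) - (len(r) - i)) > tau
               for j, v in curr.items()):
            return inf
        prev = curr
    return min(prev.get(len(s), inf), inf)


print(length_aware_ed("kaushik chakrab", "caushik chakrabar", 3))  # 3
print(length_aware_ed("hilton", "huston", 1))  # 2, i.e. tau + 1: not similar
```

With τ = 3 and Δ = 0 the band keeps only cells with |i - j| ≤ 1, which is why a cell such as M[2][5] is never computed.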

Extension-based Method. Share computation between different strings r.

Extension-based Method. We can verify a candidate pair using tighter thresholds, where i is the index of the matching segment. For the left parts we can set τ_l = i - 1. For the right parts we can set τ_r = τ + 1 - i.
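The extension-based check can be sketched as follows: a candidate is split around the matching segment, and the left and right parts are verified separately with thresholds i - 1 and τ + 1 - i. These thresholds are an assumption consistent with the multi-match-aware pruning conditions; a plain edit-distance routine stands in for a threshold-aware one, and the function and parameter names are hypothetical.

```python
def edit_distance(r: str, s: str) -> int:
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def extension_verify(r: str, s: str, i: int, r_start: int, s_start: int,
                     seg_len: int, tau: int) -> bool:
    """Verify <r, s> around a matching segment.

    i is the 1-based index of the matching segment of r; r_start and s_start
    are the 0-based offsets where that segment begins in r and in s.
    """
    tau_left, tau_right = i - 1, tau + 1 - i
    # Check the left parts first; a failure avoids the right-side computation.
    if edit_distance(r[:r_start], s[:s_start]) > tau_left:
        return False
    return edit_distance(r[r_start + seg_len:], s[s_start + seg_len:]) <= tau_right


r, s = "kaushik chakrab", "caushik chakrabar"  # s4 and s6, spelling assumed
print(extension_verify(r, s, i=2, r_start=3, s_start=3, seg_len=4, tau=3))  # True
print(extension_verify(r, s, i=3, r_start=7, s_start=7, seg_len=4, tau=3))  # False
```

In the example the match on the second segment (“shik”) accepts the pair, while the match on the third segment is rejected because its right parts need 2 > τ_r = 1 edits. A pair is reported as soon as one of its matching segments passes the check.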

Verification Time

Outline: Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion.

Experimental Results. Setting: datasets. Baselines: Trie-Join, ED-Join.

Comparison with existing methods

Scalability

Outline: Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion.

Conclusion. We propose a partition-based framework. We develop techniques to select substrings. We prove that our method can minimize the number of selected substrings. We propose an extension-based method to efficiently verify a candidate pair.

Thanks! Q&A. Thank you very much! Please visit our website for more information: http://dbgroup.cs.tsinghua.edu.cn/dd/projects/passjoin/