Pass-Join: A Partition based Method for Similarity Joins
Guoliang Li (Tsinghua, China), Dong Deng (Tsinghua, China), Jiannan Wang (Tsinghua, China), Jianhua Feng (Tsinghua, China)
Good afternoon, ladies and gentlemen. Thank you for this opportunity to talk to you today. My name is Dong Deng and I am from Tsinghua University. Today I am going to present a partition-based method for similarity joins. I will first give our motivation.
Real-world Data is Rather Dirty!
DBLP Complete Search: typo in “author”, typo in “title”. Example: Argyrios Zymnis vs. Argyris Zymnis.
As we all know, data in the real world is very dirty. Here we can see typos in both the author field and the title field of the DBLP data. When you integrate data from different datasets, you may miss results.
11/13/2018 PassJoin @ VLDB2012
Similarity Join: Equal Join
Dataset R (Conference): VLDB, SIGMOD, ICDE
Dataset S (Conference): CIDR, SIGMOD, PVLDB
When we integrate two datasets using an equal join, only exactly matching values are returned. In this example, the equal join finds SIGMOD as an answer, but we also want the pair VLDB and PVLDB as a result.
Similarity Join: Similarity Join
Dataset R (Conference): VLDB, SIGMOD, ICDE
Dataset S (Conference): CIDR, SIGMOD, PVLDB
A similarity join finds all pairs of values that are similar to each other. Besides the pair of SIGMOD, we also find the pair of VLDB and PVLDB as an answer, even though they do not exactly match each other.
Applications: Data Cleaning and Integration; Near Duplicate Object Detection; Collaborative Filtering; ...
Similarity joins have many applications, such as data cleaning and integration, near duplicate object detection, and collaborative filtering. We need a similarity function to quantify the similarity between two strings. In this paper we use edit distance, which is a well-known similarity function.
Edit Distance
ED(r, s): the minimum number of single-character edit operations (insertion/deletion/substitution) to transform r to s.
For example: ED(hilton, huston) = 2. hilton → hulton (substitute i with u) → huston (substitute l with s).
Property: ED(r, s) ≥ ||r| − |s||
The edit distance between two strings is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform one string to another. For example, the edit distance between hilton and huston is 2. From the slide we can see that to transform hilton to huston, we only need two substitutions. Thus the edit distance between them is 2. Edit distance has many interesting properties. One of them is that the edit distance between two strings is never smaller than their length difference.
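The definition above can be sketched with the textbook dynamic program. This is a minimal illustration, not the verification algorithm of the talk; the function name `edit_distance` is our own choice.

```python
def edit_distance(r: str, s: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    # prev[j] holds the distance between the current prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]  # turning r[:i] into the empty string costs i deletions
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hilton", "huston"))  # 2
```

The length property from the slide follows because each edit operation changes the length by at most one.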
Problem Formulation
Given threshold τ = 3. The join takes a set of N strings as input and outputs the similar pairs; enumerating every pair requires on the order of N^2 edit distance computations:
ED<s1, s2> = 5; ED<s1, s3> = 13; ED<s1, s4> = 12; ED<s1, s5> = 12; ED<s1, s6> = 14
ED<s2, s3> = 12; ED<s2, s4> = 12; ED<s2, s5> = 12; ED<s2, s6> = 14
ED<s3, s4> = 5; ED<s3, s5> = 4; ED<s3, s6> = 8
ED<s4, s5> = 4; ED<s4, s6> = 3
ED<s5, s6> = 8
Filter-and-refine Methods Basic idea Filter a large number of dissimilar string pairs Verify the remaining potentially similar pairs Propose some filtering conditions
Filter-and-refine Methods
Given threshold τ = 3. Pruning condition: ||si| − |sj|| > 3
To judge whether two strings are similar, we need an edit distance threshold. The similarity join problem is: given a set of strings and an edit distance threshold, find all string pairs whose edit distance is within the given threshold. In general, a string similarity join takes two string sets and an edit distance threshold as input and outputs all similar pairs between the two sets. Hereinafter we focus on the self join, where the two string sets are the same. In the running example, which we use throughout the slides, we have a string set with 6 records, each with an ID on the left side. Suppose the edit distance threshold is τ = 3; the highlighted pair is an example answer. A naive method enumerates all string pairs and calculates their edit distances.
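As a sketch of the naive method combined with the length-based pruning condition, the following enumerates all pairs, skips pairs whose length difference exceeds τ, and verifies the rest. The names `naive_self_join` and `edit_distance` are hypothetical helpers introduced here for illustration.

```python
from itertools import combinations


def edit_distance(r: str, s: str) -> int:
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def naive_self_join(strings, tau):
    """Return the 1-based ID pairs (i, j), i < j, with ED <= tau."""
    results = []
    for (i, r), (j, s) in combinations(enumerate(strings, start=1), 2):
        if abs(len(r) - len(s)) > tau:
            continue  # length filter: ED(r, s) >= ||r| - |s|| > tau
        if edit_distance(r, s) <= tau:
            results.append((i, j))
    return results


print(naive_self_join(["hilton", "hulton", "huston"], 1))  # [(1, 2), (2, 3)]
```

The filter never removes a true answer, but every surviving pair still costs a full edit distance computation.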
Filter-and-refine Methods
Basic idea: filter a large number of dissimilar string pairs; verify the remaining potentially similar pairs.
Drawbacks: need to tune parameters to achieve high performance; bad for short strings; cannot select high-quality pruning conditions.
Short strings are very common in real applications, for example names and records.
Two state-of-the-art methods, Faerie [17] and NGPP [19], have been proposed to address this problem. NGPP first partitions entities into different partitions and guarantees that an entity and a substring are similar if they have two similar partitions with edit distance no larger than 1. Faerie proposes a unified framework to support various similarity functions: an entity and a substring are similar if their overlap similarity is larger than a threshold. However, both have limitations. First, they need parameter tuning to achieve high performance, which is a tedious and troublesome process. Second, Faerie uses gram-based index structures and NGPP indexes all 1-variants of partitions; both involve a large index size. Last, they are inefficient for large edit distance thresholds.
Outline Motivation & Problem Formulation Partition-based Framework Improving Substring Selection Improving the Verification Experiment Conclusion Here is the outline of today’s presentation. I’ll give the motivation of our project first.
Our Filter Condition
Given threshold τ = 1: hilton vs. huston. Each of the two segments of hilton needs at least 1 edit operation, so the minimum number of edit operations is 2 > τ. Prune!
Our Filter Condition
Threshold τ, string r, string s. Split r into τ + 1 disjoint segments. If string s is similar to string r, string s must have a substring matching a segment of r.
Is there any substring of s matching a segment of r? Yes: <r, s> is a candidate. No: we prune <r, s>.
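The filter condition can be sketched in a few lines. The function `is_candidate` is a hypothetical name; it takes the τ + 1 segments of r explicitly so that the effect of the partition choice is visible. The two partitions of hilton below are our own illustrative choices.

```python
def is_candidate(segments_of_r, s):
    """Keep <r, s> only if some segment of r occurs as a substring of s.

    With tau + 1 disjoint segments, tau edit operations can destroy at most
    tau of them, so a similar s must contain at least one segment unchanged.
    """
    return any(segment in s for segment in segments_of_r)


# tau = 1, so hilton is split into 2 segments.
print(is_candidate(["hi", "lton"], "huston"))  # False: the pair is pruned
print(is_candidate(["hil", "ton"], "huston"))  # True: "ton" matches, candidate
```

The second call shows that a candidate is not necessarily an answer: ED(hilton, huston) = 2 > 1, so this pair is only rejected later by verification.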
How to partition? Given threshold τ = 1: hilton vs. huston. Match: candidate!
Partition Scheme
Even partition scheme: τ + 1 partitions, each with nearly the same length. Example: τ = 3, “avataresha” → {“av”, “at”, “are”, “sha”}.
Other schemes: select good partition strategies, e.g., the adaptive partition scheme [Deng et al. 2012a].
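A minimal sketch of the even partition scheme follows. It assumes that the longer segments are placed last, which is what the slide example suggests; `even_partition` is a name introduced here for illustration.

```python
def even_partition(s, tau):
    """Split s into tau + 1 disjoint segments of nearly equal length.

    Assumption: the first segments get floor(|s| / (tau + 1)) characters and
    the last (|s| mod (tau + 1)) segments get one extra character.
    """
    n_segments = tau + 1
    base, extra = divmod(len(s), n_segments)
    segments, pos = [], 0
    for i in range(n_segments):
        length = base + (1 if i >= n_segments - extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments


print(even_partition("avataresha", 3))  # ['av', 'at', 'are', 'sha']
print(even_partition("vankatesh", 3))   # ['va', 'nk', 'at', 'esh']
```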
Partition-based Framework
1. Group all the strings by length into groups S_l, e.g., S_9, S_10, S_15, S_17.
We use the basic idea to solve the self similarity join problem.
Partition-based Framework
2. For each S_l, partition the strings into segments and build τ + 1 inverted indexes L_l^i.
S_15, segments 1 to 4:
s3 = kau | shic | _cha | duri
s4 = kau | shik | _cha | krab
s5 = kau | shuk | _cha | dhui
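Step 2 can be sketched as follows. The layout of the index (one dictionary keyed by string length, segment number and segment text) is our own simplification of the τ + 1 inverted indexes per length; `build_indexes` and `even_partition` are hypothetical names.

```python
from collections import defaultdict


def even_partition(s, tau):
    """Split s into tau + 1 nearly equal segments, longer segments last (assumption)."""
    base, extra = divmod(len(s), tau + 1)
    segments, pos = [], 0
    for i in range(tau + 1):
        length = base + (1 if i >= tau + 1 - extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments


def build_indexes(strings, tau):
    """Map (length l, segment number i, segment text) to the IDs containing it."""
    index = defaultdict(list)
    for sid, s in sorted(strings.items()):
        for i, segment in enumerate(even_partition(s, tau), start=1):
            index[(len(s), i, segment)].append(sid)
    return index


strings = {3: "kaushic_chaduri", 4: "kaushik_chakrab", 5: "kaushuk_chadhui"}
index = build_indexes(strings, tau=3)
print(index[(15, 1, "kau")])   # [3, 4, 5]
print(index[(15, 2, "shik")])  # [4]
```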
Partition-based Framework
3. Select substrings and generate candidates.
s6 = caushik_chakrabar
Candidates: <s3, s6>; <s4, s6>; <s5, s6>
Partition-based Framework
4. Verify the candidates.
Candidates: <s3, s6>; <s4, s6>; <s5, s6>
ED(s3, s6) > 3; ED(s4, s6) = 3; ED(s5, s6) > 3
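The four steps of the framework can be put together in a short sketch. It is deliberately unoptimized: substrings are selected by length only and candidates are verified with a plain dynamic program, both of which are simplifying assumptions rather than the techniques of the later slides. All function names are hypothetical.

```python
from collections import defaultdict


def edit_distance(r, s):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def segment_layout(l, tau):
    """(0-based start, length) of each of the tau + 1 even segments of a length-l string."""
    base, extra = divmod(l, tau + 1)
    layout, pos = [], 0
    for i in range(tau + 1):
        length = base + (1 if i >= tau + 1 - extra else 0)
        layout.append((pos, length))
        pos += length
    return layout


def self_join(strings, tau):
    """strings maps ID -> string; returns pairs (rid, sid) with ED <= tau."""
    index = defaultdict(list)  # (l, i, segment text) -> IDs of indexed strings
    results = []
    # Step 1: visit strings in ascending length order.
    for sid, s in sorted(strings.items(), key=lambda kv: (len(kv[1]), kv[0])):
        candidates = set()
        # Step 3: probe the indexes of every length within tau of |s|.
        for l in range(max(0, len(s) - tau), len(s) + 1):
            for i, (_, seg_len) in enumerate(segment_layout(l, tau)):
                for start in range(len(s) - seg_len + 1):
                    key = (l, i, s[start:start + seg_len])
                    candidates.update(index.get(key, ()))
        # Step 4: verify the candidates.
        for rid in sorted(candidates):
            if edit_distance(strings[rid], s) <= tau:
                results.append((rid, sid))
        # Step 2: index the segments of s for the strings that follow.
        for i, (pos, seg_len) in enumerate(segment_layout(len(s), tau)):
            index[(len(s), i, s[pos:pos + seg_len])].append(sid)
    return results


strings = {3: "kaushic_chaduri", 4: "kaushik_chakrab",
           5: "kaushuk_chadhui", 6: "caushik_chakrabar"}
print(self_join(strings, tau=3))  # [(4, 6)]
```

Indexing a string only after it has been probed means each pair is generated once, with the shorter string always coming from the index.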
Challenge: decrease the size of the selected substring set; accelerate the verification.
Outline Motivation & Problem Formulation Partition-based Framework Improving Substring Selection Improving the Verification Experiment Conclusion
Naive Method
For each L_l^i, put all the substrings of s into W(s, l).
r = va | nk | at | esh (indexes L_9^1, L_9^2, L_9^3, L_9^4); s = avataresha.
Substrings of s probed against every index: av, va, at, ta, ar, re, es, sh, ha; ava, vat, ata, tar, are, res, esh, sha; ...; avataresh, vataresha; avataresha.
Naive Method
For each L_l^i, put all the substrings of s into W(s, l). The size of W(s, l) is (τ + 1) · |s|(|s| + 1)/2. For example, with 4 segments and |s| = 10, the size of W(s, l) is 220.
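The count of 220 can be checked by brute force. The sketch below enumerates every non-empty substring of s once per inverted index; `naive_selection_size` is a name introduced here for illustration.

```python
def naive_selection_size(s, tau):
    """Number of (index, substring) probes when all substrings go to all tau + 1 indexes."""
    n = len(s)
    all_substrings = [s[a:b] for a in range(n) for b in range(a + 1, n + 1)]
    return (tau + 1) * len(all_substrings)


print(naive_selection_size("avataresha", 3))  # 220
```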
Length-based Method
For each L_l^i, only select substrings with the same length as the segments in that index.
r = va | nk | at | esh; s = avataresha.
L_9^1: av va at ta ar re es sh ha
L_9^2: av va at ta ar re es sh ha
L_9^3: av va at ta ar re es sh ha
L_9^4: ava vat ata tar are res esh sha
Length-based Method
The size of W(s, l) is (τ + 1)(|s| + 1) − l. For τ = 3 and |s| = 10 (with l = 9), the size of W(s, l) is 35.
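A small sketch of the length-based selection, using the segment lengths 2, 2, 2, 3 of va | nk | at | esh. The function name `length_based_selection` is hypothetical.

```python
def length_based_selection(s, segment_lengths):
    """For the i-th index, select every substring of s whose length equals l_i."""
    return [[s[p:p + li] for p in range(len(s) - li + 1)]
            for li in segment_lengths]


selected = length_based_selection("avataresha", [2, 2, 2, 3])
print(selected[0])                     # ['av', 'va', 'at', 'ta', 'ar', 're', 'es', 'sh', 'ha']
print(sum(len(w) for w in selected))   # 35
```

Each index contributes |s| − l_i + 1 substrings, which sums to (τ + 1)(|s| + 1) − l = 4 · 11 − 9 = 35.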
Shift-based Method
For each inverted index L_l^i with start position p_i, select all substrings with start position in [p_i − τ, p_i + τ].
Pruning condition: ||s_l| − |r_l|| > τ, where s_l and r_l are the parts of s and r to the left of the match. We first have to transform r_l to s_l.
Shift-based Method
The size of W(s, l) is (τ + 1)(2τ + 1), minus the start positions that fall outside s. For τ = 3 and |s| = 10, the size of W(s, l) is 22.
r = va | nk | at | esh; s = avataresha.
L_9^1: av va at ta
L_9^2: av va at ta ar re
L_9^3: va at ta ar re es sh
L_9^4: tar are res esh sha
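The shift-based window can be sketched as below. Segments are given as 1-based (start position, length) pairs taken from va | nk | at | esh; `shift_based_selection` is a hypothetical name, and clipping the window to valid start positions is our reading of why the example yields 22 rather than 28.

```python
def shift_based_selection(s, segments, tau):
    """segments: list of 1-based (p_i, l_i). Keep start positions in [p_i - tau, p_i + tau]."""
    selected = []
    for p, li in segments:
        lo = max(1, p - tau)
        hi = min(len(s) - li + 1, p + tau)
        selected.append([s[q - 1:q - 1 + li] for q in range(lo, hi + 1)])
    return selected


segments = [(1, 2), (3, 2), (5, 2), (7, 3)]
selected = shift_based_selection("avataresha", segments, tau=3)
for i, w in enumerate(selected, start=1):
    print(i, w)
print(sum(len(w) for w in selected))  # 22
```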
Position-aware Method
Example (r is split into r_l and r_r, and s into s_l and s_r, around the match): ||s_l| − |r_l|| + ||s_r| − |r_r|| = 2 + 3 > 3
Position-aware Method
For each inverted index L_l^i with start position p_i, select all substrings with start position in [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋], where Δ = |s| − |r| = |s| − l.
Transform r_l to s_l and then transform r_r to s_r.
Pruning condition: ||s_l| − |r_l|| + ||s_r| − |r_r|| > τ
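Position-aware Method
The size of W(s, l) is (τ + 1)^2, minus the start positions that fall outside s. For τ = 3 and |s| = 10, the size of W(s, l) is 14.
r = va | nk | at | esh; s = avataresha.
L_9^1: av va at
L_9^2: va at ta ar
L_9^3: ta ar re es
L_9^4: res esh sha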
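A sketch of the position-aware selection follows. It assumes the start-position window [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋] with Δ = |s| − l, clipped to valid positions; this window reproduces the 14 substrings listed on the slide. The function name is hypothetical.

```python
def position_aware_selection(s, segments, l, tau):
    """segments: 1-based (p_i, l_i) of an indexed string of length l <= |s|."""
    delta = len(s) - l
    left, right = (tau - delta) // 2, (tau + delta) // 2
    selected = []
    for p, li in segments:
        lo = max(1, p - left)
        hi = min(len(s) - li + 1, p + right)
        selected.append([s[q - 1:q - 1 + li] for q in range(lo, hi + 1)])
    return selected


segments = [(1, 2), (3, 2), (5, 2), (7, 3)]  # va | nk | at | esh
selected = position_aware_selection("avataresha", segments, l=9, tau=3)
for i, w in enumerate(selected, start=1):
    print(i, w)
print(sum(len(w) for w in selected))  # 14
```

The window is asymmetric because s is longer than r: a shift to the right can be absorbed by the length difference, while a shift to the left adds to it.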
Multi-match-aware Method -- Left-side Perspective
Example: r_l = “”, s_l = “a”, so ||s_l| − |r_l|| = 1. That leaves at most 2 errors in the 3 undetected partitions, so the pair still has another matching segment.
Multi-match-aware Method -- Left-side Perspective
For each inverted index L_l^i with start position p_i, select all substrings with start position in [p_i − (i − 1), p_i + (i − 1)].
Pruning condition: ||s_l| − |r_l|| + (# undetected parts) > τ
Multi-match-aware Method -- Left-side Perspective
The size of W(s, l) is τ^2 + 2τ. For τ = 3 and |s| = 10, the size of W(s, l) is 14.
r = va | nk | at | esh; s = avataresha.
L_9^1: av
L_9^2: va at ta
L_9^3: at ta ar re es
L_9^4: tar are res esh sha
Multi-match-aware Method -- Right-side Perspective
For each inverted index L_l^i with start position p_i, select all substrings with start position in [p_i + Δ − (τ + 1 − i), p_i + Δ + (τ + 1 − i)].
Pruning condition: (# undetected parts) + ||s_r| − |r_r|| > τ
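Multi-match-aware Method
We can combine the conclusions from the left side and the right side simultaneously. For each inverted index L_l^i with start position p_i, select all substrings with start position in [max(p_i − (i − 1), p_i + Δ − (τ + 1 − i)), min(p_i + (i − 1), p_i + Δ + (τ + 1 − i))].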
Multi-match-aware Method
The size of W(s, l) is ⌊(τ^2 − Δ^2)/2⌋ + τ + 1. For τ = 3 and Δ = 1, the size of W(s, l) is 8.
r = va | nk | at | esh; s = avataresha.
L_9^1: av
L_9^2: va at ta
L_9^3: ar re es
L_9^4: sha
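The combined selection can be sketched as the intersection of two windows. The sketch assumes the left-side window [p_i − (i − 1), p_i + (i − 1)] and the right-side window [p_i + Δ − (τ + 1 − i), p_i + Δ + (τ + 1 − i)]; their intersection reproduces the 8 substrings on the slide. The function name is hypothetical.

```python
def multi_match_aware_selection(s, segments, l, tau):
    """segments: 1-based (p_i, l_i) of an indexed string of length l <= |s|."""
    delta = len(s) - l
    selected = []
    for i, (p, li) in enumerate(segments, start=1):
        # Left-side perspective: at most i - 1 errors before segment i.
        lo_left, hi_left = p - (i - 1), p + (i - 1)
        # Right-side perspective: at most tau + 1 - i errors after segment i.
        lo_right = p + delta - (tau + 1 - i)
        hi_right = p + delta + (tau + 1 - i)
        lo = max(1, lo_left, lo_right)
        hi = min(len(s) - li + 1, hi_left, hi_right)
        selected.append([s[q - 1:q - 1 + li] for q in range(lo, hi + 1)])
    return selected


segments = [(1, 2), (3, 2), (5, 2), (7, 3)]  # va | nk | at | esh
selected = multi_match_aware_selection("avataresha", segments, l=9, tau=3)
for i, w in enumerate(selected, start=1):
    print(i, w)
print(sum(len(w) for w in selected))  # 8
```

For τ = 3 and Δ = 1 the total agrees with ⌊(τ^2 − Δ^2)/2⌋ + τ + 1 = 4 + 4 = 8.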
Theoretical Results
The number of substrings selected by the multi-match-aware method is the minimum.
For strings longer than 2(τ + 1), our selection method is the only one that achieves this minimum.
Number of Selected Substrings
Outline Motivation Problem Formulation Partition-based Framework Improving Substring Selection Improving the Verification Experiment Conclusion
Improving Verification Length-aware Verification Extension-based Verification Sharing Computations
Length-aware Verification
Length-aware Verification
Length difference: 3. Length difference: 1. The total difference is 4 > τ, so there is no need to process M[2][5].
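The idea of skipping matrix cells can be sketched as a thresholded dynamic program. The band limits ⌊(τ − Δ)/2⌋ and ⌊(τ + Δ)/2⌋ and the early termination test (current cost plus remaining length difference) are our assumptions about what the figures illustrate; `length_aware_verify` is a hypothetical name.

```python
def length_aware_verify(r, s, tau):
    """Return ED(r, s) if it is <= tau, otherwise tau + 1."""
    if len(r) > len(s):
        r, s = s, r  # make s the longer string
    delta = len(s) - len(r)
    too_far = tau + 1
    if delta > tau:
        return too_far
    left, right = (tau - delta) // 2, (tau + delta) // 2
    # Row 0: only cells inside the band are computed, the rest count as too far.
    prev = [j if j <= right else too_far for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        curr = [too_far] * (len(s) + 1)
        can_continue = False
        for j in range(max(0, i - left), min(len(s), i + right) + 1):
            if j == 0:
                curr[j] = i
            else:
                cost = 0 if r[i - 1] == s[j - 1] else 1
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1,
                              prev[j - 1] + cost, too_far)
            # Cost so far plus the length difference of the remaining suffixes.
            expected = curr[j] + abs((len(s) - j) - (len(r) - i))
            if expected <= tau:
                can_continue = True
        if not can_continue:
            return too_far  # every cell of this row already exceeds tau
        prev = curr
    return prev[len(s)]


print(length_aware_verify("kaushik_chakrab", "caushik_chakrabar", 3))  # 3
print(length_aware_verify("hilton", "huston", 1))                      # 2, i.e. tau + 1
```

With τ = 1 and equal lengths the band shrinks to the main diagonal, and the hilton and huston check stops at the third row.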
Length-aware Verification
Extension-based Method
Share computation between different r.
Extension-based Method
We can verify a candidate pair using tighter thresholds, given that the i-th segment of r matches a substring of s:
For the left parts we can set τ_l = i − 1.
For the right parts we can set τ_r = τ + 1 − i.
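A sketch of extension-based verification, assuming the thresholds τ_l = i − 1 for the left parts and τ_r = τ + 1 − i for the right parts. The helper names and the argument layout (segment number i, 1-based positions p in r and q in s) are our own; a plain edit distance is used for both parts to keep the sketch short.

```python
def edit_distance(r, s):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def extension_verify(r, s, i, p, q, seg_len, tau):
    """The i-th segment of r (start p, length seg_len) matches s at start q."""
    tau_left, tau_right = i - 1, tau + 1 - i
    # Extend to the left of the match first; stop early if it already fails.
    if edit_distance(r[:p - 1], s[:q - 1]) > tau_left:
        return False
    # Then extend to the right of the match.
    return edit_distance(r[p - 1 + seg_len:], s[q - 1 + seg_len:]) <= tau_right


r, s = "kaushik_chakrab", "caushik_chakrabar"   # s4 and s6, tau = 3
print(extension_verify(r, s, 2, 4, 4, 4, 3))    # True: match on "shik"
print(extension_verify(r, s, 3, 8, 8, 4, 3))    # False: match on "_cha"
```

Since τ_l + τ_r = τ, a pair accepted this way has edit distance at most τ. The match on “_cha” fails only because its right threshold is 1 while “krab” and “krabar” differ by 2; the pair is still found through the match on “shik”.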
Verification Time
Outline Motivation & Problem Formulation Partition-based Framework Improving Substring Selection Improving the Verification Experiment Conclusion
Experimental Results Setting Datasets Baselines Trie-Join ED-Join
Comparison with existing methods
Scalability
Outline Motivation & Problem Formulation Partition-based Framework Improving Substring Selection Improving the Verification Experiment Conclusion
Conclusion We propose a partition-based framework. We develop techniques to select substrings. We prove that our method can minimize the number of selected substrings. We propose an extension-based method to efficiently verify a candidate pair.
Thanks! Q&A http://dbgroup.cs.tsinghua.edu.cn/dd/projects/passjoin/ Thank you very much! Welcome to our website for more information.