Pass-Join: A Partition based Method for Similarity Joins

Guoliang Li (Tsinghua, China), Dong Deng (Tsinghua, China), Jiannan Wang (Tsinghua, China), Jianhua Feng (Tsinghua, China)
Good afternoon, ladies and gentlemen. Thank you for this opportunity to talk to you today. My name is Dong Deng and I am from Tsinghua University. Today I am going to present a partition-based method for similarity joins. I will first give our motivation.

Real-world Data is Rather Dirty!
DBLP Complete Search: a typo in "author" (Argyrios Zymnis vs. Argyris Zymnis) and a typo in "title".
As we all know, data in the real world is very dirty. Here we can see typos in both the author field and the title field of the DBLP data. When you integrate data from different datasets, you may miss results due to the relaxed … related
11/13/2018 VLDB2012

Similarity Join: Equal Join
Dataset R (Conference): VLDB, SIGMOD, ICDE
Dataset S (Conference): CIDR, SIGMOD, PVLDB
When we integrate two datasets using an equal join, only exactly matching values are returned. In this example, the equal join finds SIGMOD as an answer, but we also want VLDB and PVLDB as a result.

Similarity Join: Similarity Join
Dataset R (Conference): VLDB, SIGMOD, ICDE
Dataset S (Conference): CIDR, SIGMOD, PVLDB
A similarity join finds all pairs of values that are similar to each other. Besides the pair of SIGMOD, we also find the pair of VLDB and PVLDB as an answer, even though they do not exactly match each other.

Applications
Data cleaning and integration; near-duplicate object detection; collaborative filtering; ...
Similarity joins have many applications, such as data cleaning and integration, near-duplicate object detection, and collaborative filtering. We need a similarity function to quantify the similarity between two strings. In this paper we use edit distance, which is a well-known similarity function.

Edit Distance
ED(r, s): the minimum number of single-character edit operations (insertion, deletion, substitution) needed to transform r into s.
For example, ED(hilton, huston) = 2: hilton → hulton (substitute i with u) → huston (substitute l with s).
Property: ED(r, s) ≥ ||r| - |s||.
The edit distance between two strings is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform one string into the other. For example, the edit distance between marios and maras is 2: to transform marios into maras, we only need a deletion and a substitution. Edit distance has many interesting properties. One of them is that the edit distance between two strings is never smaller than their length difference.
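The definition above can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration, not the verification routine used in the talk; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(r: str, s: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the current prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]  # transforming r[:i] into the empty string costs i deletions
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1,          # delete rc
                            curr[j - 1] + 1,      # insert sc
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hilton", "huston"))  # 2
print(edit_distance("marios", "maras"))   # 2
```

Both calls agree with the examples on the slide, and the result never drops below the length difference of the two inputs.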

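Problem Formulation
Given threshold τ = 3. A string similarity join takes a string set and an edit-distance threshold as input and outputs all similar pairs; a naive method computes the edit distance of all string pairs (on the order of N^2):
ED<s1, s2> = 5, ED<s1, s3> = 13, ED<s1, s4> = 12, ED<s1, s5> = 12, ED<s1, s6> = 14
ED<s2, s3> = 12, ED<s2, s4> = 12, ED<s2, s5> = 12, ED<s2, s6> = 14
ED<s3, s4> = 5, ED<s3, s5> = 4, ED<s3, s6> = 8
ED<s4, s5> = 4, ED<s4, s6> = 3
ED<s5, s6> = 8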

Filter-and-refine Methods
Basic idea: filter out a large number of dissimilar string pairs, then verify the remaining potentially similar pairs. Propose some filtering conditions.

Given threshold τ = 3. Pruning condition: ||si| - |sj|| > 3.
To judge whether two strings are similar, we also need an edit-distance threshold. The problem of similarity joins is then: given a set of strings and an edit-distance threshold, find all string pairs whose edit distance is within the given threshold. A naive method enumerates all string pairs and calculates their edit distances.
Here is our problem formulation, and below is an example that runs through all the slides. A string similarity join takes two string sets and an edit-distance threshold as input and outputs all similar pairs between the two sets. Hereinafter we focus on the self-join, where the two string sets are the same. In the example, we have a string set with 6 records, each with an ID on its left. Suppose the edit-distance threshold is τ = 3; the highlighted pair is an example answer.
(Note from M: remove the length part and add the naive method.)
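The naive method combined with the length-based pruning condition can be sketched as follows. This is an illustrative baseline, not the authors' code. The function names are hypothetical, and the six strings are assumed to be the running example, reconstructed from the segments shown on later slides (with "_" standing for the space character as it appears in the transcript).

```python
from itertools import combinations


def edit_distance(r: str, s: str) -> int:
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            cost = 0 if rc == sc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def naive_self_join(strings, tau):
    """Enumerate all pairs, prune by length difference, verify the rest."""
    results = []
    for (i, r), (j, s) in combinations(enumerate(strings), 2):
        if abs(len(r) - len(s)) > tau:
            continue  # pruning condition: ED(r, s) >= ||r| - |s|| > tau
        if edit_distance(r, s) <= tau:
            results.append((i, j))
    return results


# Assumed running example (0-based positions correspond to s1..s6).
strings = ["vankatesh", "avataresha", "kaushic_chaduri",
           "kaushik_chakrab", "kaushuk_chadhui", "caushik_chakrabar"]
print(naive_self_join(strings, 3))
```

The length filter is cheap, but every surviving pair still needs a full edit-distance computation, which is what the partition-based filter is meant to avoid.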

Basic idea: filter out a large number of dissimilar string pairs, then verify the remaining potentially similar pairs.
Drawbacks: need to tune parameters to achieve high performance; bad for short strings; cannot select high-quality pruning conditions.
Short strings are very common in practice, for example names and records.
Two state-of-the-art methods, Faerie [17] and NGPP [19], have been proposed to address this problem. The basic idea of NGPP is to split entities into partitions and guarantee that an entity and a substring are similar only if they have two similar partitions with edit distance no larger than 1. Faerie proposed a unified framework to support various similarity functions: an entity and a substring are similar if their overlap similarity is larger than a threshold. However, both have limitations. Firstly, they need parameter tuning to achieve high performance, which is a tedious and troublesome process. Secondly, Faerie uses gram-based index structures and NGPP indexes all 1-variants of partitions, so both involve a large index size. Lastly, they are inefficient for large edit-distance thresholds.

Outline
Motivation & Problem Formulation; Partition-based Framework; Improving Substring Selection; Improving the Verification; Experiment; Conclusion
Here is the outline of today's presentation. I'll give the motivation of our project first.

Our Filter Condition
Given threshold τ = 1: hilton vs. huston. Each of the two segments needs at least 1 edit operation, so the minimum number of edit operations is 2. Prune!

Our Filter Condition
Input: threshold τ, string r, string s. Split r into τ+1 disjoint segments. If string s is similar to string r, then s must have a substring matching a segment of r.
Is there any substring of s matching a segment of r? Yes: <r, s> is a candidate. No: we prune <r, s>.

How to partition?
Given threshold τ = 1: hilton vs. huston. A segment matches, so the pair is a candidate!

Partition Scheme
Even partition scheme: τ+1 partitions, each with nearly the same length. Example: τ = 3, "avataresha" → {"av", "at", "are", "sha"}.
Other schemes: select good partition strategies, e.g. the adaptive partition scheme [Deng et al. 2012a].
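The even partition scheme can be sketched as below. The slide only states that the τ+1 segments have nearly the same length; placing the longer segments at the end is an assumption chosen because it reproduces the slide's example. The function name `even_partition` is hypothetical.

```python
def even_partition(r: str, tau: int):
    """Split r into tau + 1 contiguous segments of nearly equal length.

    Returns (start, segment) pairs with 0-based start positions.
    Assumption: the shorter segments come first, the longer ones last.
    """
    n_seg = tau + 1
    base, k = divmod(len(r), n_seg)  # k segments get one extra character
    segments, pos = [], 0
    for i in range(n_seg):
        seg_len = base + 1 if i >= n_seg - k else base
        segments.append((pos, r[pos:pos + seg_len]))
        pos += seg_len
    return segments


print([seg for _, seg in even_partition("avataresha", 3)])
# ['av', 'at', 'are', 'sha']
```

Recording the start position next to each segment is convenient later, because the substring selection methods depend on where a segment begins.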

Partition-based Framework
1. Group all the strings by length into sets S_l: S_9, S_10, S_15, S_17. We use the basic idea to solve the self-join problem.

2. For each S_l, partition its strings into segments and build τ+1 inverted indexes L_l^i.
s3 = kau | shic | _cha | duri
s4 = kau | shik | _cha | krab
s5 = kau | shuk | _cha | dhui

3. Select substrings and generate candidates. For s6 = caushik_chakrabar, the candidates are <s3, s6>; <s4, s6>; <s5, s6>.

4. Verify the candidates. Candidates: <s3, s6>; <s4, s6>; <s5, s6>. ED(s3, s6) > 3; ED(s4, s6) = 3; ED(s5, s6) > 3.
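The four steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the plain length-based substring selection (every substring with the same length as the segment) and a textbook edit-distance check for verification. The names `segment_layout` and `pass_join` are hypothetical, and the sketch assumes the strings are longer than τ.

```python
from collections import defaultdict


def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        curr = [i]
        for j, sc in enumerate(s, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rc != sc)))
        prev = curr
    return prev[-1]


def segment_layout(length, tau):
    """(start, seg_len) of the tau + 1 even-partition segments."""
    base, k = divmod(length, tau + 1)
    layout, pos = [], 0
    for i in range(tau + 1):
        seg_len = base + 1 if i >= tau + 1 - k else base
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def pass_join(strings, tau):
    index = defaultdict(list)  # (length, segment no., segment text) -> ids
    results = set()
    # Step 1: visit strings in ascending order of length.
    for sid in sorted(range(len(strings)), key=lambda k: len(strings[k])):
        s = strings[sid]
        candidates = set()
        # Step 3: probe the indexes of all lengths in [|s| - tau, |s|].
        for l in range(max(0, len(s) - tau), len(s) + 1):
            for i, (_, seg_len) in enumerate(segment_layout(l, tau)):
                for start in range(len(s) - seg_len + 1):
                    key = (l, i, s[start:start + seg_len])
                    candidates.update(index.get(key, ()))
        # Step 4: verify the candidates.
        for rid in candidates:
            if edit_distance(strings[rid], s) <= tau:
                results.add((min(rid, sid), max(rid, sid)))
        # Step 2: insert the segments of s into the inverted indexes.
        for i, (start, seg_len) in enumerate(segment_layout(len(s), tau)):
            index[(len(s), i, s[start:start + seg_len])].append(sid)
    return results


strings = ["vankatesh", "avataresha", "kaushic_chaduri",
           "kaushik_chakrab", "kaushuk_chadhui", "caushik_chakrabar"]
print(pass_join(strings, 3))
```

Indexing a string only after it has been probed means each pair is generated once, and only against strings that are no longer than the probing string.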

Challenge
Decrease the size of the selected substring set. Accelerate the verification.

Naive Method
For each L_l^i, put all the substrings of s into W(s, l).
Segments of r = vankatesh (indexes L_9^1 to L_9^4): va, nk, at, esh.
Substrings of s = avataresha: av, va, at, ta, ar, re, es, sh, ha; ava, vat, ata, tar, are, res, esh, sha; ...; avataresh, vataresha; avataresha.

Naive Method
For each L_l^i, put all the substrings of s into W(s, l). The size of W(s, l) is (τ+1) · |s|(|s|+1)/2. For example, with 4 segments and |s| = 10, the size of W(s, l) is 220.

Length-based Method
For each L_l^i, only select substrings with the same length as the segment. Segments of r: va, nk, at, esh.
L_9^1: av, va, at, ta, ar, re, es, sh, ha
L_9^2: av, va, at, ta, ar, re, es, sh, ha
L_9^3: av, va, at, ta, ar, re, es, sh, ha
L_9^4: ava, vat, ata, tar, are, res, esh, sha

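Length-based Method
The size of W(s, l) is the sum of |s| - l_i + 1 over the τ+1 segments, where l_i is the length of the i-th segment. For |s| = 10 and τ = 3, the size of W(s, l) is 35.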
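The two sizes quoted for the naive and length-based methods can be checked by enumerating the selected substrings directly. This is a small sketch; the helper names are hypothetical, and it assumes the running example with s = avataresha, l = 9, and τ = 3.

```python
def segment_lengths(l, tau):
    """Lengths of the tau + 1 even-partition segments of a length-l string."""
    base, k = divmod(l, tau + 1)
    return [base] * (tau + 1 - k) + [base + 1] * k


def naive_selection(s, l, tau):
    """Every substring of s is selected for every inverted index."""
    all_substrings = [s[a:b] for a in range(len(s))
                      for b in range(a + 1, len(s) + 1)]
    return [all_substrings for _ in segment_lengths(l, tau)]


def length_based_selection(s, l, tau):
    """Only substrings with the same length as the i-th segment."""
    return [[s[a:a + li] for a in range(len(s) - li + 1)]
            for li in segment_lengths(l, tau)]


def total(selection):
    return sum(len(w) for w in selection)


s, l, tau = "avataresha", 9, 3
print(total(naive_selection(s, l, tau)))         # 220
print(total(length_based_selection(s, l, tau)))  # 35
```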

Shift-based Method
For each inverted index L_l^i with start position p_i, select all substrings with start position in [p_i - τ, p_i + τ]. Pruning condition: ||s_l| - |r_l|| > τ, where r_l and s_l are the parts of r and s to the left of the matching segment. First transform r_l to s_l.

Shift-based Method
The size of W(s, l) is at most (τ+1)(2τ+1). For |s| = 10 and τ = 3, the size of W(s, l) is 22, because some ranges are clipped at the boundaries of s. Segments of r: va, nk, at, esh.
L_9^1: av, va, at, ta
L_9^2: av, va, at, ta, ar, re
L_9^3: va, at, ta, ar, re, es, sh
L_9^4: tar, are, res, esh, sha
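The shift-based selection can be sketched as follows. The code uses 0-based start positions, so the range [p_i - τ, p_i + τ] is clipped to valid starts inside s. The helper names are hypothetical, and the even-partition layout (longer segments last) is the assumption that matches the slide's segments.

```python
def segment_layout(l, tau):
    """(start, seg_len) of the tau + 1 even-partition segments, 0-based."""
    base, k = divmod(l, tau + 1)
    layout, pos = [], 0
    for i in range(tau + 1):
        seg_len = base + 1 if i >= tau + 1 - k else base
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def shift_based_selection(s, l, tau):
    """For each segment, take substrings starting within tau of its start."""
    selection = []
    for p, seg_len in segment_layout(l, tau):
        lo = max(0, p - tau)
        hi = min(len(s) - seg_len, p + tau)
        selection.append([s[a:a + seg_len] for a in range(lo, hi + 1)])
    return selection


selection = shift_based_selection("avataresha", 9, 3)
for i, w in enumerate(selection, start=1):
    print(f"L_9^{i}:", " ".join(w))
print(sum(len(w) for w in selection))  # 22
```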

Position-aware Method
With left parts r_l, s_l and right parts r_r, s_r: ||s_l| - |r_l|| + ||s_r| - |r_r|| = 2 + 3 > 3.

For each inverted index L_l^i with start position p_i, select only the substrings whose start position lies in a range around p_i determined by τ and Δ, where Δ = |s| - |r| = |s| - l. Transform r_l to s_l and then transform r_r to s_r. Pruning condition: ||s_l| - |r_l|| + ||s_r| - |r_r|| > τ.

The size of W(s, l) is at most (τ+1)^2. For |s| = 10 and τ = 3, the size of W(s, l) is 14. Segments of r: va, nk, at, esh.
L_9^1: av, va, at
L_9^2: va, at, ta, ar
L_9^3: ta, ar, re, es
L_9^4: res, esh, sha
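The position-aware selection can be sketched as below. The transcript has lost the exact start-position range, so the range used here, [p_i - ⌊(τ - Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋], is an assumption: it is the set of shifts d for which |d| + |Δ - d| ≤ τ, and it reproduces the substrings listed on the slide. Positions are 0-based and the helper names are hypothetical.

```python
def segment_layout(l, tau):
    """(start, seg_len) of the tau + 1 even-partition segments, 0-based."""
    base, k = divmod(l, tau + 1)
    layout, pos = [], 0
    for i in range(tau + 1):
        seg_len = base + 1 if i >= tau + 1 - k else base
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def position_aware_selection(s, l, tau):
    """Keep only shifts d with |d| + |delta - d| <= tau (assumed range)."""
    delta = len(s) - l  # assumes 0 <= delta <= tau
    left = (tau - delta) // 2
    right = (tau + delta) // 2
    selection = []
    for p, seg_len in segment_layout(l, tau):
        lo = max(0, p - left)
        hi = min(len(s) - seg_len, p + right)
        selection.append([s[a:a + seg_len] for a in range(lo, hi + 1)])
    return selection


selection = position_aware_selection("avataresha", 9, 3)
for i, w in enumerate(selection, start=1):
    print(f"L_9^{i}:", " ".join(w))
print(sum(len(w) for w in selection))  # 14
```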

Multi-match-aware Method -- Left-side Perspective
r_l = "", s_l = "a": ||s_l| - |r_l|| = 1, so at most 2 errors remain for the 3 undetected partitions. There is still a matching segment.

For each inverted index L_l^i with start position p_i, select only the substrings whose start position lies in a range around p_i. Pruning condition: ||s_l| - |r_l|| + (# undetected parts) > τ.

The size of W(s, l) is τ^2 + 2τ. For |s| = 10 and τ = 3, the size of W(s, l) is 14. Segments of r: va, nk, at, esh.
L_9^1: av
L_9^2: va, at, ta
L_9^3: at, ta, ar, re, es
L_9^4: tar, are, res, esh, sha

Multi-match-aware Method -- Right-side Perspective
For each inverted index L_l^i with start position p_i, select only the substrings whose start position lies in a range around p_i. Pruning condition: (# undetected parts) + ||s_r| - |r_r|| > τ.

Multi-match-aware Method
We can combine the conclusions from the left-side and right-side perspectives. For each inverted index L_l^i with start position p_i, select only the substrings whose start position satisfies both conditions.

Multi-match-aware Method
For |s| = 10 and τ = 3, the size of W(s, l) is 8. Segments of r: va, nk, at, esh.
L_9^1: av
L_9^2: va, at, ta
L_9^3: ar, re, es
L_9^4: sha
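The combined multi-match-aware selection can be sketched as below. The exact ranges are missing from the transcript, so the ones used here are assumptions: from the left side, the i-th segment (1-based) may shift by at most i - 1, and from the right side its shift must stay within τ + 1 - i of Δ. Intersecting the two reproduces the 8 substrings listed on the slide. Positions in the code are 0-based and the helper names are hypothetical.

```python
def segment_layout(l, tau):
    """(start, seg_len) of the tau + 1 even-partition segments, 0-based."""
    base, k = divmod(l, tau + 1)
    layout, pos = [], 0
    for i in range(tau + 1):
        seg_len = base + 1 if i >= tau + 1 - k else base
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def multi_match_aware_selection(s, l, tau):
    delta = len(s) - l  # assumes 0 <= delta <= tau
    selection = []
    for i, (p, seg_len) in enumerate(segment_layout(l, tau)):
        before = i        # segments left of this one (assumed left-side budget)
        after = tau - i   # segments right of this one (assumed right-side budget)
        lo = max(0, p - before, p + delta - after)
        hi = min(len(s) - seg_len, p + before, p + delta + after)
        selection.append([s[a:a + seg_len] for a in range(lo, hi + 1)])
    return selection


selection = multi_match_aware_selection("avataresha", 9, 3)
for i, w in enumerate(selection, start=1):
    print(f"L_9^{i}:", " ".join(w))
print(sum(len(w) for w in selection))  # 8
```

Compared with the 220, 35, 22, and 14 substrings of the earlier methods, only 8 substrings of s are probed against the inverted indexes in this example.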

Theoretical Results
The number of substrings selected by the multi-match-aware method is minimum. For strings longer than 2(τ+1), our selection method is the only way to reach this minimum.

Number of Selected Substrings

Outline
Motivation; Problem Formulation; Partition-based Framework

Improving Verification
Length-aware Verification; Extension-based Verification; Sharing Computations

Length-aware Verification

Length difference: 3. Length difference: 1. The total difference is 4 > τ, so there is no need to process M[2][5].
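The idea of skipping matrix cells can be sketched with a banded dynamic program. This is an illustrative simplification, not the authors' verification algorithm: a cell M[i][j] is computed only if the length difference of the two prefixes plus the length difference of the two suffixes is at most τ, and a row is abandoned as soon as every cell in it is already too expensive. The function name `verify` and the return convention (τ + 1 meaning "not similar") are assumptions.

```python
def verify(r: str, s: str, tau: int) -> int:
    """Return ED(r, s) if it is at most tau, otherwise tau + 1."""
    if len(r) > len(s):
        r, s = s, r  # make s the longer string so that delta >= 0
    delta = len(s) - len(r)
    if delta > tau:
        return tau + 1
    # Cell (i, j) with shift d = j - i is useful only if |d| + |delta - d| <= tau.
    left = (tau - delta) // 2
    right = (tau + delta) // 2
    inf = tau + 1  # any value above tau is treated as "too large"
    prev = [j if j <= right else inf for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        curr = [inf] * (len(s) + 1)
        lo, hi = max(0, i - left), min(len(s), i + right)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i
                continue
            cost = 0 if r[i - 1] == s[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1,
                          prev[j - 1] + cost, inf)
        # Early termination: cost so far plus remaining length difference.
        if all(curr[j] + abs(delta - (j - i)) > tau for j in range(lo, hi + 1)):
            return inf
        prev = curr
    return min(prev[len(s)], inf)


print(verify("kaushik_chakrab", "caushik_chakrabar", 3))  # 3
print(verify("hilton", "huston", 1))                      # 2, i.e. tau + 1
```

In the second call the loop stops at the third row, since no alignment can stay within the threshold from there on.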

Extension-based Method
Share computation between different strings r.

Extension-based Method
We can verify a candidate pair using tighter thresholds: one for the left parts and one for the right parts.

Verification Time

Experimental Results
Setting: datasets and baselines (Trie-Join, ED-Join).

Comparison with existing methods

Scalability

Conclusion
We propose a partition-based framework. We develop techniques to select substrings. We prove that our method minimizes the number of selected substrings. We propose an extension-based method to efficiently verify a candidate pair.

Thanks! Q&A http://dbgroup.cs.tsinghua.edu.cn/dd/projects/passjoin/
Thank you very much! Welcome to our website for more information.
