Similarity join problem with Pass-Join-K using Hadoop, by Yu Haiyang

2015-10-2 http://datamining.xmu.edu.cn Outline: Background; The introduction of Pass-Join-K; Combining Pass-Join-K with Hadoop.

Background: Similarity join. Find all similar pairs from two sets. Applications: data cleaning, query relaxation, spellchecking. Examples: “PO BOX 23, Main St.” vs. “P.O. Box 23, Main St”; “information” vs. “imformation”.

Background: How to define similarity? Jaccard distance, cosine distance, edit distance.

Background: Edit distance. The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another. Examples: “Baby” → “Body” (substitution); “Bod” → “Body” (insertion).

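Background: How does the edit distance compare with the other two? Accuracy: consider {“abcdefg”, “gfedcba”}, which contain the same characters in reverse order; an order-insensitive set measure cannot tell them apart, while edit distance can. Verification time: $O(mn)$ -> $O(m+n)$.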
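The accuracy point can be illustrated with a minimal sketch. It assumes the strings are tokenized into single characters, which the slides do not specify; the function name `jaccard_distance` is hypothetical. Under that assumption the two strings get a Jaccard distance of zero, even though six edit operations are needed to turn one into the other.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance over the sets of single characters in a and b.

    Assumption: single-character tokens; the slides do not name a tokenizer.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


# Same characters, reversed order: the set-based distance sees no difference.
print(jaccard_distance("abcdefg", "gfedcba"))
```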

Background: Find similar pairs. We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}. Find some candidate pairs, and then verify each pair (yes or no).

Background: So we have to: (1) find candidate pairs, of which there are $O(N^2)$ if we do not prune any; (2) verify these pairs, at $O(mn)$ per pair.

Introduction of Pass-Join-K: Some obvious pruning techniques. Length-based (threshold = 2). Shift-based: “abcd” vs. “cdef”.
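The length-based filter rests on a simple fact: every insertion or deletion changes the length by one and a substitution leaves it unchanged, so the edit distance is never smaller than the length difference. A minimal sketch follows; the function name `passes_length_filter` is hypothetical and not taken from the slides.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Return True if the pair (r, s) survives length-based pruning.

    A pair whose lengths differ by more than tau cannot have
    edit distance <= tau, so it can be discarded without verification.
    """
    return abs(len(r) - len(s)) <= tau


# Lengths 11 and 8 differ by 3: pruned at threshold 2, kept at threshold 3.
print(passes_length_filter("abcdefghijk", "abdefghk", 2))
print(passes_length_filter("abcdefghijk", "abdefghk", 3))
```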

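Introduction of Pass-Join-K: Partition-based pruning technique. Suppose the threshold tau = 2 and k = 2, and we have the pair “abcdefghijk” and “abdefghk”. If one string is split into tau+k segments, tau edit operations can alter at most tau of them, so a pair within the threshold must still match on at least k segments.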

Introduction of Pass-Join-K: Partition Scheme. We have seen that the longer the substrings are, the harder they are to match. So we break the string into tau+k parts, where each part has length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1.

Introduction of Pass-Join-K: Partition Scheme. So we break the string into tau+k parts, where each part has length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1. Example string: “abcdefghijk”.
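The partition scheme can be sketched as follows. The slides do not say which segments receive the extra character; this sketch assumes the leading ones do, because that reproduces the segments (abc, def, ghi, jk) listed for “abcdefghijk” on the inverted index slide. The reverse convention would be equally valid. The function name `partition` is hypothetical.

```python
def partition(s: str, tau: int, k: int) -> list[str]:
    """Split s into tau + k consecutive segments of near-equal length.

    Assumption: when the length is not divisible by tau + k, the
    leading segments are one character longer than the trailing ones.
    """
    parts = tau + k
    base, extra = divmod(len(s), parts)
    segments = []
    start = 0
    for i in range(parts):
        length = base + 1 if i < extra else base
        segments.append(s[start:start + length])
        start += length
    return segments


# tau = 2, k = 2 gives four segments for the 11-character example string.
print(partition("abcdefghijk", tau=2, k=2))
```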

Introduction of Pass-Join-K: Partition Scheme. r = “abcdefghijk”, s = “abdefghk”. [Figure: inverted index L11 with segment slots 1 to 4, each pointing to r; substring “def”.]

Introduction of Pass-Join-K: Substring Selection. Here we suppose tau = 3 and k = 1. [Figure: “abcdefghijk” compared against the characters of “abdefghk”.]

Introduction of Pass-Join-K: Substring Selection. So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: “Pass-Join-K: A Similarity Join Algorithm with Multi-Segment Matching” (《Pass-Join-K 多分段匹配的相似性连接算法》).

Introduction of Pass-Join-K: Verification. DP (dynamic programming): $D(m,n) = \min\big(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathrm{flag}\big)$, where $\mathrm{flag} = 0$ when $s_m = r_n$ and $\mathrm{flag} = 1$ otherwise; s and r are both strings.
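A minimal sketch of the standard dynamic-programming verification, in which the recurrence takes the minimum of the three options and the substitution cost is 0 for equal characters and 1 otherwise. It keeps only two rows of the table; the threshold-aware shortcuts used by Pass-Join-K are not shown, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance between s and r via the standard DP recurrence."""
    m, n = len(s), len(r)
    prev = list(range(n + 1))  # D(0, j) = j: insert j characters
    for i in range(1, m + 1):
        curr = [i] + [0] * n  # D(i, 0) = i: delete i characters
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(
                curr[j - 1] + 1,      # insertion
                prev[j] + 1,          # deletion
                prev[j - 1] + flag,   # match or substitution
            )
        prev = curr
    return prev[n]


print(edit_distance("Bod", "Body"))
print(edit_distance("abcdefghijk", "abdefghk"))
```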

Introduction of Pass-Join-K: Verification. Here we suppose tau = 3 and k = 1. [Figure: “abcdefghijk” and “abdefghk” aligned at “def”.] Tau_left = 3, Tau_right = 3 - 3 = 0.

Combining Pass-Join-K with Hadoop: Inverted index tree in Hadoop. For r = “abcdefghijk”, the index L11 holds the records (abc, 1, 11, r), (def, 2, 11, r), (ghi, 3, 11, r), (jk, 4, 11, r).
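The records above can be produced from the segments of a string in a few lines. This is a plain-Python sketch of the record layout (segment, segment number, string length, id), not Hadoop code; the function names `index_records` and `build_index` are hypothetical, and the grouping by key only imitates what the shuffle phase would do.

```python
from collections import defaultdict


def index_records(segments, string_id):
    """Emit one (segment, segment number, string length, id) record per segment."""
    total_length = sum(len(seg) for seg in segments)
    return [
        (seg, number, total_length, string_id)
        for number, seg in enumerate(segments, start=1)
    ]


def build_index(records):
    """Group string ids by (segment, segment number, length), like a shuffle by key."""
    index = defaultdict(list)
    for seg, number, length, string_id in records:
        index[(seg, number, length)].append(string_id)
    return dict(index)


records = index_records(["abc", "def", "ghi", "jk"], "r")
print(records)
print(build_index(records))
```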

Combining Pass-Join-K with Hadoop: Substrings in Hadoop. Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate records such as (a, 1, 5, s), (a, 2, 6, s), (a, 3, 7, s), (ab, 1, 8, s), …, (ab, 1, 11, s), …

Combining Pass-Join-K with Hadoop: Substrings in Hadoop. Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to generate more than 2*tau*(tau+k)*m records, where m is the average number of substrings per segment, such as (a, 1, 5, s), (a, 1, 6, s), (a, 1, 7, s), (ab, 1, 8, s), …, (ab, 1, 11, s), …

Combining Pass-Join-K with Hadoop: Data flows in Hadoop.

Combining Pass-Join-K with Hadoop: How to improve the performance? We know that as k increases, the number of pairs we need to verify decreases. But as k increases, the number of records that must be transmitted in the Mapper phase grows by a factor of more than (tau+k+1)/(tau+k).

Combining Pass-Join-K with Hadoop: Here we have 2 ways to improve our algorithm. (1) Find a dataset in which the number of candidate pairs is large enough, or make tau large enough. (2) Decrease the amount of data generated in the Mapper phase.

Combining Pass-Join-K with Hadoop: Decrease the data flows.

Combining Pass-Join-K with Hadoop: Decrease the data flows. The inverted index record is formatted as (substring, segmentNumber, LengthInf, Id, flag). Each record's length is length(substring)+4*sizeof(int), and the substring can sometimes be very long. With Hash(substring) -> integer, the record length becomes 5*sizeof(int).
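The effect of hashing the substring can be sketched as below. The slides do not name a hash function, so CRC-32 from Python's `zlib` is used purely as a stand-in; the sketch also assumes 4-byte integers and numeric ids and flags, and the function name `pack_record` is hypothetical. Distinct substrings can collide under any such hash, which only adds candidates that verification would later reject.

```python
import struct
import zlib


def pack_record(substring, segment_number, length_info, string_id, flag):
    """Pack an index record into five 4-byte integers (20 bytes).

    Assumption: CRC-32 stands in for the unspecified Hash(substring).
    """
    digest = zlib.crc32(substring.encode("utf-8"))  # unsigned 32-bit value
    # "<I4i": one unsigned int for the hash, four signed ints for the rest.
    return struct.pack("<I4i", digest, segment_number, length_info, string_id, flag)


short = pack_record("abc", 1, 11, 7, 0)
long_ = pack_record("abc" * 500, 1, 1500, 7, 0)
# The packed size no longer depends on the substring length.
print(len(short), len(long_))
```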

Combining Pass-Join-K with Hadoop: Decrease the data flows. A substring generates similar records such as (a, 1, 5, s), (a, 1, 6, s), (a, 1, 7, s), … Each substring generates tau+k similar records, so we combine them into one, for example (a, 1, 5, 7, s). This reduces (tau+k)*4*sizeof(int) to 5*sizeof(int).
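Combining records that differ only in the length field can be sketched as a merge of consecutive lengths into a (low, high) range. This is an illustration of the idea, not the author's implementation; the function name `merge_length_ranges` and the output layout (substring, segment number, low, high, id) are assumptions modeled on the example (a, 1, 5, 7, s).

```python
def merge_length_ranges(records):
    """Merge (substring, segment, length, id) records with consecutive lengths.

    Returns (substring, segment, low, high, id) tuples, one per run of
    consecutive lengths that share substring, segment number and id.
    """
    merged = []
    ordered = sorted(records, key=lambda rec: (rec[0], rec[1], rec[3], rec[2]))
    for substring, segment, length, string_id in ordered:
        if merged:
            last = merged[-1]
            same_key = (last[0], last[1], last[4]) == (substring, segment, string_id)
            if same_key and last[3] + 1 == length:
                # Extend the current range by one length value.
                merged[-1] = (substring, segment, last[2], length, string_id)
                continue
        merged.append((substring, segment, length, length, string_id))
    return merged


records = [("a", 1, 5, "s"), ("a", 1, 6, "s"), ("a", 1, 7, "s"),
           ("ab", 1, 8, "s"), ("ab", 1, 9, "s"), ("ab", 1, 10, "s"), ("ab", 1, 11, "s")]
print(merge_length_ranges(records))
```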

Combining Pass-Join-K with Hadoop: Decrease the data flows. So by using the two steps above, we have reduced (length(substring)+4*sizeof(int))*(tau+k) to 5*sizeof(int).

Email: yhycai@gmail.com
