Published by Philippa Quinn. Modified over 9 years ago.
1
Large-scale Similarity Join with Edit-distance Constraints
By Yu Haiyang
2
2015-12-19 http://datamining.xmu.edu.cn
Outline: Background; The introduction of Pass-Join-K; Combining Pass-Join-K with Hadoop
3
Background
Similarity join: find all similar pairs from two sets.
Applications: data cleaning, query relaxation, spellchecking.
4
Background
How to define similarity?
Jaccard distance (bag-of-words model), cosine distance, or edit distance.
5
Background
Edit distance: the minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.
Examples: Baby -> Body (substitution); Bod -> Body (insertion).
6
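Background
How does edit distance compare with the other two?
Accuracy: the pair {"abcdefg", "gfedcba"} contains exactly the same characters, so a measure computed over unordered character tokens cannot tell the two strings apart, while edit distance can.
Verification time: the cost per pair rises from O(m+n) to O(mn).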
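The accuracy point can be checked with a small sketch. It assumes that the strings are tokenized into single characters, which the slide does not state; the function names are hypothetical. Under that assumption both token-based distances report the reversed pair as identical.

```python
import math
from collections import Counter


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance over the sets of characters of a and b (assumed tokenization)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cosine_distance(a: str, b: str) -> float:
    """Cosine distance over character-count vectors of a and b (assumed tokenization)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Both values are zero, up to floating-point rounding for the cosine case.
    print(jaccard_distance("abcdefg", "gfedcba"))
    print(cosine_distance("abcdefg", "gfedcba"))
```

Neither function looks at character order, which is why the much more expensive edit distance is still needed for this kind of pair.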
7
Background
Find similar pairs
We have two string sets: one is {vldb, sigmod, ...}, the other is {pvldb, icde, ...}.
First find some candidate pairs {...}, then verify each pair: the answer is Yes or No.
8
Background
So we have to:
1. Find candidate pairs. There are O(N^2) pairs if we do not prune any.
2. Verify these pairs, at O(mn) per pair.
9
Outline: Background; The introduction of Pass-Join-K; Combining Pass-Join-K with Hadoop
10
Introduction of Pass-Join-K
Partition-based pruning technique
Suppose the threshold tau = 2 and k = 1, and we have a pair [figure: example pair].
11
Introduction of Pass-Join-K
Partition-based pruning technique
Suppose the threshold tau = 2 and k = 2, and we have a pair [figure: example pair].
12
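Introduction of Pass-Join-K
Some obvious pruning techniques
Length-based (threshold = 2): two strings whose lengths differ by more than the threshold cannot be similar.
Shift-based: in abcd and cdef, the shared substring cd is shifted by two positions; a match shifted by more than the threshold can be discarded.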
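A minimal sketch of the two filters follows. The function names are hypothetical, and the shift rule is written under the assumption that positions are 0-based start offsets of the same substring in the two strings.

```python
def passes_length_filter(r: str, s: str, tau: int) -> bool:
    """Keep the pair only if the length difference does not already exceed tau."""
    return abs(len(r) - len(s)) <= tau


def passes_shift_filter(pos_r: int, pos_s: int, tau: int) -> bool:
    """Keep a substring match only if its two start positions differ by at most tau."""
    return abs(pos_r - pos_s) <= tau


if __name__ == "__main__":
    # "cd" starts at offset 2 in "abcd" and at offset 0 in "cdef".
    shift_ok = passes_shift_filter("abcd".find("cd"), "cdef".find("cd"), tau=2)
    print(passes_length_filter("vldb", "pvldb", tau=2), shift_ok)
```

Both checks are cheap integer comparisons, so they can be applied before any substring matching or dynamic programming is done.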
13
Introduction of Pass-Join-K
Partition Scheme
We have seen that the longer the substrings are, the harder they are to match. So we break the string into tau+k segments, each of length floor(length/(tau+k)) or floor(length/(tau+k))+1.
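The partition scheme can be sketched as follows. The slide does not say which segments receive the extra character; this sketch assumes the longer segments come first, because that reproduces the abc, def, ghi, jk example used later in the deck. The function name `partition` is hypothetical.

```python
def partition(s: str, tau: int, k: int) -> list[tuple[int, int, str]]:
    """Split s into tau + k segments of length floor(len/(tau+k)) or one more.

    Returns (segment_number, start_offset, segment) triples, numbered from 1.
    Assumption: the longer segments are placed first.
    """
    n = tau + k
    if len(s) < n:
        raise ValueError("string is shorter than the number of segments")
    base, extra = divmod(len(s), n)
    segments = []
    pos = 0
    for number in range(1, n + 1):
        length = base + 1 if number <= extra else base
        segments.append((number, pos, s[pos:pos + length]))
        pos += length
    return segments


if __name__ == "__main__":
    for number, pos, segment in partition("abcdefghijk", tau=3, k=1):
        print(number, pos, segment)
```

With tau = 3 and k = 1 the 11-character string is cut into four segments, three of length 3 and one of length 2.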
14
Introduction of Pass-Join-K
Partition Scheme [figure]
15
Introduction of Pass-Join-K
Substring Selection
Suppose tau = 3 and k = 1, with the strings abcdefghijk and abdefghk.
20
Introduction of Pass-Join-K
Substring Selection
So what we do is reduce the number of substrings.
For more pruning techniques, please read our paper: 《Pass-Join-K 多分段匹配的相似性连接算法》 (Pass-Join-K: a similarity join algorithm with multi-segment matching).
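As a rough illustration of substring selection, the sketch below probes each segment of one string against only those substrings of the other that have the same length and start within tau positions of the segment. It assumes the filter is the pigeonhole argument (at most tau edits can destroy at most tau of the tau + k segments, so at least k must survive); the tighter selection rules from the paper are not reproduced, and all function names are hypothetical.

```python
def segments(r: str, tau: int, k: int) -> list[tuple[int, str]]:
    """(start_offset, segment) pairs for r cut into tau + k parts, longer parts first.

    Assumes len(r) >= tau + k so that no segment is empty.
    """
    n = tau + k
    base, extra = divmod(len(r), n)
    out, pos = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        out.append((pos, r[pos:pos + length]))
        pos += length
    return out


def matched_segments(r: str, s: str, tau: int, k: int) -> list[int]:
    """1-based numbers of the segments of r that occur in s within a shift of tau."""
    hits = []
    for number, (pos, seg) in enumerate(segments(r, tau, k), start=1):
        lo = max(0, pos - tau)
        hi = min(len(s) - len(seg), pos + tau)
        if any(s[q:q + len(seg)] == seg for q in range(lo, hi + 1)):
            hits.append(number)
    return hits


def is_candidate(r: str, s: str, tau: int, k: int) -> bool:
    """Length filter first, then require at least k matching segments."""
    if abs(len(r) - len(s)) > tau:
        return False
    return len(matched_segments(r, s, tau, k)) >= k


if __name__ == "__main__":
    print(matched_segments("abcdefghijk", "abdefghk", tau=3, k=1))
    print(is_candidate("abcdefghijk", "abdefghk", tau=3, k=1))
```

For the example pair only the segment def is found, which is enough when k = 1. Restricting the probe to a window of start positions is what keeps the number of selected substrings small.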
21
Introduction of Pass-Join-K
Verification
DP (dynamic programming):
D(m, n) = min(D(m, n-1) + 1, D(m-1, n) + 1, D(m-1, n-1) + flag)
where flag = 0 when s_m = r_n and flag = 1 otherwise; s and r are both strings.
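The recurrence can be written out as a short sketch. This is the standard edit-distance dynamic program kept to two rows of the table; it does not include any threshold-based early termination that Pass-Join-K may use, and the function name is hypothetical.

```python
def edit_distance(s: str, r: str) -> int:
    """Edit distance between s and r using the recurrence D(m, n) over two rows."""
    m, n = len(s), len(r)
    prev = list(range(n + 1))          # D(0, j) = j: insert j characters
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # D(i, 0) = i: delete i characters
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            curr[j] = min(curr[j - 1] + 1,     # D(i, j-1) + 1
                          prev[j] + 1,         # D(i-1, j) + 1
                          prev[j - 1] + flag)  # D(i-1, j-1) + flag
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("abcdefghijk", "abdefghk"))
    print(edit_distance("abcdefg", "gfedcba"))
```

Each of the m * n cells is filled once, which is where the O(mn) verification cost comes from.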
22
Introduction of Pass-Join-K
Verification
Suppose tau = 3 and k = 1, with the string abcdefghijk and the matched segment def.
Tau left = 3; tau right = 3 - 3 = 0.
23
Outline: Background; The introduction of Pass-Join-K; Combining Pass-Join-K with Hadoop
24
Combining Pass-Join-K with Hadoop
Big data: a big file, or a large number of files.
25
Combining Pass-Join-K with Hadoop
Inverted index tree in Hadoop
For r = abcdefghijk (length 11), the index records are:
(abc, 1, 11, r, IFlag)
(def, 2, 11, r, IFlag)
(ghi, 3, 11, r, IFlag)
(jk, 4, 11, r, IFlag)
[figure: index tree for length L = 11, with segments 1 to 4 each pointing to r]
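A sketch of how such index records could be produced on the map side is shown below. It is plain Python rather than a Hadoop mapper, the function name is hypothetical, and it assumes tau = 3 and k = 1 with longer segments placed first, which matches the four records listed for abcdefghijk.

```python
def index_records(string_id: str, r: str, tau: int, k: int) -> list[tuple]:
    """Emit one (segment, segment_number, string_length, id, "IFlag") record per segment."""
    n = tau + k
    base, extra = divmod(len(r), n)
    records, pos = [], 0
    for number in range(1, n + 1):
        length = base + (1 if number <= extra else 0)
        records.append((r[pos:pos + length], number, len(r), string_id, "IFlag"))
        pos += length
    return records


if __name__ == "__main__":
    for record in index_records("r", "abcdefghijk", tau=3, k=1):
        print(record)
```

Carrying the segment number and the string length in every record lets a later stage match only segments that occupy the same slot in strings of the same length.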
26
Combining Pass-Join-K with Hadoop
Substrings in Hadoop
Suppose tau = 3, k = 1, and s = "abdefghk", so length(s) = 8. We have to generate records such as:
(a, 1, 5, s, SFlag), (a, 2, 6, s, SFlag), (a, 3, 7, s, SFlag), (ab, 1, 8, s, SFlag), ..., (ab, 1, 11, s, SFlag), ...
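The sketch below generates probe records for every target length from length(s) - tau to length(s) + tau, which is the range 5 to 11 seen in the example. The segment layout (longer segments first) and the shift window of tau positions are assumptions, and the exact selection rule behind the slide's records is not stated, so the output need not coincide record for record with the ones listed. Function names are hypothetical.

```python
def layout(length: int, n: int) -> list[tuple[int, int]]:
    """(start_offset, segment_length) of each of n segments in a string of `length` chars."""
    base, extra = divmod(length, n)
    out, pos = [], 0
    for i in range(n):
        seg_len = base + (1 if i < extra else 0)
        out.append((pos, seg_len))
        pos += seg_len
    return out


def probe_records(string_id: str, s: str, tau: int, k: int) -> list[tuple]:
    """Emit (substring, segment_number, target_length, id, "SFlag") probe records for s."""
    n = tau + k
    records = set()
    # Only strings whose length is within tau of len(s) can be similar to s.
    for target in range(max(n, len(s) - tau), len(s) + tau + 1):
        for number, (pos, seg_len) in enumerate(layout(target, n), start=1):
            lo = max(0, pos - tau)
            hi = min(len(s) - seg_len, pos + tau)
            for q in range(lo, hi + 1):
                records.add((s[q:q + seg_len], number, target, string_id, "SFlag"))
    return sorted(records)


if __name__ == "__main__":
    recs = probe_records("s", "abdefghk", tau=3, k=1)
    print(len(recs), ("def", 2, 11, "s", "SFlag") in recs)
```

A probe record shares its first three fields with an index record exactly when the substring could be that segment of an indexed string of the target length.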
27
Combining Pass-Join-K with Hadoop
Data flows in Hadoop [figure]
29
Combining Pass-Join-K with Hadoop
Record format: [segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID]
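To show how records in this shape could be joined, here is a small in-memory simulation of the shuffle and reduce steps. It is not Hadoop code: it assumes grouping on (segmentString, segmentNumber, stringLength), uses FLAG only to separate indexed strings from probing strings, and leaves out DirNumber because its meaning is not given in the slides. The function name and the sample SFlag records are illustrative.

```python
from collections import defaultdict


def reduce_side_join(records: list[tuple], k: int) -> list[tuple[str, str]]:
    """Group records by (segment, number, length) and return candidate (r_id, s_id) pairs."""
    # Shuffle: bring index (IFlag) and probe (SFlag) records with the same key together.
    groups = defaultdict(lambda: {"IFlag": set(), "SFlag": set()})
    for segment, number, length, string_id, flag in records:
        groups[(segment, number, length)][flag].add(string_id)

    # Reduce: every indexed id meets every probing id that shares the key.
    matched = defaultdict(set)
    for (_segment, number, _length), ids in groups.items():
        for r_id in ids["IFlag"]:
            for s_id in ids["SFlag"]:
                matched[(r_id, s_id)].add(number)

    # A pair is a candidate when at least k distinct segments matched.
    return sorted(pair for pair, numbers in matched.items() if len(numbers) >= k)


if __name__ == "__main__":
    records = [
        ("abc", 1, 11, "r", "IFlag"),
        ("def", 2, 11, "r", "IFlag"),
        ("ghi", 3, 11, "r", "IFlag"),
        ("jk", 4, 11, "r", "IFlag"),
        ("def", 2, 11, "s", "SFlag"),  # illustrative probe record
        ("ab", 1, 8, "s", "SFlag"),    # illustrative probe record
    ]
    print(reduce_side_join(records, k=1))
```

The candidate pairs that come out of this step would still need to be verified with the edit-distance dynamic program.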
30
Email: yhycai@gmail.com