Published by Karin Davis. Modified over 9 years ago.
1
Efficient Exact Set-Similarity Joins
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research
2
Sept. 15, 2006, Set-Similarity Joins

Data Cleaning

Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799
6
Data Cleaning (after cleaning)

Name          Street              City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL  SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM    STAMFORD   CT     06901
LOGISOFT      274 GOODMAN ST N    ROCHESTER  NY     14607
CIEDC         1800 5TH ST         LINCOLN    IL     92799
7
String Similarity Join

Reference Table (column CITY): ALABASTER, ALBERTVILLE, …, LINCOLN, …, YUCAIPA
Input table (column City): …, LINCONL, …
8
String Similarity (Self) Join

Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799
9
Strings → Sets [CGK ’06]

2-grams:
microsoft → { mi, ic, cr, ro, os, so, of, ft }
mcrosoft  → { mc, cr, ro, os, so, of, ft }

(edit distance ≤ 1) ----> (Δ ≤ 4)
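The tokenization on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `two_grams` is hypothetical and not taken from the talk.

```python
def two_grams(text: str) -> set[str]:
    """Return the set of contiguous 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


a = two_grams("microsoft")
b = two_grams("mcrosoft")
diff = a ^ b  # symmetric difference: 2-grams present in exactly one set

print(sorted(diff))  # ['ic', 'mc', 'mi']
print(len(diff))     # 3
```

Here the symmetric difference has size 3, which is within the Δ ≤ 4 bound stated for edit distance ≤ 1.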
10
Strings → Sets
[Diagram: String Sim Join between inputs R and S with predicate edit distance ≤ 1; example strings mcrosoft and microsoft]
11
Strings → Sets
[Diagram: inputs R and S (example strings mcrosoft and microsoft) → Tokenize → Set Sim Join (Δ ≤ 4) → Post-Process]
12
String → Set: Advantages
- Generalizes to many string similarity funcs
- Powerful primitive
- Sets ≈ Relations
  - Leverage relational data processing [CGK ’06]
13
Contributions
- New algorithms for set-similarity joins
  - Exact answers
  - Performance guarantees
- Outperform previous exact algorithms
  - Orders of magnitude
Exact answers are important for operators
14
Outline
- Introduction
- Algorithms
- Experiments
- Conclusion
15
[Diagram: two collections of 2-gram sets, labelled S and R]
First collection:
  { mi, ic, cr, ro, os, so, of, ft }
  { lo, og, gi, is, so, of, ft }
  { … }
  { bo, oe, ei, in, ng }
Second collection:
  { mc, cr, ro, os, so, of, ft }
  { lg, gi, is, so, of, ft }
  { … }
Join predicate: Intersection size ≥ 5
Matching pairs:
  { mc, cr, ro, os, so, of, ft } and { mi, ic, cr, ro, os, so, of, ft }
  { lg, gi, is, so, of, ft } and { lo, og, gi, is, so, of, ft }
21
[Diagram: inputs R = r_1, r_2, r_3, …, r_n and S = s_1, s_2, s_3, …, s_m; example matching pairs { mc, cr, ro, os, so, of, ft } / { mi, ic, cr, ro, os, so, of, ft } and { lg, gi, is, so, of, ft } / { lo, og, gi, is, so, of, ft }; non-matching set { bo, oe, ei, in, ng }]
Join predicate: Sim(r_i, s_j) ≥ θ
Annotation on the final build: Large
23
Set-Similarity Join: Symmetric Difference ≤ k
- Input:
  - R: r_1, r_2, …, r_n (n sets)
  - S: s_1, s_2, …, s_m (m sets)
- Output: All pairs (r_i, s_j) such that:
  - |r_i Δ s_j| ≤ k
Running example: k = 4
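As a reference point, the join defined above can be evaluated by comparing every pair. The sketch below is only a brute-force baseline, not one of the algorithms in the talk; the function names are hypothetical, and the strings "lgisoft" and "boeing" are reconstructed from the 2-gram sets shown on the earlier slides.

```python
def grams(text, n=2):
    """Set of contiguous n-character substrings (n-grams) of text."""
    return frozenset(text[i:i + n] for i in range(len(text) - n + 1))


def naive_join(R, S, k):
    """Compare every pair (r_i, s_j); the work grows with len(R) * len(S)."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if len(r ^ s) <= k]  # |r Δ s| <= k


R = [grams(w) for w in ("mcrosoft", "lgisoft")]
S = [grams(w) for w in ("microsoft", "logisoft", "boeing")]

pairs = naive_join(R, S, k=4)
print(pairs)  # [(0, 0), (1, 1)]
```

The quadratic number of comparisons is what a signature-based scheme tries to avoid when both inputs are large.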
24
Alternate Set Representation
s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }
[Figure: s drawn as marks at its elements' positions on an axis with ticks at 1, 25 and 50]
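One way to read this representation is as a bit vector over the domain 1..50, in which case the symmetric difference of two sets equals the Hamming distance of their vectors. The sketch below assumes that reading; the domain size of 50 is taken from the axis ticks, and the set `r` is a made-up example.

```python
def to_bit_vector(s, universe=50):
    """Position x (1-based) holds 1 if x is in s, else 0."""
    return [1 if x in s else 0 for x in range(1, universe + 1)]


def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 29, 35, 41, 46, 47}  # hypothetical second set

print(hamming(to_bit_vector(r), to_bit_vector(s)))  # 2
print(len(r ^ s))                                   # 2
```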
29
Enumeration
|r Δ s| ≤ 4
[Figure: sets r and s drawn on the axis; the positions where they differ are marked "Errors"; the axis is divided into partitions numbered 1 to 5]
33
Enumeration: Signature Generation
Sig(s) = { five signatures, each drawn as a piece of s }
After Hash32(): Sig(s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }
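A minimal sketch of this step, assuming each signature is the projection of s onto one of five equal-width partitions of the domain 1..50, hashed together with the partition number. `zlib.crc32` is used here only as a stand-in for the slide's Hash32(); the talk does not say which hash function is used, and the function names are hypothetical.

```python
import zlib


def partition_of(x, n_parts=5, universe=50):
    """0-based index of the equal-width partition containing element x (1-based)."""
    width = -(-universe // n_parts)  # ceiling division
    return (x - 1) // width


def signatures(s, n_parts=5, universe=50):
    """One 32-bit hash per partition, computed from (partition index, projection)."""
    sigs = set()
    for p in range(n_parts):
        proj = tuple(sorted(x for x in s if partition_of(x, n_parts, universe) == p))
        # Including p keeps equal projections in different partitions apart.
        sigs.add(zlib.crc32(repr((p, proj)).encode()))
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
print([hex(h) for h in sorted(signatures(s))])
```

Hashing keeps each signature at a fixed width, so signatures can be compared by plain equality.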
35
Property of Signatures
|r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ ∅
[Figure: sets r and s over partitions numbered 1 to 5]
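The property follows from a pigeonhole argument. The derivation below is a sketch under the assumption that the domain is split into five disjoint partitions P_1, …, P_5 and that Sig(s) contains one signature per partition, namely the pair (j, s ∩ P_j); that exact form of the signature is an assumption, not something the slide text spells out.

```latex
% Assumption: P_1, ..., P_5 are disjoint and cover the domain;
% Sig(s) = { (j, s \cap P_j) : j = 1, ..., 5 }.
\begin{align*}
|r \,\Delta\, s|
  &= \sum_{j=1}^{5} \bigl| (r \cap P_j) \,\Delta\, (s \cap P_j) \bigr| \le 4 \\
  &\Longrightarrow\; \exists\, j :\ (r \cap P_j) \,\Delta\, (s \cap P_j) = \emptyset \\
  &\Longrightarrow\; r \cap P_j = s \cap P_j \\
  &\Longrightarrow\; (j,\, r \cap P_j) \in \mathrm{Sig}(r) \cap \mathrm{Sig}(s) \\
  &\Longrightarrow\; \mathrm{Sig}(r) \cap \mathrm{Sig}(s) \neq \emptyset
\end{align*}
```

Five non-negative integers that sum to at most 4 cannot all be positive, which gives the second line.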
36
Enumeration: Algorithm
- Generate signatures for each r_i, s_j
- Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ ∅
- Output those satisfying |r_i Δ s_j| ≤ 4
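The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' implementation: sets contain integers from 1..50, signatures are unhashed (partition index, projection) pairs over k + 1 equal-width partitions, and candidate pairs are found through an in-memory index on signatures. All names are hypothetical.

```python
from collections import defaultdict


def signatures(s, n_parts=5, universe=50):
    """One (partition index, projection) pair per partition of the domain."""
    width = -(-universe // n_parts)  # ceiling division
    buckets = [[] for _ in range(n_parts)]
    for x in sorted(s):
        buckets[(x - 1) // width].append(x)
    return {(p, tuple(b)) for p, b in enumerate(buckets)}


def sig_join(R, S, k=4, universe=50):
    # Step 1: generate signatures for S and index them.
    index = defaultdict(list)
    for j, s in enumerate(S):
        for sig in signatures(s, k + 1, universe):
            index[sig].append(j)
    # Step 2: enumerate pairs that share at least one signature.
    candidates = {(i, j)
                  for i, r in enumerate(R)
                  for sig in signatures(r, k + 1, universe)
                  for j in index.get(sig, ())}
    # Step 3: post-filter, which removes false positive candidates.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)


R = [{4, 10, 13, 24, 29, 35, 41, 46, 48}, {1, 2, 3, 11, 12}]
S = [{4, 10, 13, 24, 29, 35, 41, 46, 47}, {1, 2, 3, 11, 12, 21, 22, 31, 32, 41}]
print(sig_join(R, S))  # [(0, 0)]
```

The pair (R[1], S[1]) shares two signatures but differs in five elements, so it is a candidate that the final check discards.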
37
Enumeration
[Diagram: sets r_1 … r_5 with Sig(r_1) … Sig(r_5) on one side and sets s_1 … s_5 with Sig(s_1) … Sig(s_5) on the other]
Sig(r_2) ∩ Sig(s_1) ≠ ∅
Labels on the final build: Output; False positive candidate pairs
40
[Query plan: R (Id, Elem) and S (Id, Elem) → Gen Signatures → R’ (Id, Sig) and S’ (Id, Sig) → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id]
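The plan on this slide maps naturally onto relational operators. The sketch below runs it on an in-memory SQLite database through Python's `sqlite3` module, purely as an illustration: the choice of SQLite, the table names `R_sig` and `S_sig` (standing in for R’ and S’), the string encoding of signatures, and the sample data are all assumptions.

```python
import sqlite3
from collections import defaultdict

K, N_PARTS, UNIVERSE = 4, 5, 50
WIDTH = -(-UNIVERSE // N_PARTS)  # ceiling division: elements per partition


def gen_signatures(rows):
    """(Id, Elem) rows -> (Id, Sig) rows, one signature per partition."""
    parts, ids = defaultdict(list), set()
    for set_id, elem in rows:
        ids.add(set_id)
        parts[(set_id, (elem - 1) // WIDTH)].append(elem)
    return [(i, f"{p}:{sorted(parts[(i, p)])}")
            for i in sorted(ids) for p in range(N_PARTS)]


def load_sets(con, table):
    sets = defaultdict(set)
    for set_id, elem in con.execute(f"SELECT Id, Elem FROM {table}"):
        sets[set_id].add(elem)
    return sets


r_sets = {1: {4, 10, 13, 24, 29, 35, 41, 46, 48}, 2: {1, 2, 3, 11, 12}}
s_sets = {1: {4, 10, 13, 24, 29, 35, 41, 46, 47},
          2: {1, 2, 3, 11, 12, 21, 22, 31, 32, 41}}

con = sqlite3.connect(":memory:")
for name, sets in (("R", r_sets), ("S", s_sets)):
    con.execute(f"CREATE TABLE {name} (Id INTEGER, Elem INTEGER)")
    con.executemany(f"INSERT INTO {name} VALUES (?, ?)",
                    [(i, x) for i, s in sets.items() for x in s])
    # Gen Signatures: R -> R_sig, S -> S_sig
    con.execute(f"CREATE TABLE {name}_sig (Id INTEGER, Sig TEXT)")
    rows = con.execute(f"SELECT Id, Elem FROM {name}").fetchall()
    con.executemany(f"INSERT INTO {name}_sig VALUES (?, ?)", gen_signatures(rows))

# Equi-join on Sig, then duplicate elimination (the δ operator) via DISTINCT.
candidates = con.execute(
    "SELECT DISTINCT R_sig.Id AS rid, S_sig.Id AS sid "
    "FROM R_sig JOIN S_sig ON R_sig.Sig = S_sig.Sig ORDER BY rid, sid"
).fetchall()

# Post-Process: check the actual predicate on each candidate pair.
R, S = load_sets(con, "R"), load_sets(con, "S")
result = [(i, j) for i, j in candidates if len(R[i] ^ S[j]) <= K]

print(candidates)  # [(1, 1), (2, 2)]
print(result)      # [(1, 1)]
```

Only signature generation and post-processing need custom code here; the candidate enumeration is an ordinary equi-join followed by duplicate elimination.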
41
No False Positive Candidate Pair
|r Δ s| = 5
[Figure: sets r and s over partitions numbered 1 to 5]
42
False Positive Candidate Pair
|r Δ s| = 5
[Figure: sets s_1 and s_2 over partitions numbered 1 to 5]
43
Enumeration: Performance
k = 4
[Plot: enumeration performance, with a reference labelled "Ideal Performance"]
45
Enumeration
|r Δ s| ≤ 4
[Figure: sets r and s drawn on the axis, now divided into partitions numbered 1 to 6]
47
Enumeration: Signature Generation
[Figure: set s_1 over partitions numbered 1 to 6]
$\binom{6}{2} = 15$
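The count of 15 matches choosing 2 of the 6 partitions. The sketch below assumes that each signature is the projection of the set onto one such pair of partitions (in general, onto n − k of n partitions), which guarantees a shared signature whenever at most k partitions contain a difference. The equal-width partitioning of 1..50 and all names are assumptions made for illustration.

```python
from itertools import combinations


def enum_signatures(s, k=4, n_parts=6, universe=50):
    """One signature per choice of n_parts - k partitions."""
    width = -(-universe // n_parts)  # ceiling division
    sigs = set()
    for chosen in combinations(range(n_parts), n_parts - k):
        proj = tuple(sorted(x for x in s if (x - 1) // width in chosen))
        sigs.add((chosen, proj))
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = s ^ {4, 5, 30, 47}  # a hypothetical set with |r Δ s| = 4

print(len(enum_signatures(s)))                       # 15
print(bool(enum_signatures(r) & enum_signatures(s)))  # True
```

Compared with one signature per partition, each signature now covers more of the set, so unrelated sets are less likely to agree on one, at the price of more signatures per set.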
52
Algorithm
- Generate signatures for each r_i, s_j
- Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ ∅
- Output those satisfying |r_i Δ s_j| ≤ 4
Only the signature function changes
53
Enumeration: Performance
k = 4
[Plot]
54
False Positive Candidate Pair
|r Δ s| = 5
[Figure: sets r and s over partitions numbered 1 to 6]
55
Enumeration: Performance
k = 4
[Plot: data labels 5, 15, 35, 4845]
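The four labels coincide with the binomial coefficients C(n, n − 4) for n = 5, 6, 7 and 20. The check below assumes they are signature counts per set for those numbers of partitions; the extracted slide text does not say so explicitly.

```python
from math import comb

k = 4
# Signatures per set if each signature uses n - k of n partitions.
counts = {n: comb(n, n - k) for n in (5, 6, 7, 20)}
print(counts)  # {5: 5, 6: 15, 7: 35, 20: 4845}
```

Under that reading, adding partitions makes each signature more selective, but the number of signatures per set grows quickly.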
57
PartEnum: Divide and Conquer
k = 4
[Figure: set s_1 divided into two partitions, numbered 1 and 2, with thresholds k_1 = 2 and k_2 = 1]
Generate signatures using Enumeration
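A rough sketch of the divide-and-conquer idea, under several assumptions that are not on the slide: the domain 1..50 is cut into two halves, the first half gets threshold 2 and the second half threshold 1, and within each half the enumeration scheme uses n2 = 4 sub-partitions. If |r Δ s| ≤ 4, then either the first half has at most 2 differences or the second half has at most 1 (otherwise there would be at least 3 + 2 = 5), so enumeration inside that half yields a shared signature. All names are hypothetical.

```python
from itertools import combinations


def enum_sigs(elems, k, lo, hi, n_parts):
    """Enumeration restricted to the range [lo, hi]: one signature per
    choice of n_parts - k sub-partitions."""
    width = -(-(hi - lo + 1) // n_parts)  # ceiling division
    sigs = set()
    for chosen in combinations(range(n_parts), n_parts - k):
        proj = tuple(sorted(x for x in elems if (x - lo) // width in chosen))
        sigs.add((lo, chosen, proj))  # lo identifies the first-level partition
    return sigs


def partenum_sigs(s, universe=50, n2=4):
    half = universe // 2
    first = {x for x in s if x <= half}
    second = set(s) - first
    return (enum_sigs(first, 2, 1, half, n2)                # threshold 2
            | enum_sigs(second, 1, half + 1, universe, n2))  # threshold 1


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
print(len(partenum_sigs(s)))  # 10
```

With these hypothetical parameters each set gets C(4, 2) + C(4, 1) = 10 signatures.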
58
PartEnum: Asymptotic Performance
Theorem: There is an instance of PartEnum such that:
- If |r Δ s| > 7.5k, then r and s do not share a signature with probability 1 − o(1)
- The number of signatures per set is O(k^2)
59
PartEnum: Summary
- Set-similarity joins with predicate |r Δ s| ≤ k
- Theoretical guarantees
- First exact algorithm
60
Other results
- PartEnum extensions:
  - Larger class of set-similarity join predicates
  - Jaccard
  - Basic idea: reduce to symmetric set difference
- WtEnum class of signature functions:
  - Use frequency of elements
  - Weighted set-similarity joins
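To illustrate how a Jaccard predicate can be related to symmetric difference, the sketch below uses a standard set identity: for non-empty sets, Jaccard(r, s) ≥ γ holds exactly when |r Δ s| ≤ (1 − γ)/(1 + γ) · (|r| + |s|). This is generic set algebra offered as an illustration, not necessarily the reduction the authors use; the function names are hypothetical.

```python
from fractions import Fraction


def jaccard(r, s):
    """|r ∩ s| / |r ∪ s| as an exact fraction (r ∪ s must be non-empty)."""
    return Fraction(len(r & s), len(r | s))


def max_sym_diff(r, s, gamma):
    """Largest |r Δ s| compatible with Jaccard(r, s) >= gamma, given the set sizes."""
    return (1 - gamma) / (1 + gamma) * (len(r) + len(s))


gamma = Fraction(4, 5)  # Jaccard threshold 0.8
r = set(range(10))
s = set(range(1, 11))

print(jaccard(r, s))              # 9/11
print(max_sym_diff(r, s, gamma))  # 20/9
print(len(r ^ s))                 # 2
```

Because the bound depends on |r| + |s|, a reduction along these lines needs the set sizes to be known or bounded.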
61
Outline
- Introduction
- Algorithms
- Experiments
- Conclusion
62
Implementation
[Query plan: R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id]
Stage labels on the slide: DBMS; Client + DBMS; DBMS; Client
63
Previous Work
- Prefix Filtering [CGK ’06]
  - Exact
- Locality Sensitive Hashing [IM ’98]
  - Approximate
  - False negative rate: 5%
64
Data Sets
- Organization addresses [MS Sales]
  - Concatenation: Org name, street, city, zip
  - Input size: 1 million
  - Avg. length: 11 words, 58 chars
- Tokenization: Words, n-grams
65
Jaccard, 1M, MS Sales
[Plot: axis values 0.8, 0.85, 0.9]
66
Evaluation
[Query plan: R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id]
Labels on the slide: DBMS; DBMS; Intermediate Result size; Client + DBMS; Client
67
Jaccard, 1M, MS Sales
[Plot: axis values 0.8, 0.85, 0.9]
68
Jaccard, Synthetic
[Plot]
69
Similar Results for …
- Other data sets
  - DBLP, Synthetic data sets
- Other similarity functions
  - Weighted Jaccard
  - Edit distance
70
Conclusion
- New algorithms for set-similarity joins
  - Exact
  - Performance guarantees
- Outperform previous exact algorithms
Search: “data cleaning project”